Will eventually provide compact, extensible, modular and performant histogramming functionality.
File-based I/O can be achieved through the standard Java built-in serialization mechanism. All classes implement the {@link java.io.Serializable} interface. However, the toolkit is entirely decoupled from advanced I/O and visualisation techniques. It provides data structures and algorithms only.
This toolkit borrows many fundamental concepts and terminology from the CERN HTL package (C++) largely written by Savrak Sar. HTL is the successor to the CERN HistOOgrams package (C++). Even this manual considerably borrows from HTL's Users Guide.
The definition of an abstract histogram interface allows functionality that is provided by external packages, such as plotting or fitting, to be decoupled from the actual implementation of the histogram. This feature paves the way for co-existence of different histogram packages that conform to the abstract interface.
FixedPartition partition = new FixedPartition(10, 0.0, 100.0); |
double[] boundaries = { 0.0, 10.0, 40.0, 49.0, 50.0, 51.0, 60.0, 100.0 }; VariablePartition partition = new VariablePartition(boundaries); |
FixedPartition partition = new FixedPartition(2, 0.0, 20.0); ... partition.{@link hep.aida.bin.AbstractPartition#noOfBins noOfBins()}; // Obtain the number of bins, partition.{@link hep.aida.bin.AbstractPartition#min min(i)}; // and the lower bound of bin i partition.{@link hep.aida.bin.AbstractPartition#binWidth binWidth(i)}; // and its width partition.{@link hep.aida.bin.AbstractPartition#max max(i)}; // and its upper bound double point = 1.23; int binIndex = partition.{@link hep.aida.bin.AbstractPartition#binOf binOf(point)}; // Obtain index of bin the point falls into (maps to) |
In this package, a histogram delegates to its partitions the task of locating a bin. In other words, information about the lower and upper bounds of a bin or the width of a given bin are obtained from the corresponding partition. This is shown in the following code fragment, which demonstrates how the lower and upper bound and width of a given bin can be obtained.
Histo1D histo = new Histo1D("Histo1D", 10, 0.0, 100.0 ); ... histo.{@link hep.aida.bin.Grid1D#getPartition getPartition()}.noOfBins() // Obtain the number of bins histo.getPartition().min(i) // and the lower bound of bin i histo.getPartition().binWidth(i) // and its width histo.getPartition().max(i) // and its upper bound |
For example if a partition is defined to be new FixedPartition(2, 0.0, 20.0), implicitly 2 outlier ranges are defined, one for underflow and one for overflow. Thus the partition has 4 bins instead of 2 bins, as might be expected: partition.noOfBins()==4. Its boundaries are [Double.NEGATIVE_INFINITY,0.0), [0.0, 10.0), [10.0, 20.0), [20.0, Double.POSITIVE_INFINITY]. As a consequence point -5.0 maps to bin index 0, point 5.0 maps to bin index 1, 15.0 maps to bin index 2 and 25.0 maps to bin index 3. The partition is said to consist of 4 bins, comprising 2 domain bins and 2 outlier bins.
As a further example, consider the case of a partition implicitly defining 1 outlier range for underflow: new VariablePartition(new double[] { 10.0, 20.0, Double.POSITIVE_INFINITY }). The partition has 3 bins instead of 2 bins, as might be expected: partition.noOfBins()==3. Its boundaries are [Double.NEGATIVE_INFINITY,10.0), [10.0, 20.0), [20.0, Double.POSITIVE_INFINITY]. Point 5.0 maps to bin index 0, point 15.0 maps to bin index 1 and 25.0 maps to bin index 2. The partition is said to consist of 3 bins, comprising 1 domain bin and 2 outlier bins.
As can be seen, underflow outlier bins, if present, always have an index of 0. Overflow outlier bins, if present, always have an index of partition.noOfBins()-1. A partition can be asked whether it has underflow or overflow outlier bins:
partition.hasUnderflowBin(); partition.hasOverflowBin(); partition.isUnderflowBin(i); partition.isOverflowBin(i); partition.isOutlierBin(i); |
Bins themselves contain information about the data filled into them. They can be asked for various descriptive statistical measures, such as the minimum, maximum, size, mean, rms, variance, etc.
Note that bins (of any kind) only know about their contents. They do not know where they are are located in the histogram to which they belong, nor about their widths or bounds - this information is stored in the partition to which they belong, which also defines the bin layout within a histogram.
Bins come in two flavours: Dynamic and Static. Dynamic bins preserve
all the values filled into them and can return these exact values, when asked
to do so. They are rebinnable.
Static bins do not preserve the values filled into them. They merely collect
basic statistics incrementally while they are being filled. They immediately
forget about the filled values and keep only the derived statistics. They are
not rebinnable.
The data filled into static bins is not preserved. As a consequence infinitely many elements can be added to such bins. As a further consequence such bins cannot compute more than basic statistics. They are also not rebinnable. If these drawbacks matter, consider to use a {@link hep.aida.bin.DynamicBin1D}, which overcomes them at the expense of increased memory requirements.
The data filled into dynamic bins is fully preserved. Technically speaking, they are k-dimensional multisets (or bags) with efficient statistics operations defined upon. As a consequence such bins can compute more than only basic statistics. They are also rebinnable. On the other hand side, if many elements are filled into them, one may quickly run out of memory (each double element takes 8 bytes). If these drawbacks matter, consider to use a {@link hep.aida.bin.StaticBin1D}, which overcomes them at the expense of limited functionality.
Static 1-dimensional bins currently offered are:
{@link hep.aida.bin.StaticBin1D}, {@link hep.aida.bin.MightyStaticBin1D} and {@link hep.aida.bin.QuantileBin1D}.
Dynamic 1-dimensional bins currently offered are:
{@link hep.aida.bin.DynamicBin1D}.
2-dimensional (or higher dimensional) bins consist of a set of 1-dimensional bins and joint statistics. Again, 2-dimensional (or higher dimensional) bins can be static or dynamic, depending on whether their contained 1-dimensional bins are static or dynamic.
2-dimensional bins currently offered are: {@link hep.aida.bin.WeightedBin2D} and {@link hep.aida.bin.GravityBin2D}.
Currently the following grids are offered: {@link hep.aida.bin.Grid1D} and {@link hep.aida.bin.Grid2D}, all derived from a common abstract base class {@link hep.aida.bin.AbstractGrid}.
A grid contains domain bins (consuming points falling in the defined domain), which are usually entirelly surrounded by outlier bins, consuming outliers. A grid can additionally contain another special bin, the universe bin, consuming all points regardless of their particular coordinates.
Grid2D = new Grid2D(new FixedPartition(2, 0.0, 100.0), new FixedPartition(2, 0.0, 100.0), ...); y ^ d ... domain bin (in-range bin), o ... outlier bin | +inf | | o | o | o | o 100 - --------------- | o | d | d | o --> domain == [0,100]2 | --------------- --> universe == [-infinity,+infinity]2 | o | d | d | o --> outlier space == universe - domain 0 - --------------- | o | o | o | o -inf| -----|-------|------> x -inf 0 100 +inf |
Bin(3,3) with a domain of [100,+inf], [100,+inf] is an outlier bin, which is a space within the universe minus the space defined upon partition construction. Outlier bins consume outlier points.
Remark: Usually, in a 1-dimensional grid there are n domain bins and additionally 2 outlier bins (underflow, overflow). Usually, in a 2-dimensional grid there are n*m domain bins and additionally (n+2)*(m+2) - n*m outlier bins.
Large scale data analysis and simulation applications may require a long time to finish. Yet, it is desirable to view intermediate histogramming results long before the last data element has been filled. Thus, filling and viewing is done concurrently.
For performance reasons, the same applications may exploit multiprocessor environments by splitting a problem into subproblems, assigning each subproblem to a processor (thread). In such a case one shared histogram is often filled concurrently by multiple threads.
This toolkit is fully thread safe. All public methods are synchronized. Thus, one can have one or more threads filling a shared histogram, grid, bin, etc. as well as one or more threads reading and viewing shared statistics at the same time, in particular, while filling.
Thread safety introduces no additional complexity for the user. If a histogram, grid, bin is used by a thread it is automatically locked for other threads. Those other threads are automatically queued and automatically resume work upon lock release. Thus, a user need not use special tricks or obey complicated protocols. Thread safety is built in and comes for free. Here is a snippet demonstrating how to use 3 plain normal Java threads concurrently filling a shared histogram:
Grid1D grid = new Grid1D(new FixedPartition(2, 0.0, 100.0), new WeightedBin2D()); Histo1D histo = new Histo1D("my histo", grid, null); // construct and start 3 writer threads filling the histogram Writer a = new Writer(histo); Writer b = new Writer(histo); Writer c = new Writer(histo); a.start(); b.start(); c.start(); // wait until all writers are finished try { a.join(); } catch (InterruptedException exc) {} try { b.join(); } catch (InterruptedException exc) {} try { c.join(); } catch (InterruptedException exc) {} // display results System.out.println(histo); System.exit(0); // now the definition of a filling thread... class Writer extends {@link java.lang.Thread Thread} { private Histo1D histo; public Writer(Histo1D histo) { // constructor this.histo = histo; } public void run() { // automatically called upon thread start for (int i=0; i<1000000; i++) { histo.add(i,i*i); } } } |
... // want to take a snapshot. double variance, rms; int size; synchronized (histo) { // locking anybody else out; entering atomic block variance = histo.variance(); size = histo.size(); rms = histo.rms(); } // finished taking a snapshot; other threads may again work upon histo. System.out.println("var="+variance+", size="+size+", rms="+rms); |
Note that it is not required to use buffered filling. It is merely provided as an optional facility to boost performance.
Buffered filling helps to keep a design flexible, modular and extensible while still providing high performance. It is, for example, used in I/O subsystems. The Java I/O subsystem classes {@link java.io.BufferedInputStream} and {@link java.io.BufferedOutputStream} are built around the same pattern.
In buffered filling, data is piecewise added to a fixed medium to large sized container and upon overflow of the container automatically transferred (flushed) as a whole to a target consumer object. Filling data in chunks via addAllOf methods rather than piecewise via add methods results in very few method invocations despite complex and deeply layered call chains. Additionally cache locality is greatly improved, which increasingly matters given that CPUs and their on-board caches are getting quicker at a much greater rate than normal RAM chips and buses. Further, organizational tasks carried out by methods such as validity checks, preprocessing, etc. are minimized.
In this toolkit, buffered filling is conveniently supported by the tuned classes {@link cern.colt.buffer.DoubleBuffer}, {@link cern.colt.buffer.DoubleBuffer2D} and {@link cern.colt.buffer.DoubleBuffer3D}. Simply connect a buffer to a histogram, grid or bin. Now instead of filling data via histogram.add(...) fill it to the buffer via buffer.add(...). The buffer will take care of the rest. Just remember one important thing: Always flush a buffer when finished with filling! If one forgets to call buffer.flush() in the end, some elements may remain in the buffer and never appear in the histogram! The following code snipped demonstrates proper usage:
Histo1D histo = new Histo1D("my histo",new FixedPartition(2, 0.0, 100.0)); DoubleBuffer2D buffer = new DoubleBuffer2D(histo,10000); // buffer capacity = 10000 points. for (int i=1000000; --i >= 0; ) { buffer.{@link cern.colt.buffer.DoubleBuffer2D#add add(i,i)}; // could also use hist.add(i,i), but would be inefficient } buffer.{@link cern.colt.buffer.DoubleBuffer2D#flush flush()}; System.out.println(histo); |
This package is fully functional without buffers. However, it is designed such that it delivers high performance only when using buffered filling. Try and use buffers, if performance matters!
When using multiple concurrent writer threads together with buffered filling, each writer must have its own buffer. Why? While histograms, grids, bins etc. are fully thread safe (synchronized), buffers are not. Thus, if all threads would share the same buffer and write to it at the same time without using any mutual exclusion protocol, they could corrupt the buffer. Here is a snippet demonstrating proper usage:
Grid1D grid = new Grid1D(new FixedPartition(2, 0.0, 100.0), new WeightedBin2D()); Histo1D histo = new Histo1D("my histo", grid, null); // construct and start 3 writer threads filling the histogram Writer a = new Writer(histo); Writer b = new Writer(histo); Writer c = new Writer(histo); a.start(); b.start(); c.start(); // wait until all writers are finished try { a.join(); } catch (InterruptedException exc) {} try { b.join(); } catch (InterruptedException exc) {} try { c.join(); } catch (InterruptedException exc) {} // display results System.out.println(histo); System.exit(0); // now the definition of a filling thread... class Writer extends {@link java.lang.Thread Thread} { private DoubleBuffer2D buffer; // my own private buffer, no other thread can see and touch it. public Writer(Histo1D histo) { // constructor this.buffer = new DoubleBuffer2D(histo,10000); } public void run() { // automatically called upon thread start for (int i=0; i<1000000; i++) { buffer.add(i,i*i); } buffer.flush(); } } |
TODO
In case not each and every statistics measure needed is directly provided by methods of histograms, grids or bins one can use dynamic bins and retrieve their filled elements. From these elements, one can compute whatever necessary, either by using a statistics library or self written functions. Use methods {@link hep.aida.bin.Histo#getGrid}, {@link hep.aida.bin.Grid#getBin}, {@link hep.aida.bin.DynamicBin1D#elements()} and the descriptive statistics library {@link cern.jet.stat.Descpritive}.
TODO