Status: Under construction; Do not use. This package is not even alpha quality! The only thing rock solid and stable is the inheritance tree rooted in {@link AbstractBin1D}. All the rest is subject to drastic change. This package is only contained in the distribution to stimulate feedback.

Will eventually provide compact, extensible, modular and performant histogramming functionality.

Getting Started

1. Overview

Histo itself offers the histogramming features of HTL and HBOOK, the de-facto standard for histogramming for many years. It also offers a number of useful extensions, with an object-oriented approach. These features include the following:

File-based I/O can be achieved through the standard Java built-in serialization mechanism. All classes implement the {@link java.io.Serializable} interface. However, the toolkit is entirely decoupled from advanced I/O and visualisation techniques. It provides data structures and algorithms only.

This toolkit borrows many fundamental concepts and terminology from the CERN HTL package (C++) largely written by Savrak Sar. HTL is the successor to the CERN HistOOgrams package (C++). Even this manual considerably borrows from HTL's Users Guide.

The definition of an abstract histogram interface allows functionality that is provided by external packages, such as plotting or fitting, to be decoupled from the actual implementation of the histogram. This feature paves the way for co-existence of different histogram packages that conform to the abstract interface.

2. Histo at a glance

TODO

3. Partitions, Bins and Grids

3.1. Partitions

A partition describes how one dimension of the problem space is divided into intervals. Consider the case of a 10 bin histogram in the range [0,100]. A partition object containing the number of bins and the interval limits will describe completely how we divide such an interval: a set of 10 sub-intervals of equal width. This is termed a {@link hep.aida.bin.FixedPartition} and can be constructed as follows
FixedPartition partition = new FixedPartition(10, 0.0, 100.0); 
It may be required to work with an histogram over the same range as the example above, but with bins of variable widths. In this case, a partition containing the number of bins, the lower limit of each sub-interval and the upper limit of the last sub-interval will describe completely how the interval [0,100] is divided. Such a partition is termed a {@link hep.aida.bin.VariablePartition} and can be constructed as follows
double[] boundaries = { 0.0, 10.0, 40.0, 49.0, 50.0, 51.0, 60.0, 100.0 };
VariablePartition partition = new VariablePartition(boundaries); 
An n-dimensional histogram thus contains n partitions, one for each axis. The only concern of a partition is to associate any ordered 1D space with a discrete numbered space. Thus it associates an interval to an integer. Hence, a partition knows about the width of the intervals and their lower point/bound or upper point/bound. A partition can be asked for such information as follows:
FixedPartition partition = new FixedPartition(2, 0.0, 20.0);
... 
partition.{@link hep.aida.bin.AbstractPartition#noOfBins noOfBins()};    // Obtain the number of bins, 
partition.{@link hep.aida.bin.AbstractPartition#min min(i)};        // and the lower bound of bin i
partition.{@link hep.aida.bin.AbstractPartition#binWidth binWidth(i)};   // and its width
partition.{@link hep.aida.bin.AbstractPartition#max max(i)};        // and its upper bound
double point = 1.23;
int binIndex = partition.{@link hep.aida.bin.AbstractPartition#binOf binOf(point)}; // Obtain index of bin the point falls into (maps to)

In this package, a histogram delegates to its partitions the task of locating a bin. In other words, information about the lower and upper bounds of a bin or the width of a given bin are obtained from the corresponding partition. This is shown in the following code fragment, which demonstrates how the lower and upper bound and width of a given bin can be obtained.
Histo1D histo = new Histo1D("Histo1D", 10, 0.0, 100.0 ); 
... 
histo.{@link hep.aida.bin.Grid1D#getPartition getPartition()}.noOfBins()    // Obtain the number of bins
histo.getPartition().min(i)        // and the lower bound of bin i
histo.getPartition().binWidth(i)   // and its width
histo.getPartition().max(i)        // and its upper bound
A partition always sucessfully maps any arbitrary point drawn from the universe [-infinity,+infinity] to a bin index. If a partition, as in the previous examples, is constructed in a subrange of the universe, it implicitly defines one or two additional ranges for outliers.

For example if a partition is defined to be new FixedPartition(2, 0.0, 20.0), implicitly 2 outlier ranges are defined, one for underflow and one for overflow. Thus the partition has 4 bins instead of 2 bins, as might be expected: partition.noOfBins()==4. Its boundaries are [Double.NEGATIVE_INFINITY,0.0), [0.0, 10.0), [10.0, 20.0), [20.0, Double.POSITIVE_INFINITY]. As a consequence point -5.0 maps to bin index 0, point 5.0 maps to bin index 1, 15.0 maps to bin index 2 and 25.0 maps to bin index 3. The partition is said to consist of 4 bins, comprising 2 domain bins and 2 outlier bins.

As a further example, consider the case of a partition implicitly defining 1 outlier range for underflow: new VariablePartition(new double[] { 10.0, 20.0, Double.POSITIVE_INFINITY }). The partition has 3 bins instead of 2 bins, as might be expected: partition.noOfBins()==3. Its boundaries are [Double.NEGATIVE_INFINITY,10.0), [10.0, 20.0), [20.0, Double.POSITIVE_INFINITY]. Point 5.0 maps to bin index 0, point 15.0 maps to bin index 1 and 25.0 maps to bin index 2. The partition is said to consist of 3 bins, comprising 1 domain bin and 2 outlier bins.

As can be seen, underflow outlier bins, if present, always have an index of 0. Overflow outlier bins, if present, always have an index of partition.noOfBins()-1. A partition can be asked whether it has underflow or overflow outlier bins:
partition.hasUnderflowBin();
partition.hasOverflowBin();

partition.isUnderflowBin(i);
partition.isOverflowBin(i);
partition.isOutlierBin(i);

3.2. Bins

Bins themselves contain information about the data filled into them. They can be asked for various descriptive statistical measures, such as the minimum, maximum, size, mean, rms, variance, etc.

Note that bins (of any kind) only know about their contents. They do not know where they are are located in the histogram to which they belong, nor about their widths or bounds - this information is stored in the partition to which they belong, which also defines the bin layout within a histogram.

 

Static vs. Dynamic bins and their trade-offs

Bins come in two flavours: Dynamic and Static. Dynamic bins preserve all the values filled into them and can return these exact values, when asked to do so. They are rebinnable.
Static bins do not preserve the values filled into them. They merely collect basic statistics incrementally while they are being filled. They immediately forget about the filled values and keep only the derived statistics. They are not rebinnable.

The data filled into static bins is not preserved. As a consequence infinitely many elements can be added to such bins. As a further consequence such bins cannot compute more than basic statistics. They are also not rebinnable. If these drawbacks matter, consider to use a {@link hep.aida.bin.DynamicBin1D}, which overcomes them at the expense of increased memory requirements.

The data filled into dynamic bins is fully preserved. Technically speaking, they are k-dimensional multisets (or bags) with efficient statistics operations defined upon. As a consequence such bins can compute more than only basic statistics. They are also rebinnable. On the other hand side, if many elements are filled into them, one may quickly run out of memory (each double element takes 8 bytes). If these drawbacks matter, consider to use a {@link hep.aida.bin.StaticBin1D}, which overcomes them at the expense of limited functionality.

Available bin types

All bins are derived from a common abstract base class {@link hep.aida.bin.AbstractBin}. This base class is extended by {@link hep.aida.bin.AbstractBin1D}, the common abstract base class for 1-dimensional bins. It is also extended by {@link hep.aida.bin.AbstractBin2D}, the common abstract base class for 2-dimensional bins.

Static 1-dimensional bins currently offered are: {@link hep.aida.bin.StaticBin1D}, {@link hep.aida.bin.MightyStaticBin1D} and {@link hep.aida.bin.QuantileBin1D}.
Dynamic 1-dimensional bins currently offered are: {@link hep.aida.bin.DynamicBin1D}.

2-dimensional (or higher dimensional) bins consist of a set of 1-dimensional bins and joint statistics. Again, 2-dimensional (or higher dimensional) bins can be static or dynamic, depending on whether their contained 1-dimensional bins are static or dynamic.

2-dimensional bins currently offered are: {@link hep.aida.bin.WeightedBin2D} and {@link hep.aida.bin.GravityBin2D}.

3.3. Grids

With grids one can simultaneously determine statistics over subdomains (bins) and the entire domain (universe). They are a combination of partitions and k-dimensional matrices of bins, mapping continous points to bins. Grids can be 1,2,3 or n-dimensional. A 1-dimensional grid has one partition and a list of 1-dimensional bins. A 2-dimensional grid has two partitions (one for each axis) and a 2-dimensional matrix of 2-dimensional bins. A n-dimensional grid has n partitions and a n-dimensional matrix of n-dimensional bins. It maps n-dimensional points to n-dimensional bins.

Currently the following grids are offered: {@link hep.aida.bin.Grid1D} and {@link hep.aida.bin.Grid2D}, all derived from a common abstract base class {@link hep.aida.bin.AbstractGrid}.

A grid contains domain bins (consuming points falling in the defined domain), which are usually entirelly surrounded by outlier bins, consuming outliers. A grid can additionally contain another special bin, the universe bin, consuming all points regardless of their particular coordinates.

Domain bins and Outlier bins

As already desribed, a partition usually maps underflow (left,bottom) and overflow (right,top) outliers to bin indexes 0 and noOfBins-1, respectively. A grid contains domain bins (consuming points falling in the defined domain), which are usually entirelly surrounded by outlier bins, consuming outliers. Here is an example {@link hep.aida.bin.Grid2D} using two partitions, one for the x-axis and another for the y-axis:
 Grid2D = new Grid2D(new FixedPartition(2, 0.0, 100.0), new FixedPartition(2, 0.0, 100.0), ...);

	   y ^                           d ... domain bin (in-range bin), o ... outlier bin
	     |                           
	+inf |                           
	     |   o | o | o | o           
	 100 -  ---------------
	     |   o | d | d | o          --> domain == [0,100]2
	     |  ---------------         --> universe == [-infinity,+infinity]2
	     |   o | d | d | o          --> outlier space == universe - domain
	   0 -  ---------------         
	     |   o | o | o | o          
	 -inf|  
	      -----|-------|------> x
	      -inf 0      100   +inf
Bin(1,1) with a domain of [0,50), [0,50) is a domain bin (or in-range bin), which is a space defined upon partition construction. Domain bins consume domain points.

Bin(3,3) with a domain of [100,+inf], [100,+inf] is an outlier bin, which is a space within the universe minus the space defined upon partition construction. Outlier bins consume outlier points.

Remark: Usually, in a 1-dimensional grid there are n domain bins and additionally 2 outlier bins (underflow, overflow). Usually, in a 2-dimensional grid there are n*m domain bins and additionally (n+2)*(m+2) - n*m outlier bins.

Universe

A grid can contain a special bin, the universe bin. The universe bin consumes all points regardless of their particular coordinates. It therefore provides aggregate statistical information over the entire data sample. It can be seen as a histogram with one bin only, with the full domain of [-infinity,+infinity]. Thus, with grids one can simultaneously determine statistics over subdomains (bins) and the entire domain (universe).

4. Histograms

TODO

5. Operations on histograms

TODO

Advanced Histogramming

TODO

1. Multiple concurrent writers and readers - Thread safety

This toolkit explicitly supports the requirements of online monitoring, large scale data analysis and simulation applications. Online monitoring applications are never done; they conceptually fill an infinite number of entries into histograms. Yet, they need to present statistics measures of the data collected so far. Thus, filling and viewing of histograms is done concurrently.

Large scale data analysis and simulation applications may require a long time to finish. Yet, it is desirable to view intermediate histogramming results long before the last data element has been filled. Thus, filling and viewing is done concurrently.

For performance reasons, the same applications may exploit multiprocessor environments by splitting a problem into subproblems, assigning each subproblem to a processor (thread). In such a case one shared histogram is often filled concurrently by multiple threads.

This toolkit is fully thread safe. All public methods are synchronized. Thus, one can have one or more threads filling a shared histogram, grid, bin, etc. as well as one or more threads reading and viewing shared statistics at the same time, in particular, while filling.

Thread safety introduces no additional complexity for the user. If a histogram, grid, bin is used by a thread it is automatically locked for other threads. Those other threads are automatically queued and automatically resume work upon lock release. Thus, a user need not use special tricks or obey complicated protocols. Thread safety is built in and comes for free. Here is a snippet demonstrating how to use 3 plain normal Java threads concurrently filling a shared histogram:
Grid1D grid = new Grid1D(new FixedPartition(2, 0.0, 100.0), new WeightedBin2D());
Histo1D histo = new Histo1D("my histo", grid, null);

// construct and start 3 writer threads filling the histogram
Writer a = new Writer(histo);
Writer b = new Writer(histo);
Writer c = new Writer(histo);
a.start(); b.start(); c.start();

// wait until all writers are finished
try { a.join(); } catch (InterruptedException exc) {}
try { b.join(); } catch (InterruptedException exc) {} 
try { c.join(); } catch (InterruptedException exc) {} 

// display results
System.out.println(histo);
System.exit(0);


// now the definition of a filling thread...
class Writer extends {@link java.lang.Thread Thread} {
	private Histo1D histo;
	
	public Writer(Histo1D histo) { // constructor
		this.histo = histo;
	}
	public void run() { // automatically called upon thread start
		for (int i=0; i<1000000; i++) {
			histo.add(i,i*i);
		}
	}
}

Snapshots

Often, a viewer thread might want to read off a set of statistics measures in a consistent way (taking a snapshot), while other threads are filling at the same time. Even though the toolkit is synchronized and guarantees that during the time one thread executes a public method of a histogram, grid or bin no other thread can possibly enter any method of that object, it is not sufficient to simply read off one measure after the other, because between reading the first measure A and starting to read the second measure B, another thread might get a turn and add new values to the histogram. As a consequence measure A conceptually reflects another dataset than measure B. To prevent this, one may want to entirelly lock out any other thread before taking a snapshot by explicitly synchronizing on the histogram, grid or bin object:
 ...
 // want to take a snapshot.
 double variance, rms;
 int size;
 synchronized (histo) { // locking anybody else out; entering atomic block
 	variance = histo.variance();
 	size = histo.size();
 	rms = histo.rms();
 }
 // finished taking a snapshot; other threads may again work upon histo.
 System.out.println("var="+variance+", size="+size+", rms="+rms);
 
Fully synchronized designs as used in this toolkit have consequences concerning performance. See the next chapter "Performance".

2. Performance

Fully synchronized designs as used in this toolkit have consequences concerning performance. Synchronized message dispatch incurrs, depending on the VM, medium to large performance overheads compared to normal message dispatch. However, high performance can still be achieved by using the well known buffered filling design pattern.

Buffered filling

Note that it is not required to use buffered filling. It is merely provided as an optional facility to boost performance.

Buffered filling helps to keep a design flexible, modular and extensible while still providing high performance. It is, for example, used in I/O subsystems. The Java I/O subsystem classes {@link java.io.BufferedInputStream} and {@link java.io.BufferedOutputStream} are built around the same pattern.

In buffered filling, data is piecewise added to a fixed medium to large sized container and upon overflow of the container automatically transferred (flushed) as a whole to a target consumer object. Filling data in chunks via addAllOf methods rather than piecewise via add methods results in very few method invocations despite complex and deeply layered call chains. Additionally cache locality is greatly improved, which increasingly matters given that CPUs and their on-board caches are getting quicker at a much greater rate than normal RAM chips and buses. Further, organizational tasks carried out by methods such as validity checks, preprocessing, etc. are minimized.

In this toolkit, buffered filling is conveniently supported by the tuned classes {@link cern.colt.buffer.DoubleBuffer}, {@link cern.colt.buffer.DoubleBuffer2D} and {@link cern.colt.buffer.DoubleBuffer3D}. Simply connect a buffer to a histogram, grid or bin. Now instead of filling data via histogram.add(...) fill it to the buffer via buffer.add(...). The buffer will take care of the rest. Just remember one important thing: Always flush a buffer when finished with filling! If one forgets to call buffer.flush() in the end, some elements may remain in the buffer and never appear in the histogram! The following code snipped demonstrates proper usage:

 Histo1D histo = new Histo1D("my histo",new FixedPartition(2, 0.0, 100.0));
 DoubleBuffer2D buffer = new DoubleBuffer2D(histo,10000); // buffer capacity = 10000 points.
 
 for (int i=1000000; --i >= 0; ) {
     buffer.{@link cern.colt.buffer.DoubleBuffer2D#add add(i,i)}; // could also use hist.add(i,i), but would be inefficient
 }
 buffer.{@link cern.colt.buffer.DoubleBuffer2D#flush flush()};
 
 System.out.println(histo);

This package is fully functional without buffers. However, it is designed such that it delivers high performance only when using buffered filling. Try and use buffers, if performance matters!

 

Multiple concurrent writer threads together with buffered filling

When using multiple concurrent writer threads together with buffered filling, each writer must have its own buffer. Why? While histograms, grids, bins etc. are fully thread safe (synchronized), buffers are not. Thus, if all threads would share the same buffer and write to it at the same time without using any mutual exclusion protocol, they could corrupt the buffer. Here is a snippet demonstrating proper usage:

Grid1D grid = new Grid1D(new FixedPartition(2, 0.0, 100.0), new WeightedBin2D());
Histo1D histo = new Histo1D("my histo", grid, null);

// construct and start 3 writer threads filling the histogram
Writer a = new Writer(histo);
Writer b = new Writer(histo);
Writer c = new Writer(histo);
a.start(); b.start(); c.start();

// wait until all writers are finished
try { a.join(); } catch (InterruptedException exc) {}
try { b.join(); } catch (InterruptedException exc) {} 
try { c.join(); } catch (InterruptedException exc) {} 

// display results
System.out.println(histo);
System.exit(0);


// now the definition of a filling thread...
class Writer extends {@link java.lang.Thread Thread} {
	private DoubleBuffer2D buffer; // my own private buffer, no other thread can see and touch it.
	
	public Writer(Histo1D histo) { // constructor
		this.buffer = new DoubleBuffer2D(histo,10000);
	}
	
	public void run() { // automatically called upon thread start
		for (int i=0; i<1000000; i++) {
			buffer.add(i,i*i);
		}
		buffer.flush();
	}
}

Benchmarks

TODO

3. Advanced statistics on dynamic bins

In case not each and every statistics measure needed is directly provided by methods of histograms, grids or bins one can use dynamic bins and retrieve their filled elements. From these elements, one can compute whatever necessary, either by using a statistics library or self written functions. Use methods {@link hep.aida.bin.Histo#getGrid}, {@link hep.aida.bin.Grid#getBin}, {@link hep.aida.bin.DynamicBin1D#elements()} and the descriptive statistics library {@link cern.jet.stat.Descpritive}.

TODO