Histograms
Published on 23 Mar 2004. Tags: #Statistics
A series of measurements $x_i = x_1, \dots, x_n$ is a one-dimensional list or array, which is space-inefficient to store because every value is kept individually. A histogram is a two-dimensional data structure that can be configured to a custom trade-off between space and accuracy. The values are sorted into buckets according to their size.
There are two properties controlling the trade off that a histogram represents:
- Granularity $g$. This is the width of the individual buckets. It controls how much accuracy is lost due to rounding.
- Buckets $m$. The number of buckets that the histogram consists of. Together with the granularity, it determines the maximum value that can be recorded: with buckets numbered from 0, only values below $m \cdot g$ fit into the histogram.
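To make the trade-off concrete, here is a minimal sketch that compares two configurations covering the same range. The function name `describe_config` is hypothetical, and the sketch assumes that buckets are numbered from 0 and that each value is rounded down to the lower edge of its bucket.

```python
def describe_config(g, m):
    """Summarise what a histogram with bucket width g and m buckets can hold."""
    return {
        "counters": m,             # storage: one counter per bucket
        "range": (0, m * g),       # values in [0, m*g) fit (assumes buckets start at 0)
        "rounding_error_below": g  # a value loses less than g when rounded down
    }

# Two configurations that cover the same range [0, 100):
coarse = describe_config(g=10, m=10)   # few counters, larger rounding error
fine = describe_config(g=1, m=100)     # many counters, smaller rounding error
```

Both configurations record values up to (but not including) 100; the coarse one stores ten times fewer counters and accepts a rounding error that is up to ten times larger.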
To construct the histogram, each individual value $x_i$ is assigned to a bucket $b$: $b = \lfloor \frac{x_i}{g} \rfloor$. Although the original series of measurements cannot be reconstructed, an approximation can be generated from the histogram:
- Calculate the value that a bucket corresponds to: $x_b = b \cdot g$
- The value $x_b$ has to be inserted zero or more times, corresponding to the number of values in the bucket.
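The construction and the approximation steps can be sketched in a few lines of Python. The function names `build_histogram` and `approximate_series` are illustrative, and rejecting values that fall outside the $m$ buckets is an assumption; the text does not say how such values should be handled.

```python
import math

def build_histogram(values, g, m):
    """Count how many values fall into each of the m buckets of width g."""
    counts = [0] * m
    for x in values:
        b = math.floor(x / g)  # b = floor(x_i / g)
        if not 0 <= b < m:
            # Assumption: out-of-range values are rejected.
            raise ValueError(f"{x} does not fit into {m} buckets of width {g}")
        counts[b] += 1
    return counts

def approximate_series(counts, g):
    """Rebuild an approximation: emit x_b = b * g once per counted value."""
    series = []
    for b, count in enumerate(counts):
        series.extend([b * g] * count)
    return series

measurements = [3, 7, 12, 14, 18, 21, 29]
counts = build_histogram(measurements, g=10, m=3)
approx = approximate_series(counts, g=10)
```

Here `counts` is `[2, 3, 2]` and `approx` is `[0, 0, 10, 10, 10, 20, 20]`: seven values are reduced to three counters, and every value is rounded down to the lower edge of its bucket.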
Because each bucket of a histogram contains an absolute number (i.e. the number of measurements of the corresponding magnitude), a histogram is very useful for visualizing and analyzing the values in a series of measurements. Outliers can be easily identified by looking for buckets with an exceptionally high or low number of values.
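As a rough illustration of the visualization, the bucket counts can be rendered as a text bar chart. This is only a sketch: `render_histogram` is a hypothetical helper, and the bucket counts in the example are made up.

```python
def render_histogram(counts, g):
    """Return one text line per bucket: its value range, a bar of '#' marks, and the count."""
    lines = []
    for b, count in enumerate(counts):
        low, high = b * g, (b + 1) * g  # bucket b covers [b*g, (b+1)*g)
        lines.append(f"[{low:>3}, {high:>3}) {'#' * count} {count}")
    return lines

example_counts = [2, 3, 0, 1]  # made-up bucket counts
for line in render_histogram(example_counts, g=10):
    print(line)
```

Buckets with an unusually long or short bar stand out immediately in this form.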
NOTE: A histogram is not suitable for comparing two or more series of measurements because its bucket counts are absolute and therefore depend on how many measurements each series contains. Distributions are a better alternative.
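One way to see why is to divide every bucket count by the total number of measurements. The sketch below assumes that "distribution" here means such relative frequencies; `to_distribution` is a hypothetical name and the counts are made up.

```python
def to_distribution(counts):
    """Convert absolute bucket counts to relative frequencies that sum to 1."""
    total = sum(counts)
    if total == 0:
        raise ValueError("cannot normalise an empty histogram")
    return [count / total for count in counts]

small = [1, 2, 1]      # 4 measurements
large = [25, 50, 25]   # 100 measurements with the same shape

small_dist = to_distribution(small)
large_dist = to_distribution(large)
```

The absolute counts of the two series differ by a factor of 25, yet both yield the same relative frequencies, `[0.25, 0.5, 0.25]`, so the two series can be compared directly.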