Histograms
Published on 23 Mar 2004
Tags #Statistics
A series of measurements $x_1, \dots, x_n$ is a one-dimensional list or array, which is by nature very space-inefficient to store. A histogram is a two-dimensional data structure that can be configured for a custom trade-off between space and accuracy. The values are sorted into buckets according to their size.
Two properties control the trade-off that a histogram represents:
- Granularity $g$. This is the width of the individual buckets. It controls how much accuracy is lost due to rounding.
- Buckets $m$. The number of buckets that the histogram consists of. Together with the granularity, it determines the largest value that can be recorded: with buckets numbered from $0$ to $m - 1$, only values smaller than $m \cdot g$ fit into the histogram.
To construct the histogram, each individual value $x_i$ is assigned to a bucket $b$: $b = \lfloor \frac{x_i}{g}\rfloor$. Although the original series of measurements cannot be reconstructed, an approximation can be generated from the histogram:
- Calculate the value that a bucket corresponds to: $x_b = b \cdot g$
- Insert the value $x_b$ zero or more times, once for every value counted in the bucket.
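The construction and the approximation can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `build_histogram` and `approximate_series` are made up for this example, and rejecting values outside the range of the histogram is an assumption, since the text does not say how such values should be handled.

```python
from math import floor


def build_histogram(values, g, m):
    """Count how many values fall into each of the m buckets of width g."""
    counts = [0] * m
    for x in values:
        b = floor(x / g)  # bucket index: b = floor(x_i / g)
        # Assumption: values that do not fit into buckets 0..m-1 are rejected.
        if not 0 <= b < m:
            raise ValueError(f"{x} is outside the range [0, {m * g})")
        counts[b] += 1
    return counts


def approximate_series(counts, g):
    """Emit x_b = b * g once for every value counted in bucket b."""
    return [b * g for b, n in enumerate(counts) for _ in range(n)]


measurements = [3, 7, 12, 18, 19, 41]
hist = build_histogram(measurements, g=10, m=5)
approx = approximate_series(hist, g=10)

print(hist)    # [2, 3, 0, 0, 1]
print(approx)  # [0, 0, 10, 10, 10, 40]
```

Six measurements are stored as five counters, and every reconstructed value is rounded down to the lower edge of its bucket, so the error per value is always smaller than $g$.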
Because each bucket of a histogram contains an absolute count (i.e. the number of measurements of the corresponding magnitude), a histogram is very useful for visualizing and analyzing the values in a series of measurements. Outliers can be easily identified by looking for buckets with an exceptionally high or low number of values.
NOTE: A histogram is not suitable for comparing two or more series of measurements because of its absolute nature. Distributions are a better alternative.
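To illustrate why absolute counts are hard to compare, the sketch below divides each bucket count by the total number of measurements. Reading "distribution" as relative frequencies is an assumption made for this example, and the helper name `to_relative_frequencies` is hypothetical.

```python
def to_relative_frequencies(counts):
    """Turn absolute bucket counts into fractions of all measurements."""
    total = sum(counts)
    if total == 0:
        # An empty histogram has no measurements to distribute.
        return [0.0] * len(counts)
    return [n / total for n in counts]


# Two series with the same shape but very different lengths.
short_series = [2, 3, 0, 0, 1]     # 6 measurements
long_series = [20, 30, 0, 0, 10]   # 60 measurements

rel_short = to_relative_frequencies(short_series)
rel_long = to_relative_frequencies(long_series)
```

The raw counts of the two series differ by a factor of ten in every non-empty bucket, while the relative frequencies are the same, which makes the two series directly comparable.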