Histograms

Published on 23 Mar 2004
Tags #Statistics

A series of measurements $x_1, \dots, x_n$ is a one-dimensional list or array, and storing every value individually is very space inefficient. A histogram is a two-dimensional data structure (buckets and their counts) that can be configured to a custom trade-off between space and accuracy. The values are sorted into buckets according to their size.

Two properties control the trade-off that a histogram represents:

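Granularity $g$. This is the width of the individual buckets. It controls how much accuracy is lost due to rounding: each value is rounded down by less than $g$.
Buckets $m$. The number of buckets that the histogram consists of. Together with the granularity, it determines the maximum value that can be recorded in the histogram: with buckets numbered from $0$ to $m - 1$, a value must be smaller than $m \cdot g$ to fit.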
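
To make the trade-off concrete, here is a minimal sketch of what a given configuration can record. The function name `describe_config` and its dictionary keys are hypothetical, and the sketch assumes non-negative measurements and buckets numbered from 0 to m - 1.

```python
def describe_config(g, m):
    """Summarize the trade-off for granularity g and m buckets.

    Assumption: values are non-negative and buckets are numbered 0 .. m-1.
    """
    return {
        "counters_stored": m,      # space: one counter per bucket
        "upper_limit": m * g,      # a value must be below this to fit
        "rounding_error_bound": g, # each value is rounded down by less than g
    }


config = describe_config(g=5, m=20)
```

With a granularity of 5 and 20 buckets, the histogram stores 20 counters regardless of how many measurements are recorded, and it covers values below 100.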

To construct the histogram, each individual value $x_i$ is assigned to a bucket $b = \lfloor \frac{x_i}{g} \rfloor$. Although the original series of measurements cannot be reconstructed, an approximation can be generated from the histogram:

Calculate the value that a bucket corresponds to: $x_b = b \cdot g$.
Insert the value $x_b$ once for every value counted in the bucket, which may be zero or more times.
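
The construction and the approximation described above can be sketched as follows. The function names `build_histogram` and `approximate_series` are hypothetical, and rejecting values that do not fit into the $m$ buckets is an assumption of this sketch, since the article does not say how such values should be handled.

```python
from math import floor


def build_histogram(values, g, m):
    """Count how many values fall into each of the m buckets of width g."""
    counts = [0] * m
    for x in values:
        b = floor(x / g)  # bucket index: b = floor(x_i / g)
        if not 0 <= b < m:
            # Assumption: out-of-range values are rejected, not clamped.
            raise ValueError(f"value {x} does not fit into {m} buckets of width {g}")
        counts[b] += 1
    return counts


def approximate_series(counts, g):
    """Rebuild an approximation: insert x_b = b * g once per counted value."""
    series = []
    for b, count in enumerate(counts):
        series.extend([b * g] * count)
    return series


measurements = [3, 7, 8, 12, 19]
counts = build_histogram(measurements, g=5, m=4)
approx = approximate_series(counts, g=5)
```

Here `counts` is `[1, 2, 1, 1]` and `approx` is `[0, 5, 5, 10, 15]`: every measurement has been rounded down to the lower edge of its bucket, so the individual values are lost but their rough magnitudes survive.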

Because each bucket of a histogram contains an absolute number (i.e. the number of measurements of the corresponding magnitude), it is very useful for visualizing and analyzing the values in a series of measurements. Outliers can be easily identified by looking for buckets with an exceptionally high or low number of values.

NOTE: A histogram is not suitable for comparing two or more series of measurements because its bucket counts are absolute and therefore depend on how many measurements each series contains. Distributions are a better alternative.
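
As a rough illustration of why this helps, the sketch below converts absolute bucket counts into relative frequencies. It assumes that "distribution" here means the share of measurements per bucket, which the article does not spell out; the function name `to_distribution` is hypothetical.

```python
def to_distribution(counts):
    """Convert absolute bucket counts into relative frequencies."""
    total = sum(counts)
    if total == 0:
        raise ValueError("cannot normalize an empty histogram")
    return [c / total for c in counts]


# Two series with the same shape but different numbers of measurements.
small = to_distribution([1, 2, 1])     # 4 measurements
large = to_distribution([10, 20, 10])  # 40 measurements
```

The absolute counts of the two histograms differ by a factor of ten, yet both normalize to `[0.25, 0.5, 0.25]`, which makes the two series directly comparable.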


Nicholas Dille
