LESSON 10: Histograms questions
FOCUS QUESTION: How can I understand and compare the distributions of two data sets?
Contents
- EXAMPLE 1: Load the Daphne Island and Santa Cruz Island beak size data
- EXAMPLE 2: Display a histogram of the Daphne Island beak size data
- EXAMPLE 3: Calculate and display a histogram using a bar chart, line graph and stair plot
- EXAMPLE 4: Explore the data cursor feature
- EXAMPLE 5: Use different bin sizes for Daphne Island histograms
- EXAMPLE 6: Compare beak distributions of Daphne and Santa Cruz Islands
- EXAMPLE 7: Calculate histograms with explicit bin positions
- EXAMPLE 8: Use overlapping stair plots to compare scaled distributions
- EXAMPLE 9: Calculate the cumulative distributions for both data sets
- EXAMPLE 10: Plot the cumulative distributions for both data sets
- EXAMPLE 11: Generate "random" numbers from three common probability distributions
- EXAMPLE 12: Display the histograms of the generated distributions
EXAMPLE 1: Load the Daphne Island and Santa Cruz Island beak size data
Daphne = load('DaphneIslandBeaks.txt');
SantaCruz = load('SantaCruzIslandBeaks.txt');
| Questions | Answers |
| Why was the data for the two islands put in separate files? | The two data sets were not the same size, so you could not simply put them into two columns. Although you could use a missing designator to artificially make the two data sets the same size, the measurements are independent. Thus, arranging the two data sets as side-by-side columns in the same file might be misleading. |
EXAMPLE 2: Display a histogram of the Daphne Island beak size data
nDaphne = length(Daphne); % Find number of Daphne Island finches
titleDaphne = ['Daphne Island ground finches (n=' num2str(nDaphne) ')'];
figure('Name', titleDaphne); % Create a titled figure window
hist(Daphne) % Calculate and plot the histogram
xlabel('Beak size in mm');
ylabel('Number of birds');
title(titleDaphne); % Use same title for plot as window
| Questions | Answers |
| What is a histogram? | A histogram is a frequency table, that is, a table listing how many times each value (or range of values) appears in a data set. |
| What is a frequency table? | A frequency table records how many times each data value occurs in the data set. If the data set only has a small number of values, we keep a count for each possible value. For data sets that contain real numbers or have a large number of possible discrete values, we use a binned frequency table. |
| What does the hist function do? | The hist function computes a binned frequency table or histogram for the data. Since the example did not have output arguments, the resulting frequency table is plotted as a bar chart rather than returned as an array. |
| What is a binned frequency table? | A binned frequency table divides the possible data values into subranges called bins and counts how many values fall into each bin. |
| How many bins does hist use? | By default, the hist function uses 10 equal-sized bins that span the range of the data. (You may also explicitly specify the bins as in later examples.) |
| Does a histogram always have to be displayed as a bar chart? | No. The bar chart is a common visual representation of a histogram but not the only useful one. |
| Does the hist function always display a figure? | No. If you use the output arguments, as shown in the next example, the hist function does not produce a figure. |
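The binned frequency table described above can be sketched by hand. The following is a minimal illustration, assuming a small made-up data vector (data) and five bins instead of the lesson's beak files. The edge convention used here (each bin includes its left edge, and the last bin also includes the maximum) is an assumption chosen for clarity and may differ in detail from what hist does internally.

```matlab
% Hypothetical sample data; not taken from the lesson's beak files
data = [8.1 8.4 8.9 9.0 9.3 9.3 9.7 10.2 10.8 11.5];
numBins = 5;                                  % Number of equal-width bins

% Bin edges span the range of the data
edges = linspace(min(data), max(data), numBins + 1);

counts = zeros(1, numBins);
for k = 1:numBins
    if k < numBins
        inBin = data >= edges(k) & data < edges(k + 1);
    else
        % Last bin also includes the maximum value
        inBin = data >= edges(k) & data <= edges(k + 1);
    end
    counts(k) = sum(inBin);                   % How many values fall in bin k
end

centers = (edges(1:end-1) + edges(2:end)) / 2; % Bin centers
disp([centers; counts])                        % Row 1: centers, row 2: counts
```

Every data value lands in exactly one bin, so the counts add up to the number of values in the data set.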
EXAMPLE 3: Calculate and display a histogram using a bar chart, line graph and stair plot
[n, xout] = hist(Daphne); % Calculate histogram but don't display figure
hold on
bar(xout, n); % Plot histogram using a bar chart
plot(xout, n, '-ok') % Plot histogram using a black line graph
stairs(xout, n, 'r') % Plot histogram using a red stair plot
hold off
xlabel('Beak size in mm');
ylabel('Number of birds');
title(titleDaphne);
datacursormode on % Turn the data cursor on for exploration
| Questions | Answers |
| How does [n, xout] = hist(Daphne) differ from the hist(Daphne) of EXAMPLE 2? | When you use output arguments with the hist function, MATLAB does not draw a figure. Rather, the hist function returns the frequency counts and the centers of the bins. |
| Did I need to assign the result of hist to variables? | Yes, if you want to get the values in the frequency table rather than just see a plot. Use the form with output arguments when you want to do your own display or if you want to compute something else from the frequency table. |
| When would I need the bin positions and counts from a histogram? | This example illustrates using these values to display the histogram in three different ways. You might also want to do further computations on these values or compute a cumulative probability distribution. |
| What does '-ok' mean in plot? | The '-ok' is shorthand for black (k) circular markers (o) that are connected with a solid line (-). |
| Are there other shortcuts for specifying plot characteristics? | Yes. Several more shortcuts appear in this lesson. See LineSpec in the MATLAB help for a complete list. |
| What is the difference between plot and stairs? | The plot function connects each consecutive (x, y) pair with a straight line. The stairs function connects each consecutive (x, y) pair with a staircase. MATLAB draws a horizontal line between the x values at the level of the first y value. At the second x value, MATLAB draws a vertical line between the two y values to form a stair. |
| What is the data cursor? | The data cursor feature allows you to read the coordinates of points on a graph by clicking on them. |
| What does datacursormode on do? | This command turns on the data cursor for the current figure. |
| Can I turn on the data cursor from the plottools? | Yes. Use Tools->Data Cursor on the Figure Window menubar. Alternatively, use the data cursor icon. |
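The difference between plot and stairs can be made concrete by building the staircase coordinates by hand. This is a minimal sketch using hypothetical x and y vectors (not the lesson's histogram output): each x value except the first is repeated so that a horizontal step is followed by a vertical jump.

```matlab
% Hypothetical points; in the lesson these would be xout and n from hist
x = [1 2 3 4];
y = [3 7 5 2];

% Duplicate every coordinate, then shift so each level is held until the next x
xs = reshape([x; x], 1, []);   % 1 1 2 2 3 3 4 4
ys = reshape([y; y], 1, []);   % 3 3 7 7 5 5 2 2
xs = xs(2:end);                % 1 2 2 3 3 4 4
ys = ys(1:end-1);              % 3 3 7 7 5 5 2

figure
hold on
plot(x, y, '-ok')              % Straight lines between consecutive points
plot(xs, ys, 'r')              % Hand-built staircase through the same points
hold off
```

Plotting xs against ys with plot traces the same kind of path that stairs(x, y, 'r') draws: a horizontal segment at each y level, then a vertical segment at the next x value.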
EXAMPLE 4: Explore the data cursor feature
EXAMPLE 5: Use different bin sizes for Daphne Island histograms
figure
subplot(3, 1, 1)
hist(Daphne, 10) % Create and plot a 10-bin histogram
title(titleDaphne) % Put title over topmost graph
legend('10 bins')
ylabel('Birds')
subplot(3, 1, 2)
hist(Daphne, 25) % Create and plot a 25-bin histogram
legend('25 bins')
ylabel('Birds')
subplot(3, 1, 3)
hist(Daphne, 100) % Create and plot a 100-bin histogram
legend('100 bins')
xlabel('Beak size in mm')
ylabel('Birds')
| Questions | Answers |
| What does the 10 represent in the first call to hist? | The 10 specifies the number of bins to use in the frequency table. The default number of bins is 10, so the first call to hist behaves the same as hist(Daphne). The second call to hist uses 25 bins. Notice that the bars on the corresponding graph are thinner because more of them must fit in the same area. |
| Should I always use a large number of bins for a histogram? | Choosing the right bin size is sometimes a tricky trade-off. If you choose too few bins, the poor resolution may hide interesting features. If you choose too many bins, some bins will be sparsely occupied and the histogram may take on a jagged appearance. You may also miss essential features. It is usually good to experiment with the bin size to see what the trade-offs are. |
| How does MATLAB determine the positions of the bins? | When the second argument is a number of bins, MATLAB divides the range of the data values in the first argument into that number of equal-sized bins. Your data can't contain +inf or -inf. |
| What happens if I move the xlabel statement after the first hist? | The xlabel adds an x-axis label to the current axes. The top histogram's x-axis will be labeled. |
| What happens if I move the xlabel statement directly after the first subplot? | The xlabel adds an x-axis label to the current axes, which were created by the subplot. However, the hist function then redraws those axes, so the label is lost. |
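The bin-size trade-off discussed in this example can also be explored numerically. The sketch below assumes a small hypothetical data vector (data) and three arbitrary bin counts. It reports the bin width and the number of empty bins for each choice, using hist with an output argument so that no figure is drawn.

```matlab
% Hypothetical sample data; not taken from the lesson's beak files
data = [8.1 8.4 8.9 9.0 9.3 9.3 9.7 10.2 10.8 11.5];

for k = [2 5 10]                               % Arbitrary bin counts to compare
    n = hist(data, k);                         % Counts only; no figure is drawn
    binWidth = (max(data) - min(data)) / k;    % Equal-width bins over the data range
    fprintf('%3d bins: width %.2f, %d empty bins\n', k, binWidth, sum(n == 0));
end
```

As the number of bins grows, the bins get narrower and, for a small data set like this one, more of them tend to end up empty, which is what produces the jagged appearance described above.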
EXAMPLE 6: Compare beak distributions of Daphne and Santa Cruz Islands
nSantaCruz = length(SantaCruz); % Find number of Santa Cruz Island finches
figure
subplot(1, 2, 1)
hist(Daphne) % Histogram of Daphne Island finches
title(['Daphne (n=' num2str(nDaphne) ')'])
xlabel('Beak size (mm)')
ylabel('Number of birds') % Only use one y label for both axes
subplot(1, 2, 2)
hist(SantaCruz) % Histogram of Santa Cruz Island finches
title(['Santa Cruz (n=' num2str(nSantaCruz) ')'])
xlabel('Beak size (mm)')
| Questions | Answers |
| Why are the vertical scales of the two histograms different? | The counts depend on how many values each data set has. These data sets are of different size. |
| Why are the horizontal scales of the two histograms different? | The horizontal scales depend on the maximum and minimum values in the data set. |
| Can I still compare the distributions? | These histograms do not allow very effective comparison of the data. A more effective comparison would use the same bins and scale the data to be fractions of the data set rather than actual counts. |
EXAMPLE 7: Calculate histograms with explicit bin positions
% Calculate bin positions explicitly to encompass the range of both data sets
minBeak = min([min(Daphne), min(SantaCruz)]);
maxBeak = max([max(Daphne), max(SantaCruz)]);
xBins = minBeak:0.2:maxBeak; % Bin positions will be 0.2 apart
% Calculate the histograms based on these bins
[nD, xD] = hist(Daphne, xBins); % Histogram of Daphne Island
[nS, xS] = hist(SantaCruz, xBins); % Histogram of Santa Cruz Island
| Questions | Answers |
| Why was the min of the min needed to find the minimum beak size? | The first pair of min functions finds the minimum values of the Daphne and Santa Cruz data individually. The square brackets combine these values into a two-element vector. We need to apply another min to find the overall minimum. |
| What is the first element in minBeak:0.2:maxBeak? | The first element is the value of minBeak. |
| How far apart are the elements of minBeak:0.2:maxBeak? | The elements are 0.2 apart. Successive elements are integral multiples of 0.2 plus the value of the first element. |
| Is maxBeak always the last element of minBeak:0.2:maxBeak? | No, maxBeak is only the last element if it differs from minBeak by an integral multiple of 0.2. Otherwise, the sequence stops at the largest value of minBeak plus a multiple of 0.2 that doesn't exceed maxBeak. |
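The behavior of the colon operator described above can be checked with a short sketch. The endpoint values here are hypothetical and chosen only to show both cases: one where the upper limit is reached exactly and one where the sequence stops short of it.

```matlab
% Case 1: the upper limit is not a whole number of steps from the start
minVal = 1.0;                 % Hypothetical stand-in for minBeak
maxVal = 1.7;                 % Hypothetical stand-in for maxBeak
v = minVal:0.2:maxVal;        % 1.0 1.2 1.4 1.6
stopsShort = v(end) < maxVal; % true: the last element is below maxVal

% Case 2: the upper limit is exactly two steps of 0.5 from the start
w = 1:0.5:2;                  % 1.0 1.5 2.0
reachesEnd = (w(end) == 2);   % true: the last element equals the upper limit

disp(v)
disp(w)
```

In the lesson's setting, this means the last bin center in xBins may fall slightly below maxBeak.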
EXAMPLE 8: Use overlapping stair plots to compare scaled distributions
legendString = {['Daphne (n=' num2str(nDaphne) ')'], ...
['Santa Cruz (n=' num2str(nSantaCruz) ')']};
figure
hold on
stairs(xD, nD/sum(nD), 'k'); % Daphne in black
stairs(xS, nS/sum(nS), 'r'); % SC in red
xlabel('Beak size in mm');
ylabel('Fraction of birds with this beak size');
legend(legendString, 'Location', 'NorthWest');
title('Scaled comparison of Daphne and Santa Cruz finches')
hold off
| Questions | Answers |
| Why was nD divided by sum(nD)? | Since the data sets did not have the same number of elements, a comparison of the counts is not meaningful. Dividing by the total number of elements plots the fractions, which are comparable. |
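To see why scaling matters, consider two hypothetical count vectors over the same five bins, one from a data set of 10 values and one from a data set of 100 values. The numbers below are made up for illustration and are not the finch counts.

```matlab
% Hypothetical bin counts for two data sets of different sizes
countsA = [2 4 1 2 1];         % 10 values in total
countsB = [10 30 40 15 5];     % 100 values in total

% Raw counts differ by an order of magnitude; fractions share a common scale
fracA = countsA / sum(countsA);
fracB = countsB / sum(countsB);

disp([fracA; fracB])           % Each row sums to 1
```

After scaling, each row sums to 1, so the two shapes can be compared directly regardless of how many measurements each data set contains.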
EXAMPLE 9: Calculate the cumulative distributions for both data sets
cumD = cumsum(nD)/sum(nD); % Cumulative distribution of Daphne Island
cumS = cumsum(nS)./sum(nS); % Cumulative distribution of Santa Cruz Island
| Questions | Answers |
| What does cumsum do? | The cumsum function calculates the partial sums. |
| What are partial sums used for? | Partial sums are used in calculus. Cumulative sums also approximate the cumulative probability distribution for the data. |
| What do the values of cumD represent? | Each value in cumD represents the fraction of data set values that are less than or equal to the corresponding value in xD. (See EXAMPLE 7.) |
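A small worked example may help make partial sums concrete. The counts below are hypothetical (five bins, ten values in total) and stand in for nD.

```matlab
% Hypothetical bin counts standing in for nD
counts = [2 4 1 2 1];

partialSums = cumsum(counts);             % 2 6 7 9 10
cumFraction = partialSums / sum(counts);  % 0.2 0.6 0.7 0.9 1.0

disp(partialSums)
disp(cumFraction)
```

Each entry of partialSums adds the current count to all the counts before it, so the last entry equals the total and the scaled version always ends at 1.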
EXAMPLE 10: Plot the cumulative distributions for both data sets
figure
hold on
plot(xD, cumD, 'k');
plot(xS, cumS, 'r');
title('Beak size distributions at two islands in the Galápagos')
xlabel('Beak size (mm)')
ylabel('Fraction of birds with beak size less than x')
legend(legendString, 'Location', 'NorthWest'); % Reuse legend from last example
hold off
EXAMPLE 11: Generate "random" numbers from three common probability distributions
yNormal = random('norm', 0, 1, [10000, 1]); % normal with zero mean and unit sd
yUniform = random('unif', -1, 1, [10000, 1]); % uniform in the interval [-1, 1]
yExponential = random('exp', 1, [10000, 1]); % exponential with mean 1
| Questions | Answers |
| Why are the values returned by random called pseudo-random? | The sequence of values produced by random is generated by a formula and completely predictable from the implementation of random. |
| Won't I always generate the same values each time I call random? | Although the sequence of values is predictable, you can pick different places (the seed) to start, giving the appearance of unpredictability. |
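The role of the seed can be illustrated with a short sketch. It assumes a MATLAB release that provides the rng function (newer than this lesson) and uses rand from base MATLAB instead of the lesson's random function. The seed value 42 is arbitrary.

```matlab
rng(42);                       % Start the generator from a fixed, arbitrary seed
first = rand(1, 5);

rng(42);                       % Restart from the same seed
second = rand(1, 5);

sameSequence = isequal(first, second);   % true: same seed, same values

rng('shuffle');                % Seed from the current time instead
third = rand(1, 5);            % Typically differs from run to run

disp(sameSequence)
```

Fixing the seed is useful when you want repeatable results, while seeding from the clock gives the appearance of unpredictability described above.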
EXAMPLE 12: Display the histograms of the generated distributions
figure
subplot(3,1,1)
hist(yNormal,50)
title('Normal distribution (mean = 0, sd = 1)')
subplot(3,1,2)
hist(yUniform,50)
title('Uniform distribution (on interval [-1, 1])')
subplot(3,1,3)
hist(yExponential,50)
title('Exponential distribution (mean = 1)')
_This lesson was written by Kay A. Robbins of the University of Texas at San Antonio and last modified on 31-Dec-2010. Please contact krobbins@cs.utsa.edu with comments or suggestions. The image, Medium Ground Finch Geospiza fortis, Santa Cruz, Galapago, was taken by Mark Putney. The original source is http://www.flickr.com/photos/putneymark/13516124843/in/set-72157601810082531/. The image is available under common license at http://commons.wikimedia.org/wiki/File:Geospiza_fortis.jpg._