LESSON 6 QUESTIONS: Scatter plots, curve fitting and correlation

FOCUS QUESTION: How can I determine whether two variables are related?

Contents

EXAMPLE 1: Load the Darwin Finch beak size parentage data

    beaks = load('DaphneIsland.txt');

Questions Answers
What does the beaks variable hold? The beaks variable holds an array of the values of the data file DaphneIsland.txt. Each line of the file corresponds to one row of beaks. The columns of beaks are the individual tab-separated numbers on each line.
What if the file contains a mixture of text and numeric values in text format? The load and csvread functions won't work for those types of files. MATLAB provides a variety of other functions such as textscan and importdata to handle these more complicated situations.


EXAMPLE 2: Define variables for child, mother, father, and parent beak sizes

    child = beaks(:, 1);                % Size of offspring beaks
    mother = beaks(:, 2);               % Size of maternal beaks
    father = beaks(:, 3);               % Size of paternal beaks
    parent = mean(beaks(:, 2:3), 2);    % Average the parent beak sizes


EXAMPLE 3: Calculate the beak size correlations

    fprintf('Daphne Island ground finch beak size correlations:\n');
    fprintf('   mother-child:\t %g\n', corr(mother, child));
Daphne Island ground finch beak size correlations:
   mother-child:	 0.75621

Questions Answers
What does the corr function do? The corr function computes the correlation between two variables. Correlation is a standard statistical measure of how related two variables are. If the corresponding values of the two variables go up and down in a similar fashion, the correlation will be close to 1. If the values go in opposite directions, the correlation will be close to -1. If there is not much relationship, the correlation will be close to 0.
Can the value of the correlation ever go above 1? No, the value of the correlation is always in [-1, 1].
Why didn't this example assign the correlation values to variables? Since we only were going to print out the correlations and not use them in subsequent calculations, we didn't really need variables to hold these values.
Can corr(x, y) and corr(y, x) be different values? No, the values are the same. The corr function is said to be symmetric in its arguments.
Do the column vectors x and y have to have the same number of elements in order to compute corr(x, y)? Yes, x and y have to have the same number of elements. (When the arguments of corr are arrays, the situation is more complicated.)
Can I compute the correlation between two variables that are of vastly different magnitudes? Yes, corr scales the values as part of the computation.
Why can't I find the correlation of two scalar values? The correlation between two scalar values (e.g., corr(1, 3)) is undefined and corr returns NaN (which stands for "not a number"). The underlying reason is that correlation divides each column by its standard deviation as part of the scaling. The standard deviation of a single value is 0, so this scaling results in division by zero.
Does corr work for matrices? Yes, if X and Y are matrices with the same number of rows, corr(X, Y) returns a matrix of correlations. The (i, j)-th entry in the result is the correlation between column i of X and column j of Y.

EXAMPLE 4: Plot the average parent beak size against the child beak size

    pcCor = corr(parent, child);               % Calculate parent-child correlation
                                               % Create a title string
    tString = ['Daphne Island ground finches (corr = ' num2str(pcCor) ')'];
    figure('Name', tString)                    % Put this title on the window
    plot(parent, child, 'ko')                  % Plot a scatter plot
    xlabel('Mean parent beak size (mm)');      % Label the x-axis
    ylabel('Child beak size (mm)');            % Label the y-axis
    title(tString);                            % Put this title on the graph

Questions Answers
What is a scatter plot? A scatter plot graphs two variables against each other without displaying the connecting lines. The purpose of a scatter plot is to reveal relationships between variables.
How does a scatter plot differ from a line graph? In a line graph, successive pairs of points are connected by straight lines. The data must be ordered so that successive points are related. All of the line graphs that we have looked at so far use time for the first coordinate and rely on the data points being sorted either in increasing or decreasing order by time. In contrast, the data for scatter plots can be ordered in any way because no relationship is assumed between successive pairs of points.
Does MATLAB have a special function for drawing scatter plots? The MATLAB scatter function specifically draws a scatter plot. The plot function includes all of the capability of scatter, so we just use plot in this lesson.

EXAMPLE 5: Edit the figure of EXAMPLE 4


EXAMPLE 6: Display the plot in a new Figure Window

  open BeakSize.fig;


EXAMPLE 7: Calculate and print the best fit lines for beak parentage

Create a new cell in which you type and execute:

    fprintf('Best fit lines for beak size of Daphne Island ground finches:\n');

    mPoly = polyfit(mother, child, 1);   % Linear fit of mother vs child
    fprintf('   mother-child:\t y = %g*x + %g\n', mPoly(1), mPoly(2));
Best fit lines for beak size of Daphne Island ground finches:
   mother-child:	 y = 0.658622*x + 2.66147

Questions Answers
What does the 1 stand for in expression pMother = polyfit(mother, child, 1)? The 1 specifies the degree of the polynomial to fit to the data.
What is in the variable pMother after pMother = polyfit(mother, child, 1) is evaluated? The pMother(1) is the slope and pMother(2) is the intercept of the best linear fit of mother versus child.
How would I fit a quadratic to a set of data points (x, y)? Use q = polyfit(x, y, 2).


EXAMPLE 8: Evaluate each of the best fit lines to find predicted values

    mPred = polyval(mPoly, mother);   % Evaluate linear fit of mother equation

Questions Answers
What does the polyval function do? The polyval function evaluates a polynomial.
How is the polynomial to be evaluated specified? The first argument of polyval is a vector containing the coefficients of the polynomial. The coefficients are ordered with the coefficient of the highest power term being first.
How can I evaluate the polynomial y = 5x2 - 3x + 6 for x = -3? The polynomial y = 5x2 - 3x + 6 is represented by the coefficient vector p = [5, -3, 6]. The evaluation is polyval(p, -3).
How many coefficients does a cubic polynomial have? Polynomials have one more coefficient than their degree (due to the presence of the constant term). A cubic or degree 3 polynomial has 4 coefficients..
Suppose A is a 3 x 4 array and p contains the coefficients of a first degree polynomial. What size is B after the statement B = polyval(p, A) is executed? The polyval function returns an array that is the same size as A regardless of how big p is. That is, a polynomial evaluation produces a single y value for each x. Hence, B is a 3 x 4 array.


EXAMPLE 9: Find the difference between actual child beak sizes and predictions

    mErrors = child - mPred;     % Actual - predicted by mother's size

Questions Answers
What does mPred represent? The mPred values are the predictions of the child beak size from a linear model fitted from the mother's beak sizes.
Does mErrors represent the error in measurement of the mother's beak sizes? No. The mErrors variable represents the errors made when predicting the child beak sizes from the mother beak sizes.
Are mErrors always greater than zero? No. The values of the elements of mErrors may be positive or negative depending on whether the individual predictions overestimate or underestimate the actual child beak size.


EXAMPLE 10: Find the root mean squared error (RMS) between actual and predicted sizes

Create a new cell in which you type and execute:

    mRMS = sqrt( mean(mErrors.* mErrors) );

Questions Answers
What effect does squaring the error have? The obvious effect is that squaring makes all the values non-negative so it is easier to compare magnitudes. Squaring also puts more weight on larger values.
Why take the square root? Without the square root, the error is called the mean-squared error (MSE). With the square it is called the root mean squared error (RMS). RMS error has the same units as the original data, while MSE does not. For example the error in beak sizes has units of mm, since the beak size measurements are in mm. The MSE has units of mm2. Taking the square root brings us back to mm. Thus, we can make reasonable comparisons between the RMS error and the original beak sizes.


EXAMPLE 11: Output the root mean squared error (RMS)

    fprintf('Root mean squared error (RMS) for Best fit lines:\n');
    fprintf('   mother-child:\t %g\n', mRMS);
Root mean squared error (RMS) for Best fit lines:
   mother-child:	 0.484323


This lesson was written by Kay A. Robbins of the University of Texas at San Antonio and last modified on 31-Dec-2010. Please contact krobbins@cs.utsa.edu with comments or suggestions.