CS 1173 Data Analysis and Visualization
Review for the final exam
Objectives:
 Consolidate knowledge.
 Review for the final.
The final exam is comprehensive, but emphasizes:
 Lessons 8-15
 Labs 3 and 4
The general categories of topics are:
 Basic array manipulation (assembling matrices using comma notation and semicolon notation,
linear representation, transpose, repmat, reshape, sum,
picking out rows and columns in various ways).
 Basic graph interpretations (line graphs, bar charts, and combinations)
 Discovering relationships between variables (correlation, scatter plots, linear fits with errors)
 Characterizing data distributions and uncertainty
(basic statistical indicators, histograms, boxplots, and error bars).
 Populations and sampling (estimating
population characteristics from sample characteristics, tests of hypotheses using ttest and ttest2).
 Programming (creation of variables, assignment, vector indexing, logical expressions,
simple for loops, simple if-else statements).
Also, you will not be asked to produce any code involving figures. Hence,
you do not need to know the options for functions such as boxplot.
Note: We have reordered the lessons, and sometimes the Lesson number is referred to. If that number doesn't make sense, ignore it and find the lesson TOPIC instead.
Basic array manipulation and MATLAB functions
Skills:
 Manipulate arrays by extracting, assembling or reshaping (e.g., given
an array x, find x(:, 2:3)).
 Evaluate the standard MATLAB functions on arrays (e.g., given an array
x, find reshape(x, 2, 3)).
 Know the syntax and use of standard MATLAB functions such as sum,
max, min, mean, median,
std, diff, and reshape (e.g., given
an array X, write a MATLAB statement to find the median of
the columns).
 Be able to answer a simple word problem (e.g., given an array
disease with 3 columns [year, measles cases in that year, mumps cases
in that year], find the total number of measles cases or the total number
of cases of disease).
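The skills above can be practiced with a short sketch like the following. The arrays x and disease contain made-up values chosen only for illustration; they are not course data.

```matlab
% Hypothetical 2-by-3 array used only for illustration
x = [1 2 3; 4 5 6];

cols23 = x(:, 2:3);        % all rows, columns 2 and 3 -> [2 3; 5 6]
r = reshape(x, 3, 2);      % elements are taken column by column -> [1 5; 4 3; 2 6]
colMedians = median(x);    % median of each column -> [2.5 3.5 4.5]

% Hypothetical disease array: [year, measles cases, mumps cases]
disease = [1931 100 20; 1932 150 30; 1933 50 10];
totalMeasles = sum(disease(:, 2));         % total measles cases -> 300
totalCases = sum(sum(disease(:, 2:3)));    % total cases of both diseases -> 360
```

Note that reshape fills the new array in column order (the linear representation), which is why the result is not simply the transpose of x.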
What to review:
Handouts:
Videos:
Basic interpretation of graphs
Skills:
 Correctly read graphs of various types.
 Make quantitative comparisons and observations.
 Understand when each type of graph is appropriate.
 Understand and read different graph combinations.
What to review:
 Line graph concepts: Line graphs plot the data values
as points on an x-y graph with connecting lines between consecutive points.
Line graphs are used to show
a sequential relationship between data points, often a time relationship.
You should be able to:
 Read the axes and understand the units.
 Understand the difference between a line graph and a scatter plot.
 Correctly associate legends with lines.
 Be able to read from multiple axes and graphs with insets.
 Be able to correctly interpret and make quantitative value and rate
assertions about line graphs.
 Be able to generate appropriate labels, titles and legends based on a description of the graph.
Line graphs were introduced in Lesson 2 and Lesson 3, and each has questions associated with it.
Labs 1 and 2 emphasized line graphs.
You can find examples of line graphs in the Line graph gallery
 Pie chart concepts: Pie charts show how a quantity is broken down by percentage
into component pieces. You can only read percentages (or fractions) from a pie chart unless you
have the overall total value. Pie charts make it easy
to assess the relative size of the constituent pieces. You should be able to:
 Correctly interpret the percentages from a pie chart.
 Translate from percentages to quantities when the overall total is provided.
 Read from pie charts with insets (such as shown in the energy
pie charts in the Pie chart gallery).
 Understand the relationship between fractions of the whole and what a
pie chart displays (e.g., HW3
for your hand calculation).
Pie charts were introduced in Lesson 3.
You can find examples of pie charts in the Pie chart gallery
 Bar chart concepts: Bar charts display x-y data point relationships using
vertical bars. The y-value is the height of the bar, and the x-value is the position of the bar.
Bar charts are very versatile. The positions of the bars can designate a sequential relationship
between the x-values of successive data points (e.g., the x-values correspond to time). However,
the association between successive x-values is weaker visually than in a line graph. The x-values don't have to have a
direct association (e.g., the x-values can be categories such as setosa, virginica,
and versicolor). Side-by-side bar charts are good for comparing data with subgroups (e.g.,
results for men and women over different years). Stacked bar charts show not only the total y-value
for each value of x, but they also show how the total breaks down into its components.
You should be able to:
 Correctly interpret quantities and where appropriate, fractions on bar charts
 Correctly interpret stacked and side-by-side bar charts.
 Correctly interpret bar charts representing amounts, percentages, and rates.
Bar charts were introduced in Lesson 4.
You can find examples of bar charts in the Bar chart gallery
 Plot combinations were introduced in Lesson 8.
You can find additional combinations of these graphs at
Multiple graph gallery
Videos:
Relationships between variables: Correlation and linear models
Skills:
 Correctly interpret correlation between two variables.
 Perform a linear fit using MATLAB polyfit and polyval functions.
 Understand the idea of modeling and errors
What to review:
 Correlation: the correlation between vectors x and y is
a numerical value indicating how closely x and y go up and down
together. A correlation value close to 1 indicates that x and y
follow each other closely, while a value close to -1 indicates that x and y move
in opposite directions. A correlation value near 0 indicates that x and
y have no linear relationship.
The square of the correlation between x and y equals
the value of R^{2}, which indicates the quality of the linear fit between x
and y.
Correlation was introduced in Lesson 7.
Lab 3 and Lab 4 included calculations of correlation.
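As a minimal sketch of computing a correlation, the following uses the built-in corrcoef function on two made-up vectors (the values of x and y are assumptions for illustration only).

```matlab
% Hypothetical data vectors
x = [1 2 3 4 5];
y = [2 4 5 4 5];

C = corrcoef(x, y);   % 2-by-2 matrix of correlation coefficients
r = C(1, 2);          % correlation between x and y, about 0.7746
rSquared = r^2;       % R^2 of the linear fit of y on x, about 0.6
```

corrcoef returns a matrix, so the correlation between x and y is the off-diagonal entry; the diagonal entries are always 1.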
 Scatter plots are plots of x-y points without connecting lines.
Use scatterplots when there is no particular ordering of the data points. Scatterplots reveal
relationships between data vectors. For example, when the points fall close to a straight line,
you can effectively model the relationship between x and y
by a linear equation: y = mx + b (a linear model). You can use such a linear equation to predict
values of y associated with values of x that you didn't measure.
Scatter plots were introduced in Lesson 7.
Lab 3 included scatter plots.
You can find examples of scatterplots in the Scatter plots gallery
 Linear fits: given x and y, you should be
able to interpret the return values of polyfit in terms of y = mx + b, estimate the growth
rate from the linear fit, and calculate predicted values of y given x and
the return values of polyfit.
Linear fitting was introduced in Lesson 7.
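The following sketch shows how the return value of polyfit maps onto y = mx + b. The data are made up and lie exactly on the line y = 2x + 1, so the fit is easy to check by hand.

```matlab
% Hypothetical data lying exactly on y = 2x + 1
x = [0 1 2 3 4];
y = [1 3 5 7 9];

p = polyfit(x, y, 1);           % degree-1 fit: p = [m b]
m = p(1);                       % slope (growth rate per unit of x), about 2
b = p(2);                       % intercept, about 1
yPred = polyval(p, 5);          % predicted y at x = 5, about 11
residuals = y - polyval(p, x);  % model errors, about 0 for this data
```

For real measurements the residuals will not be zero; their size indicates how well the linear model describes the data.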
Videos:
Characterizing data distributions and descriptive statistics
Skills:
 Compute and interpret basic statistical indicators (e.g.,
max, min, mean, median,
std, mad, and iqr).
 Be able to interpret distributions from graphical representations
(histograms, box plots, and error bars).
What to review:

Basic statistical indicators: Understand the difference between the
mean and the median. Be able to translate word problems into computation of statistical indicators.
Example: The measles array contains monthly measles counts for NYC
for the years 1931:1971. The rows of measles correspond to the years and the columns
to the months.
 Find the average monthly case count of measles for each year.
 Find the overall average monthly case count for the entire data set.
Statistical indicators were introduced in Lesson 5.
Note: be sure you understand the difference between the actual standard deviation and
the unbiased estimator of the population standard deviation.
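One way to approach the measles example and the two versions of the standard deviation is sketched below. The measles array here is a small made-up stand-in (2 years by 12 months) rather than the real NYC data, and v is a hypothetical data vector.

```matlab
% Hypothetical stand-in for the measles array: rows are years, columns are months
measles = [10 20 30 40 50 60 70 80 90 100 110 120;
            5  5  5  5  5  5  5  5  5   5   5   5];

yearlyAvg = mean(measles, 2);    % average monthly count for each year -> [65; 5]
overallAvg = mean(measles(:));   % average over the entire data set -> 35

% Standard deviation: actual (divide by n) versus unbiased estimator (divide by n-1)
v = [2 4 4 4 5 5 7 9];
sActual = std(v, 1);     % normalizes by n -> 2
sUnbiased = std(v, 0);   % normalizes by n-1 (the default), about 2.1381
```

The second argument of mean selects the dimension: mean(measles, 2) averages across the columns of each row, giving one value per year.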
 Histograms are tables of how many times each value appears in a data set.
When the data set has a lot of different values, the values are binned into subintervals and the
counts of the number of points in each subinterval are given. The scaled histogram is an approximation
to the probability distribution represented by the data. Thus, a histogram plot is usually the
first step in determining how your data might be distributed.
You should be able to interpret histograms given as frequency tables or bar graphs (e.g.,
given the histogram of disease, how many cases of each type are there). You should also
be able to compare distributions by comparing their histograms.
Histograms were introduced in Lesson 9.
You can find examples of histograms in the Histograms gallery
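A frequency table can be computed without drawing a figure. The sketch below assumes the built-in histcounts function is available (MATLAB R2014b or later) and uses a made-up vector of disease codes 1, 2, and 3.

```matlab
% Hypothetical disease codes (1, 2, or 3) for eight cases
diseaseCode = [1 2 2 3 3 3 1 2];

edges = 0.5:1:3.5;                        % one bin centered on each code
counts = histcounts(diseaseCode, edges);  % cases of each type -> [2 3 3]
scaled = counts / sum(counts);            % scaled histogram (fractions sum to 1)
```

The scaled version is the approximation to the probability distribution mentioned above.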
 Boxplots display the distribution of data using a box with different features to
mark different points such as the median and the interquartile range. Boxplots give a good representation
of data sets with outliers. Side-by-side boxplots are useful for comparing the distributions of
different data sets.
You should be able to describe the meaning of the various features of the box plot
(e.g., given a box plot, what is the IQR, median, fence, etc.). You should also be able
to compare two distributions by comparing their respective boxplots.
Boxplots were introduced in Lesson 12.
The Box plots handout gives a diagram of the
basic features of the box plot.
You can find examples of box plots in the Box plots gallery
 Error bar graphs display data as a central point with a range. Error bar graphs can
be used to depict spread, for example by using the mean as the central point and the standard deviation for
the range.
Error bar graphs can also be used to depict uncertainty. For example, suppose you want to use a sample to estimate
the population mean. The center point is the sample mean (which estimates the population mean). The
range can be the standard error of the mean (SEM) or the 95% confidence interval for the estimate
of the population mean based on the SEM. SEM
error bars indicate the standard deviation of the sample mean when it is used as an estimate of the population mean.
Error bars were introduced in Lesson 6.
Lab 2 used error bars
You can find examples of error bars in the Error bar gallery
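The quantities behind such error bars can be computed as in the sketch below. The sample values are made up, and the 1.96 multiplier is an assumption based on the normal approximation; for small samples a multiplier from the t distribution would be more appropriate.

```matlab
% Hypothetical sample
sample = [4 8 6 5 7];

n = numel(sample);
m = mean(sample);        % sample mean (estimates the population mean) -> 6
s = std(sample);         % unbiased estimate of the population standard deviation
sem = s / sqrt(n);       % standard error of the mean, about 0.7071

% Approximate 95% confidence interval for the population mean
% (normal approximation; the 1.96 factor is an assumption of this sketch)
ci95 = [m - 1.96*sem, m + 1.96*sem];
```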
Handouts:
Videos
Hypothesis testing, normality, uncertainty, sampling and logarithms
Skills:
 Understand the concept of sampling to estimate population characteristics.
 Distinguish between sample and population indicators (e.g.,
estimate the population mean or standard deviation given a sample; what
does the SEM represent?)
 Formulate null and alternative hypotheses for a problem.
 Test hypotheses using the two-sample t-test (ttest2).
What to review:
 Populations and samples: the notion of measuring characteristics
of a sample and extrapolating to a population underpins much of science. You should understand
that statistical tests rely on assumptions that are never exactly fulfilled in
practice. The most common assumptions are that the sample is drawn at random from
the population and that the population has a particular distribution (usually normal).
You should be able to argue (based on common sense) how well a sample reflects the
population as a whole. You should also understand how the characteristics of a sample can
be used to estimate population characteristics.
The handout Populations and samples
explains how characteristics of a population can be estimated from samples.
 The two-sample t-test is a statistical test performed with the MATLAB
ttest2 function. This test indicates whether the true means of two
populations are likely to be different based on the evidence of a sample drawn from
each of the respective populations. Investigations looking for differences in the
means of two populations appear often in medical investigations (e.g., differences
between treated and control groups).
The ttest2 function has three return values: h, p, and
ci.
The value of h indicates whether the null hypothesis
should be rejected in favor of the alternative hypothesis (e.g., if h
is 1, reject: the means are likely to be different).
The p represents
the p-value, which is the probability of seeing a difference between the samples at least as large
as the one observed if the true means were actually the same. (A small p-value supports rejection of the null hypothesis
in favor of the alternative.)
The ci represents the confidence interval for the difference of the true
population means. If the ci does not include zero, the true means are
likely to be different. The farther away from 0 the ci is, the more likely
it is that the means are different.
You should be able to apply ttest2 to a problem, formulating the
null and alternative hypotheses and explaining the results. For example,
given a question such as "do men sleep
longer than women" and a sample from each population, you should be able to formulate a t-test
and interpret the results. You should also be able to interpret a confidence interval
and a p-value.
The ttest2 was introduced in Lesson 11.
Lab 4 uses ttest2.
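A minimal sketch of applying ttest2 to the sleep question follows. It assumes the Statistics and Machine Learning Toolbox is installed, and the two samples are made-up hours of sleep, not real data. The null hypothesis is that the two population means are equal; the alternative is that they differ.

```matlab
% Hypothetical samples: hours of sleep per night
menSleep   = [7.1 6.8 7.5 6.9 7.3 7.0];
womenSleep = [7.6 7.9 7.4 8.1 7.7 7.8];

% Two-sample t-test at the default 5% significance level
[h, p, ci] = ttest2(menSleep, womenSleep);

if h == 1
    disp('Reject the null hypothesis: the means are likely to be different.');
else
    disp('Cannot reject the null hypothesis of equal means.');
end
```

Here p is the p-value and ci is the 95% confidence interval for the difference of the true population means (men minus women, in the order the samples were passed).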
 Logarithmic scales provide an alternative to linear scales. If your data is changing very rapidly, and both the large and small scale changes
are important, then a logarithmic scale might work best for you. It is best suited to data with exponential growth, which appears as a straight line on a logarithmic scale.
Logarithms were introduced in Lesson 15.
The handout Logarithms and growth rates provides more details.
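The link between logarithms and exponential growth can be checked numerically. In the sketch below, y is a made-up quantity that doubles at every step; taking log10 turns it into a straight line whose slope gives the growth factor.

```matlab
% Hypothetical exponential data: starts at 3 and doubles each step
t = 0:5;
y = 3 * 2.^t;

logY = log10(y);            % exponential growth becomes linear in log10(y)
p = polyfit(t, logY, 1);    % slope is about log10(2)
growthFactor = 10^p(1);     % factor by which y grows per step, about 2
```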
Handouts:
Videos
Programming (vector logic and program control)
Skills:
 Use vector logic to count how many entries satisfy a condition.
 Use vector logic to extract rows and columns.
 Formulate complex logical conditions using AND (&), OR
(|) and NOT (~).
 Create a logical vector to represent conditions (e.g., given age,
find a logical vector representing patients between the ages of 45 and 65).
 Use logical vectors to extract rows and columns (e.g., given a vector
of patient genders 'male' and 'female' and an
array alarmUse of days x subjects, extract an array representing
male alarm use).
 Understand simple for loops and if-else statements.
What to review:
 Logical expressions: logical expressions produce vectors of 0's and 1's.
A logical vector has 1's corresponding to the entries in which the expression is
true and 0's in the other positions.
You will be asked to write various logical expressions given a word problem (such as those of HW5). You should
know how to use logical expressions for both counting and extraction.
Vector indexing was introduced in Lesson 10.
The vector indexing worksheet (HW5)
contains exercises on analyzing data using vector indexing.
Lab 4 and HW5 both use vector indexing.
Also see the in-class surveys.
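The counting and extraction patterns can be sketched as follows. All values are made up; the sketch assumes that "between 45 and 65" includes both endpoints and that gender is stored as a cell array of strings.

```matlab
% Hypothetical patient data for five subjects
age = [30 50 65 45 70];
gender = {'male', 'female', 'male', 'female', 'male'};
alarmUse = [1 2 3 4 5; 6 7 8 9 10];     % days x subjects

% Logical vector for a condition, then counting
isMiddle = (age >= 45) & (age <= 65);   % [0 1 1 1 0]
howMany = sum(isMiddle);                % number of patients aged 45 to 65 -> 3

% Logical vector used for extraction of columns
isMale = strcmp(gender, 'male');        % [1 0 1 0 1]
maleAlarmUse = alarmUse(:, isMale);     % columns for male subjects -> [1 3 5; 6 8 10]
```

Because subjects are the columns of alarmUse, the logical vector goes in the column position of the index.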
 Loops and selection allow us to change the path of execution through
scripts to take alternative actions.
You should be able to trace simple for loop and if-else
statements, showing the values that the variables take on as the statements
execute. Only the most basic program control will be on the final.
Program control was introduced in Lesson 13.
Most subsequent lessons used program control.
Some queries explored program control.
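A tracing exercise might look like the sketch below. The vector values and the variable names are hypothetical; the comments show the values the variables take on after each pass through the loop.

```matlab
% Hypothetical data for a tracing exercise
values = [3 -1 4 -2];
total = 0;          % running sum of the non-negative entries
numNegative = 0;    % running count of the negative entries

for k = 1:length(values)
    if values(k) >= 0
        total = total + values(k);
    else
        numNegative = numNegative + 1;
    end
end

% Trace after each pass (k = 1, 2, 3, 4):
%   total:       3, 3, 7, 7
%   numNegative: 0, 1, 1, 2
```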
Videos:
Extra credit
Up to 5 points will be added to your final exam score for answering
a survey and handing it in with your final. The survey can be found on Learn under Resources. An additional 6 points towards your final exam
score can be earned by successfully completing the Post Test, also found on Learn, under Quizzes. It can only be attempted once, and is timed.
This lecture summary was written by Kay A. Robbins of the
University of Texas at San Antonio and last modified by Dawn Roberson on 26 Apr 2016. Please contact
kay.robbins@utsa.edu with comments or
suggestions.