
# CS 1173 Data Analysis and Visualization: Review for the final exam

## Objectives:

• Consolidate knowledge.
• Review for the final.

• Lessons 8-15
• Labs 3 and 4

## The general categories of topics are:

• Basic array manipulation (assembling matrices using comma notation and semi-colon notation, linear representation, transpose, repmat, reshape, sum, picking out rows and columns in various ways).
• Basic graph interpretations (line graphs, bar charts, and combinations)
• Discovering relationships between variables (correlation, scatter plots, linear fits with errors)
• Characterizing data distributions and uncertainty (basic statistical indicators, histograms, boxplots, and error bars).
• Populations and sampling (estimating population characteristics from sample characteristics, tests of hypotheses using ttest and ttest2).
• Programming (creation of variables, assignment, vector indexing, logical expressions, simple for loop, simple if-else).
You will not be asked to produce any code involving figures, so you do not need to know the options for functions such as boxplot.

Note: We have re-ordered the lessons, so a lesson number cited below may not match the current numbering. If a number doesn't make sense, ignore it and find the lesson by TOPIC.

## Basic array manipulation and MATLAB functions

Skills:
• Manipulate arrays by extracting, assembling or reshaping (e.g., given an array x, find x(:, 2:3)).
• Evaluate the standard MATLAB functions on arrays (e.g., given an array x, find reshape(x, 2, 3)).
• Know the syntax and use of standard MATLAB functions such as sum, max, min, mean, median, std, diff, and reshape (e.g., given an array X, write a MATLAB statement to find the median of the columns).
• Be able to answer a simple word problem (e.g., given an array disease with 3 columns [year, measles cases in that year, mumps cases in that year], find the total number of measles cases or the total number of cases of disease).
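
The skills above can be sketched with a short MATLAB example. The arrays x and disease below hold made-up values chosen only for illustration; they are not course data.

```matlab
% Hypothetical 2-by-3 array (illustration only).
x = [1 2 3; 4 5 6];              % spaces or commas separate columns, semicolons separate rows

cols23  = x(:, 2:3);             % all rows, columns 2 and 3    -> [2 3; 5 6]
fourth  = x(4);                  % linear (column-major) index  -> 5
xT      = x';                    % transpose                    -> [1 4; 2 5; 3 6]
r       = reshape(x, 3, 2);      % refill column by column      -> [1 5; 4 3; 2 6]
tiled   = repmat([1 2], 2, 2);   % tile a block 2-by-2          -> [1 2 1 2; 1 2 1 2]
colSums = sum(x);                % sum down each column         -> [5 7 9]
rowSums = sum(x, 2);             % sum across each row          -> [6; 15]
colMed  = median(x);             % median of each column        -> [2.5 3.5 4.5]

% Word problem: columns are [year, measles cases, mumps cases] (made-up values).
disease = [1931 10 4; 1932 20 6; 1933 30 8];
totalMeasles = sum(disease(:, 2));          % total measles cases      -> 60
totalCases   = sum(sum(disease(:, 2:3)));   % measles plus mumps cases -> 78
```

Note that sum and median work down the columns by default; pass 2 as the second argument to work across rows instead.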

## Basic interpretation of graphs

Skills:
• Correctly read graphs of various types.
• Make quantitative comparisons and observations.
• Understand when each type of graph is appropriate.
• Understand and read different graph combinations.
What to review:
• Line graph concepts: Line graphs plot the data values as points on an x-y graph with connecting lines between consecutive points. Line graphs are used to show a sequential relationship between data points, often a time relationship.

You should be able to:
• Read the axes and understand the units.
• Understand the difference between a line graph and a scatter plot.
• Correctly associate legends with lines.
• Read from multiple axes and graphs with insets.
• Correctly interpret line graphs and make quantitative assertions about their values and rates of change.
• Generate appropriate labels, titles and legends based on a description of the graph.
Line graphs were introduced in Lesson 2 and Lesson 3, and each has questions associated with it.
Labs 1 and 2 emphasized line graphs.
You can find examples of line graphs in the Line graph gallery
• Pie chart concepts: Pie charts show how a quantity is broken down by percentage into component pieces. You can only read percentages (or fractions) from a pie chart unless you have the overall total value. Pie charts make it easy to assess the relative size of the constituent pieces. You should be able to:
• Correctly interpret the percentages from a pie chart.
• Translate from percentages to quantities when the overall total is provided.
• Read from pie charts with insets (such as the energy pie charts in the Pie chart gallery).
• Understand the relationship between fractions of the whole and what a pie chart displays (e.g., HW3 for your hand calculation).
Pie charts were introduced in Lesson 3.
You can find examples of pie charts in the Pie chart gallery
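
Translating pie chart percentages into quantities is a one-line calculation once the overall total is known. The sketch below uses made-up percentages and a made-up total, purely as an illustration.

```matlab
% Hypothetical pie chart with four slices (illustration only).
percents = [40 30 20 10];    % slice sizes in percent; they add up to 100
total    = 2500;             % overall total, needed to recover quantities

quantities = percents * total / 100;          % -> [1000 750 500 250]
fractions  = quantities / sum(quantities);    % back to fractions of the whole
```

Without the value of total, only percents (or fractions) can be read from the chart.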
• Bar chart concepts: Bar charts display x-y data point relationships using vertical bars. The y-value is the height of the bar, and the x-value is the position of the bar. Bar charts are very versatile. The positions of the bars can designate a sequential relationship between the x-values of successive data points (e.g., the x-values correspond to time). However, the association between successive x-values is weaker visually. The x-values don't have to have a direct association (e.g., the x-values can be categories such as setosa, virginica, and versicolor). Side-by-side bar charts are good for comparing data with subgroups (e.g., results for men and women over different years). Stacked bar charts show not only the total y-value for each value of x, but they also show how the total breaks down into its components.

You should be able to:
• Correctly interpret quantities and where appropriate, fractions on bar charts
• Correctly interpret stacked and side-by-side bar charts.
• Correctly interpret bar charts representing amounts, percentages, and rates.
Bar charts were introduced in Lesson 4.
You can find examples of bar charts in the Bar chart gallery
• Plot combinations were introduced in Lesson 8.
You can find additional combinations of these graphs in the Multiple graph gallery.

## Relationships between variables: Correlation and linear models

Skills:
• Correctly interpret correlation between two variables.
• Perform a linear fit using MATLAB polyfit and polyval functions.
• Understand the idea of modeling and errors
What to review:

• Correlation: the correlation between vectors x and y is a numerical value between -1 and 1 indicating how closely x and y go up and down together. A correlation value close to 1 indicates that x and y follow each other closely, while a value close to -1 indicates that x and y move in opposite directions. A correlation value near 0 indicates that x and y have no linear relationship.

The square of the correlation between x and y equals the value of R² for the linear fit between x and y, which indicates the quality of that fit.

Correlation was introduced in Lesson 7.
Lab 3 and Lab 4 included calculations of correlation.
• Scatter plots: are plots of x-y points without connecting lines. Use scatterplots when there is no particular ordering of the data points. Scatterplots reveal relationships between data vectors. For example, when the points fall close to a straight line, you can effectively model the relationship between x and y by a linear equation: y = mx + b (a linear model). You can use such a linear equation to predict values of y associated with values of x that you didn't measure.

Scatter plots were introduced in Lesson 7.
Lab 3 included scatter plots.
You can find examples of scatterplots in the Scatter plots gallery
• Linear fits: given x and y, you should be able to interpret the return values of polyfit in terms of y = mx + b, estimate the growth rate from the linear fit, and calculate predicted values of y given x and the return values of polyfit. Linear fitting was introduced in Lesson 7.
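
The correlation and linear fit ideas in this section can be sketched as follows. The vectors x and y are made-up measurements, and the variable names are illustrative choices rather than anything prescribed by the course.

```matlab
% Made-up data that lies close to a straight line (illustration only).
x = [1 2 3 4 5];
y = [2.1 3.9 6.2 7.8 10.1];

R = corrcoef(x, y);        % 2-by-2 correlation matrix
r = R(1, 2);               % correlation between x and y (close to 1 here)

p = polyfit(x, y, 1);      % degree-1 fit: p(1) is the slope m, p(2) is the intercept b
m = p(1);                  % growth rate: change in y per unit of x (about 1.99)
b = p(2);                  % predicted y when x is 0 (about 0.05)

yPred = polyval(p, 6);     % predicted y at an unmeasured x, same as m*6 + b
yFit  = polyval(p, x);     % model values at the measured x
err   = y - yFit;          % errors (residuals) of the linear model

% R^2 of the fit, which matches r^2 for a straight-line fit.
R2 = 1 - sum(err.^2) / sum((y - mean(y)).^2);
```

The slope returned by polyfit is the estimated growth rate, and polyval simply evaluates y = mx + b at the requested x values.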

## Characterizing data distributions and descriptive statistics

Skills:
• Compute and interpret basic statistical indicators (e.g., max, min, mean, median, std, mad, and iqr).
• Be able to interpret distributions from graphical representations (histograms, box plots, and error bars).
What to review:
• Basic statistical indicators: Understand the difference between the mean and the median. Be able to translate word problems into computation of statistical indicators.

Example: The measles array contains monthly measles counts for NYC for the years 1931:1971. The rows of measles correspond to the years and the columns to the months.
• Find the average monthly case count of measles for each year.
• Find the overall average monthly case count for the entire data set.
Statistical indicators were introduced in Lesson 5.
Note: be sure you understand the difference between the standard deviation of the data itself (which divides by n) and the estimate of the population standard deviation from a sample (which divides by n - 1).
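
As a sketch of the measles example and the two standard deviations, the code below uses a tiny made-up array in place of the real measles data (which has one row per year and twelve columns).

```matlab
% Made-up 2-by-3 stand-in for measles (rows = years, columns = months).
measles = [10 20 30; 40 50 60];

yearlyAvg  = mean(measles, 2);    % average monthly count for each year  -> [20; 50]
overallAvg = mean(measles(:));    % average over the entire data set     -> 35

% Two versions of the standard deviation for the same made-up values.
x = [2 4 4 4 5 5 7 9];
sActual   = std(x, 1);   % divides by n: standard deviation of the data itself -> 2
sEstimate = std(x);      % divides by n - 1: estimate of the population value (about 2.14)
```

The second argument of mean selects the direction: 2 averages across each row, while the default averages down each column.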
• Histograms: are tables of how many times each value appears in a data set. When the data set has a lot of different values, the values are binned into subintervals and the counts of the number of points in each subinterval are given. The scaled histogram is an approximation to the probability distribution represented by the data. Thus, a histogram plot is usually the first step in determining how your data might be distributed. You should be able to interpret histograms given as frequency tables or bar graphs (e.g., given the histogram of disease, how many cases of each type are there). You should also be able to compare distributions by comparing their histograms. Histograms were introduced in Lesson 9.
You can find examples of histograms in the Histograms gallery
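
A histogram as a frequency table can be sketched without any plotting. The data below are made-up die rolls, and the loop-based counting is just one simple way to build the table.

```matlab
% Made-up data: ten rolls of a six-sided die (illustration only).
rolls  = [1 3 3 2 5 3 1 6 2 3];
values = 1:6;                          % the possible values (one bin per value)

counts = zeros(size(values));
for k = 1:length(values)
    counts(k) = sum(rolls == values(k));   % how many times values(k) appears
end
% counts -> [2 2 4 0 1 1]

scaled = counts / sum(counts);         % scaled histogram: fractions that add up to 1
```

Reading the table: the value 3 appears 4 times, so it accounts for 0.4 of the data.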
• Boxplots: display the distribution of data using a box with different features to mark different points such as the median and the interquartile range. Boxplots give a good representation of data sets with outliers. Side-by-side boxplots are useful for comparing the distributions of different data sets.

You should be able to describe the meaning of the various features of the box plot (e.g., given a box plot, what is the IQR, median, fence, etc.). You should also be able to compare two distributions by comparing their respective boxplots.

Boxplots were introduced in Lesson 12.
The Box plots handout gives a diagram of the basic features of the box plot.

You can find examples of box plots in the Box plots gallery
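
The numbers behind a box plot can be sketched as below. Two assumptions are made here: the prctile function (Statistics and Machine Learning Toolbox) is available, and the fences sit 1.5 times the IQR beyond the quartiles, which is a common convention and the default whisker length of MATLAB's boxplot. Check the Box plots handout for the definition used in the course.

```matlab
% Made-up data with one unusually large value (illustration only).
data = [2 4 5 7 8 9 30];

med = median(data);             % line inside the box -> 7
q1  = prctile(data, 25);        % bottom edge of the box (first quartile)
q3  = prctile(data, 75);        % top edge of the box (third quartile)
iqrValue = q3 - q1;             % interquartile range: the height of the box

% Assumed fence rule: 1.5 * IQR beyond the quartiles.
lowerFence = q1 - 1.5 * iqrValue;
upperFence = q3 + 1.5 * iqrValue;

% Points outside the fences are drawn individually as outliers.
outliers = data(data < lowerFence | data > upperFence);   % -> 30
```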
• Error bar graphs: display data as a central point with a range. Error bar graphs can be used to depict spread: for example by using the mean as the central point and the standard deviation for the range.

Error bar graphs can also be used to depict uncertainty. For example, suppose you want to use a sample to estimate the population mean. The center point is the sample mean (which estimates the population mean). The range can be the standard error of the mean (SEM) or the 95% confidence interval for the estimate of the population mean based on the SEM. The SEM estimates the standard deviation of the sample mean, that is, how much the estimate of the population mean would vary from sample to sample.

Error bars were introduced in Lesson 6.
Lab 2 used error bars
You can find examples of error bars in the Error bar gallery
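
The quantities shown by error bars can be computed as in the sketch below. The sample is made up, and the 95% interval uses the factor 1.96 from the normal approximation, which is an assumption here; for small samples a t-based factor would be somewhat larger.

```matlab
% Made-up sample of five measurements (illustration only).
sample = [4 8 6 5 7];
n = length(sample);

m   = mean(sample);        % center point of the error bar -> 6
s   = std(sample);         % spread of the data (divides by n - 1)
sem = s / sqrt(n);         % standard error of the mean: uncertainty of m

spreadBar = [m - s, m + s];                   % error bar depicting spread
ci95      = [m - 1.96*sem, m + 1.96*sem];     % approximate 95% confidence interval (assumed factor)
```

The SEM is always smaller than the standard deviation when n is greater than 1, so error bars showing uncertainty of the mean are narrower than error bars showing spread.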

## Hypothesis testing, normality, uncertainty, sampling and logarithms

Skills:
• Understand the concept of sampling to estimate population characteristics.
• Distinguish between sample and population indicators (e.g., estimate the population mean or standard deviation given a sample; what does the SEM represent?)
• Formulate null and alternative hypotheses for a problem.
• Test hypotheses using the two-sample t-test (ttest2).
What to review:
• Populations and samples: the notion of measuring characteristics of a sample and extrapolating to a population underpins much of science. You should understand that statistical tests rely on assumptions that are never exactly fulfilled in practice. The most common assumptions are that the sample is drawn at random from the population and that the population has a particular distribution (usually normal).

You should be able to argue (based on common sense) how well a sample reflects the population as a whole. You should also understand how the characteristics of a sample can be used to estimate population characteristics.

The handout Populations and samples explains how characteristics of a population can be estimated from samples

• The two-sample t-test is a statistical test performed with the MATLAB ttest2 function. This test indicates whether the true means of two populations are likely to be different based on the evidence of a sample drawn from each of the respective populations. Investigations looking for differences in the means of two populations appear often in medical investigations (e.g., differences between treated and control groups).

The ttest2 function returns three values of interest: h, p, and ci.

The value of h indicates whether the null hypothesis should be rejected in favor of the alternative hypothesis (e.g., if h is 1, reject: the means are likely to be different).

The p represents the p-value, which indicates how likely it would be to observe a difference at least as large as the one in the samples if the true means were actually the same. (A small p-value supports rejection of the null hypothesis in favor of the alternative.)

The ci represents the confidence interval for the difference of the true population means. If the ci does not include zero, the true means are likely to be different. The farther away from 0 the ci is, the stronger the evidence that the means are different.

You should be able to apply ttest2 to a problem, formulating the null and alternative hypotheses and explaining the results. For example, given a question such as "do men sleep longer than women" and a sample from each population, you should be able to formulate a t-test and interpret the results. You should also be able to interpret a confidence interval and a p-value.

The ttest2 was introduced in Lesson 11.
Lab 4 uses ttest2.
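
A minimal sketch of applying ttest2 to the sleep question is shown below. The sleep durations are made up, and the call assumes the Statistics and Machine Learning Toolbox is installed. With no extra arguments, ttest2 runs a two-sided test at the 5% significance level, so it tests whether the means differ rather than whether one is larger.

```matlab
% Made-up nightly sleep durations in hours (illustration only).
men   = [7.1 6.8 7.5 8.0 6.9 7.3];
women = [7.9 8.2 7.6 8.4 8.1 7.8];

% Null hypothesis: men and women have the same true mean sleep duration.
% Alternative hypothesis: the true means are different (two-sided).
[h, p, ci] = ttest2(men, women);

rejectNull     = (h == 1);                     % h is 1 when p is below 0.05
ciExcludesZero = (ci(1) > 0) || (ci(2) < 0);   % ci is for mean(men) - mean(women)
```

For this made-up data the test rejects the null hypothesis, and the confidence interval lies entirely below zero because the men's sample mean is smaller than the women's.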
• Logarithmic scales provide an alternative to linear scales. If your data changes very rapidly, and both the large-scale and small-scale changes are important, a logarithmic scale might work best. Logarithmic scales are best suited to exponential growth, which appears as a straight line on a logarithmic axis.

Logarithms were introduced in Lesson 15.

The handout Logarithms and growth rates provides more details.
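
The link between logarithms and exponential growth can be sketched numerically. The data below are made up to double at every time step; the variable names are illustrative.

```matlab
% Made-up quantity that doubles at every time step (illustration only).
t = 0:5;
y = 3 * 2.^t;              % -> [3 6 12 24 48 96]

linearSteps = diff(y);     % on a linear scale the steps keep growing -> [3 6 12 24 48]
logY        = log2(y);     % on a log (base 2) scale ...
logSteps    = diff(logY);  % ... every step is the same size (1 doubling per step)

p = polyfit(t, logY, 1);   % straight-line fit to the logged data
doublingsPerStep = p(1);   % slope of the fit: close to 1
```

Because the logged data fall on a straight line, the slope of the linear fit gives the growth rate directly.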

## Programming (vector logic and program control)

Skills:
• Use vector logic to count how many entries satisfy a condition.
• Use vector logic to extract rows and columns.
• Formulate complex logical conditions using AND (&), OR (|) and NOT (~).
• Create a logical vector to represent conditions (e.g., given age, find a logical vector representing patients between the ages of 45 and 65).
• Use logical vectors to extract rows and columns (e.g., given a vector of patient genders 'male' and 'female' and an array alarmUse of days x subjects, extract an array representing male alarm use).
• Understand simple for loops and if-else statements.
What to review:
• Logical expressions: logical expressions produce vectors of 0's and 1's. A logical vector has 1's (true) in the positions where the expression is true and 0's (false) in the other positions.

You will be asked to write various logical expressions given a word problem (such as those of HW5). You should know how to use logical expressions for both counting and extraction.

Vector indexing was introduced in Lesson 10.
The vector indexing worksheet (HW5) contains exercises on analyzing data using vector indexing.
Lab 4 and HW5 both use vector indexing.
Also see the in-class surveys.
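
Counting and extraction with logical vectors can be sketched as below. All values are made up, and storing the genders as a cell array of strings is an assumption about how the data is represented.

```matlab
% Made-up patient ages (illustration only).
age = [30 45 52 70 65 44];

inRange     = (age >= 45) & (age <= 65);   % logical vector -> [0 1 1 0 1 0]
howMany     = sum(inRange);                % counting: number of 1's -> 3
agesInRange = age(inRange);                % extraction -> [45 52 65]
outside     = (age < 45) | (age > 65);     % OR: same result as ~inRange

% Made-up alarm use: rows are days, columns are subjects.
gender   = {'male', 'female', 'male'};     % assumed cell array, one entry per subject
alarmUse = [1 2 3; 4 5 6];

isMale       = strcmp(gender, 'male');     % -> [1 0 1]
maleAlarmUse = alarmUse(:, isMale);        % keep all days, male subjects only -> [1 3; 4 6]
```

The logical vector goes in the column position here because subjects are stored in columns; if subjects were rows, it would go in the row position instead.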

• Loops and selection allow us to change the path of execution through scripts to take alternative actions.

You should be able to trace simple for loop and if-else statements, showing the values that the variables take on as the statements execute. Only the most basic program control will be on the final.

Program control was introduced in Lesson 13.
Most subsequent lessons used program control.
Some queries explored program control.
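
Tracing a simple for loop with an if-else can be practiced on a sketch like the one below. The vector and variable names are made up for illustration.

```matlab
% Made-up values to loop over (illustration only).
values   = [3 8 1 6];
countBig = 0;      % how many values are greater than 5
smallSum = 0;      % running total of the remaining values

for k = 1:length(values)
    if values(k) > 5
        countBig = countBig + 1;
    else
        smallSum = smallSum + values(k);
    end
end

% Trace (value of each variable after every pass through the loop):
%   k = 1, values(k) = 3: countBig = 0, smallSum = 3
%   k = 2, values(k) = 8: countBig = 1, smallSum = 3
%   k = 3, values(k) = 1: countBig = 1, smallSum = 4
%   k = 4, values(k) = 6: countBig = 2, smallSum = 4
```

The same results can be obtained with vector logic: sum(values > 5) gives the count and sum(values(values <= 5)) gives the total.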


## Extra credit

Up to 5 points will be added to your final exam score for answering a survey and handing it in with your final. The survey can be found on Learn under Resources. An additional 6 points towards your final exam score can be earned by successfully completing the Post Test, also found on Learn, under Quizzes. It can only be attempted once, and is timed.

This lecture summary was written by Kay A. Robbins of the University of Texas at San Antonio and last modified by Dawn Roberson on 26 Apr 2016. Please contact kay.robbins@utsa.edu with comments or suggestions.