LESSON 8: Vector logic for extracting data

FOCUS QUESTION: How can I extract the rows and columns of an array based on data characteristics?


EXAMPLE 1: Load the consolidated sleep diary data

    load diaries.mat;  % Load the sleep diaries

Questions Answers
Where do the variables come from when this file is loaded? The file was created by saving variables from a MATLAB workspace using the save command. When you load this type of file, MATLAB recreates the saved variables along with the values.
What is the .MAT format? The .MAT format is a binary format that allows you to save an entire workspace or multiple variables, including complex structures in a single file.
What are the advantages of saving data in .MAT format? The .MAT format efficiently stores variables and allows you to resume working in a workspace that you previously created. Thus, you don't have to reprocess data to put it in the form you need.
What are the disadvantages of saving data in .MAT format? The .MAT format is proprietary, meaning that it belongs to MathWorks. Files stored in .MAT format are not recognized by most other applications. You cannot examine the contents of such a file using a text editor.
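
The save and load round trip described above can be sketched as follows. The variables and the temporary file name are hypothetical stand-ins, not the actual contents of diaries.mat.

```matlab
% Hypothetical variables standing in for the contents of a .MAT file
section = [1 2 3 3 2];                  % section number of each student
gender  = {'female', 'male', 'female', 'male', 'female'};

matFile = [tempname, '.mat'];           % temporary file name (assumption)
save(matFile, 'section', 'gender');     % write both variables to one file

clear section gender                    % remove them from the workspace
load(matFile);                          % recreates section and gender

saved = load(matFile);                  % alternative: load into a structure
names = fieldnames(saved);              % names of the variables in the file
delete(matFile);                        % clean up the temporary file
```

Loading into a structure, as in the last lines, is one way to see which variables a file contains without overwriting anything in the workspace.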


EXAMPLE 2: Calculate the number of students in section 3 ( == )

   sect3 = (section == 3);     % sect3 has 1's corresponding to section 3 students
   totalSect3 = sum(sect3);    % Add up the true's (1's) to find number of students
   fprintf('%g students in section 3\n', totalSect3);

Questions Answers
My Workspace Browser indicates that sect3 is a logical array. What does that mean? Logical array element values are either true or false.
Why are sect3's values displayed as 1 or 0 rather than true or false? MATLAB represents the logical values true and false by the 1 and 0, respectively. You can use either representation.
Why not just make sect3 be integer or double? Because sect3 is a logical array, you know that its values will only be 1 (true) or 0 (false) and not some other numerical value.
Can I do arithmetic on logical values? Yes, you can use logical values in arithmetic expressions. MATLAB just converts logical values to 1's and 0's before doing the calculation.

52 students in section 3
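
The points about logical arrays can be checked on a small vector. This is a minimal sketch; the six section numbers are made up for illustration.

```matlab
section = [1 3 2 3 3 1];         % hypothetical section numbers for six students
sect3 = (section == 3);          % logical vector: [0 1 0 1 1 0]
sect3IsLogical = islogical(sect3);   % true: comparisons return logical arrays
totalSect3 = sum(sect3);         % logicals are treated as 1's and 0's: 3
sumClass = class(totalSect3);    % 'double': the sum is an ordinary number
fprintf('%g students in section 3\n', totalSect3);
```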


EXAMPLE 3: Calculate the average toSleepMinutes of students in section 3 (indexing)

   minutesSect3 = toSleepMinutes(:, sect3);  % Pick out columns of section 3 students
   meanMinutes3 = mean(minutesSect3(:));              % Find overall mean
   fprintf('Average minutes to sleep for section 3 students = %g\n', ...
       meanMinutes3);

Questions Answers
What is the purpose of using sect3 as the column specifier of toSleepMinutes? This type of specifier allows you to select rows and columns based on a logical condition. MATLAB picks out the columns of toSleepMinutes corresponding to the positions where the specifier has 1's (true's).
What is the size of minutesSect3 and why? The toSleepMinutes array has 21 rows and 144 columns. The variable sect3 is a vector of length 144. (This variable could not be used as an index vector for toSleepMinutes unless the sizes matched.) Since sect3 has 52 ones corresponding to the 52 students in section 3, minutesSect3 will have 21 rows and 52 columns.

Average minutes to sleep for section 3 students = 17.5321
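
The effect of a logical column specifier on the size of the result is easier to see on a small array. The sketch below uses a made-up array of 2 days by 4 students rather than the real 21-by-144 data.

```matlab
% Hypothetical data: 2 days (rows) by 4 students (columns)
toSleepMinutes = [10 20 30 40;
                  15 25 35 45];
section = [3 1 3 2];                      % section of each student (column)

sect3 = (section == 3);                   % [1 0 1 0]
minutesSect3 = toSleepMinutes(:, sect3);  % keeps columns 1 and 3: [10 30; 15 35]
sizeSect3 = size(minutesSect3);           % [2 2]: all rows, two selected columns
meanMinutes3 = mean(minutesSect3(:));     % mean over all kept entries: 22.5
```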


EXAMPLE 4: Calculate the number of women in the cohort (use strcmp to compare strings)

   women = strcmp(gender, 'female');  % women has 1's in positions corresponding to females
   totalWomen = sum(women);           % Add up the trues (1's) to find number of women
   fprintf('%g women in the cohort\n', totalWomen);

Questions Answers
What does strcmp(s, A) do? The strcmp function creates a logical vector that is the same size as A. The result has 1's in the locations where A contains the string s. The variable s contains a single string, and the variable A is a cell array of strings.
Why is gender a cell array rather than an array of char? Cell array elements can be of different lengths. We will almost always use cell arrays to represent arrays of strings.
How can I distinguish a cell array from an ordinary array? Use braces ({ }) to designate cell arrays and square brackets ([ ]) to designate ordinary arrays.

74 women in the cohort
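
A short sketch of strcmp on a cell array of strings, and of the difference between braces and parentheses when indexing. The four entries are hypothetical; the last one is capitalized on purpose to show that strcmp is case sensitive.

```matlab
gender = {'female', 'male', 'female', 'Female'};  % hypothetical cell array
women = strcmp(gender, 'female');          % [1 0 1 0]: exact, case-sensitive match
womenAnyCase = strcmpi(gender, 'female');  % [1 0 1 1]: strcmpi ignores case
third = gender{3};                         % braces extract the contents: 'female'
thirdCell = gender(3);                     % parentheses return a 1-by-1 cell array
```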


EXAMPLE 5: Calculate the % of women in the cohort

   totalStudents = length(gender);  % gender has an entry for each student
   percentWomen = 100.*totalWomen./totalStudents;  % Percentage of students who are women
   fprintf('%g%% of the students in the cohort are women\n', percentWomen);

51.3889% of the students in the cohort are women


EXAMPLE 6: Calculate the number of women in section 3 (use &)

Create a new cell in which you type and execute:

   womenSect3 = women & sect3;        % 1's in positions of women in section 3
   totalWomen3 = sum(womenSect3);     % Add up the trues (1's)
   fprintf('%g women in section 3\n', totalWomen3);

30 women in section 3

Questions Answers
What does A & B mean?

The & symbol represents the logical element-wise AND operator. The result of A & B is an array of 0's and 1's that is the same size as the arrays A and B. The result has 1 in each entry where the corresponding elements of both A and B are non-zero, and 0 otherwise.

In the example, A represents the students who are female and B represents the students in section 3. The two conditions combine to give the females in section 3. In other words, womenSect3 designates the subjects corresponding to women in section 3.
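
A minimal sketch of element-wise AND on two short hypothetical vectors. The last line uses ordinary numbers to illustrate the "non-zero" wording above.

```matlab
women = logical([1 0 1 1 0]);     % hypothetical: students 1, 3 and 4 are women
sect3 = logical([1 1 0 1 0]);     % hypothetical: students 1, 2 and 4 are in section 3
womenSect3 = women & sect3;       % [1 0 0 1 0]: true only where both are true
totalWomen3 = sum(womenSect3);    % 2

numericAnd = [2 0 5] & [1 1 0];   % [1 0 0]: any non-zero value counts as true
```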


EXAMPLE 7: Calculate the number of students in section 2 or in section 3 (use |)

   sect2or3 = (section == 2) | (section == 3);  % sect2or3 has 1's for students in section 2 or 3
   total2or3 = sum(sect2or3);                   % Add up the trues (1's)
   fprintf('%g students in sections 2 and 3\n', total2or3);

Questions Answers
What does A | B mean?

The | symbol represents the logical element-wise OR operator. The result of A | B is an array of 0's and 1's that is the same size as the arrays A and B. The result has 1 in each entry where the corresponding elements of either A or B or both are non-zero, and 0 otherwise.

In the example, A represents the section 2 students and B represents the section 3 students. The two conditions combine to give the students who are either in section 2 or in section 3. Note: Sometimes common usage would ask for the students in "sections 2 and 3" to mean students in either section. Be careful to understand what the true logical meaning is.

98 students in sections 2 and 3
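
The warning about "and" versus "or" can be made concrete. In this sketch, with made-up section numbers, the OR condition counts the students in either section, while the literal AND condition selects nobody because no student is in two sections at once.

```matlab
section = [1 2 3 2 3 3];                       % hypothetical section numbers
sect2or3 = (section == 2) | (section == 3);    % [0 1 1 1 1 1]
total2or3 = sum(sect2or3);                     % 5 students in section 2 or 3

sect2and3 = (section == 2) & (section == 3);   % all 0's: never both at once
total2and3 = sum(sect2and3);                   % 0
```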


EXAMPLE 8: Calculate % of wakeups that used an alarm

   totalAlarms = sum(useAlarm(:));              % Add up the trues (1's)
   [numDays, numDiaries] = size(bedTimes);      % How many rows and columns?
   totalEntries = numDays*numDiaries;           % Total number of entries
   percentAlarm = 100*totalAlarms/totalEntries; % Percentage of total entries
   fprintf('%g%% of the wake-ups used an alarm\n', percentAlarm);

Questions Answers
Why use useAlarm(:) in the calculation of totalAlarms? We wanted to compute the total number of 1's in useAlarm. The colon operator (:) arranges the columns of useAlarm into a single column. The result of sum(useAlarm(:)) is a single value.
What is the result of sum(useAlarm)? The result is a row vector of 144 elements corresponding to the column sums of useAlarm, one for each diary.
How could I get the total of useAlarm without using the linear representation (:)? You could apply the sum function twice: sum(sum(useAlarm)). The inner sum creates a vector of column sums. The outer sum adds these column sums to find a single number.
Why was percentAlarm calculated using * and / instead of .* and ./? Since totalAlarms and totalEntries are just numbers rather than arrays, ordinary multiplication and division work.
Could I use .* and ./ in the calculation of percentAlarm? Yes. When the operands are just numbers (scalars) rather than arrays, .* and ./ give the same results as ordinary * and /. The operators differ only for array operands, where * and / perform matrix multiplication and matrix division.

66.2698% of the wake-ups used an alarm
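
The different ways of summing an array, and the equivalence of the scalar operators, can be compared on a small hypothetical array of 2 days by 3 diaries:

```matlab
useAlarm = logical([1 0 1;
                    1 1 0]);             % hypothetical alarm flags
columnSums = sum(useAlarm);              % [2 1 1]: one sum per column
totalLinear = sum(useAlarm(:));          % 4: sum of the single long column
totalTwice  = sum(sum(useAlarm));        % 4: sum of the column sums

[numDays, numDiaries] = size(useAlarm);  % 2 rows, 3 columns
totalEntries = numDays*numDiaries;       % 6
percentAlarm    = 100*totalLinear/totalEntries;    % scalar operands
percentAlarmDot = 100.*totalLinear./totalEntries;  % same value with .* and ./
```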


EXAMPLE 9: Calculate the number of wakeups that were 7:30 am or later (use >= )

   wakeupHours = (wakeTimes - floor(wakeTimes))*24; % Fraction of day converted to hours
   wakeGE730 = (wakeupHours >= 7.5);                % Which are >= 7:30 am?
   totalWakeGE730 = sum(wakeGE730(:));              % Number of wake-ups at or after 7:30 am
   fprintf('%g wake-ups are 7:30 am or later\n', totalWakeGE730);

1971 wake-ups are 7:30 am or later

Questions Answers
What is the floor function? The floor function rounds its operand down to the nearest integer, which for positive values such as wakeTimes simply throws away the fractional part. Since wakeTimes is an array, floor creates an array of integers that is the same size as wakeTimes.
Why multiply by 24 to compute wakeupHours? The expression wakeTimes - floor(wakeTimes) is an array containing the wake-up times in units of fraction of a day. Multiply by 24 to convert this expression to wake-up hour.
What does A >= B mean? The result of A >= B is an array of 0's and 1's that is the same size as the arrays A and B. The result has 1 in each entry where the corresponding element of A is greater than or equal to the corresponding element of B, and 0 otherwise. If B is a single number, as in the example, each element of A is compared with that number. Use A >= B to find the locations where the element of A is at least as large as the corresponding element of B.
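
The conversion from a date value to an hour of the day can be traced on three made-up values. The sketch assumes, as the answers above imply, that the whole part of each value counts days and the fractional part is the fraction of the day. The fractions were chosen so that the arithmetic is exact.

```matlab
% Hypothetical wake-up times: whole part = day, fraction = time of day
wakeTimes = [734000.25, 734001.3125, 734002.40625];  % 6:00, 7:30 and 9:45 am
wakeDays = floor(wakeTimes);                         % [734000 734001 734002]
wakeupHours = (wakeTimes - wakeDays)*24;             % [6 7.5 9.75]
wakeGE730 = (wakeupHours >= 7.5);                    % [0 1 1]
totalWakeGE730 = sum(wakeGE730(:));                  % 2

floorNegative = floor(-1.5);   % -2: floor rounds down, not toward zero
```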


EXAMPLE 10: Calculate % of wakeups between 7:30 am and 9:45 am (use &)

   wakeBetween = (7.5 <= wakeupHours) & (wakeupHours <= 9.75);  % & means both
   betweenPercent = 100*sum(wakeBetween(:))/totalEntries; % Percentage of total entries
   fprintf('%g%% of the wake-ups are between 7:30 am and 9:45 am\n', betweenPercent);

34.7222% of the wake-ups are between 7:30 am and 9:45 am

Questions Answers
What does A & B mean?

The & symbol represents the logical element-wise AND operator. The result of A & B is an array of 0's and 1's that is the same size as the arrays A and B. The result has 1 in each entry where the corresponding elements of both A and B are non-zero, and 0 otherwise.

In the example, A represents the wake-up times at or after 7:30 am and B represents the wake-up times at or before 9:45 am. The two conditions combine to give the wake-up times that are both at or after 7:30 am and at or before 9:45 am. In other words, wakeBetween designates the elements with wake-up times between 7:30 am and 9:45 am inclusive.

Why not just write 7.5 <= wakeupHours <= 9.75 to designate the wake up hours between 7:30 and 9:45 am?

Although this expression evaluates without an error, it does not give the correct result. MATLAB evaluates the comparisons from left to right, and each <= operator takes two numerical values and returns a logical value. For example, 3 <= 4 <= 2 is true: 3 <= 4 is true, MATLAB converts the true to a 1 for the second comparison, and the second comparison becomes 1 <= 2, which is true. In the same way, 7.5 <= wakeupHours produces only 0's and 1's, all of which are less than 9.75, so every element of the final result is true.
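
The difference between the chained comparison and the two-condition form shows up on a vector of three hypothetical wake-up hours, only one of which lies in the interval:

```matlab
wakeupHours = [6 8 11];                  % hypothetical wake-up hours

% Evaluated left to right as ((7.5 <= wakeupHours) <= 9.75)
firstStep = (7.5 <= wakeupHours);        % [0 1 1]
chained = (7.5 <= wakeupHours <= 9.75);  % [1 1 1]: 0 and 1 are both <= 9.75

% Two separate comparisons combined with &
wakeBetween = (7.5 <= wakeupHours) & (wakeupHours <= 9.75);  % [0 1 0]

scalarCase = (3 <= 4 <= 2);              % true, because (3 <= 4) is 1 and 1 <= 2
```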


EXAMPLE 11: Calculate % of wakeups that are after 7:30 am or don't use an alarm (use | and ~)

Create a new cell in which you type and execute:

   orWakeups = (wakeupHours > 7.5) | ~useAlarm;  % | either one or both
   orPercent = 100*sum(orWakeups(:))/totalEntries; % Percentage of total entries
   fprintf(['%g%% of the wake-ups are either after 7:30 am ', ...
            'or without an alarm\n'], orPercent);

Questions Answers
What does ~A mean?

The ~ symbol represents the logical element-wise NOT operator. The result of ~A is an array of 0's and 1's that is the same size as the array A. The result has 1 in each entry where the corresponding element A is 0, and 0 otherwise.

In the example, ~useAlarm designates the elements corresponding to wake-ups that did not use an alarm.

What does A | B mean?

The | symbol represents the logical element-wise OR operator. The result of A | B is an array of 0's and 1's that is the same size as the arrays A and B. The result has 1 in each entry where the corresponding elements of either A or B or both are non-zero, and 0 otherwise.

In the example, A represents the wake up times after 7:30 am and B represents the wake up times that did not use an alarm. The two conditions combine to give the wake-up times that are either after 7:30 am or did not use an alarm (or were both after 7:30 and did not use an alarm).

75.5622% of the wake-ups are either after 7:30 am or without an alarm
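
A small sketch of NOT combined with OR, using four hypothetical wake-ups that cover every combination of the two conditions. The last lines check a general fact about these operators: the wake-ups excluded by A | ~B are exactly those that satisfy ~A & B.

```matlab
late     = logical([1 1 0 0]);    % hypothetical: wake-up after 7:30 am
useAlarm = logical([1 0 1 0]);    % hypothetical: an alarm was used

noAlarm = ~useAlarm;              % [0 1 0 1]
orWakeups = late | ~useAlarm;     % [1 1 0 1]: late, or no alarm, or both

% The only excluded wake-up is the one that was not late and used an alarm
excluded = ~late & useAlarm;      % [0 0 1 0]
sameResult = isequal(~orWakeups, excluded);   % true
```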


EXAMPLE 12: Find the subjects with the earliest average wakeup

   averWakeup = mean(wakeupHours);     % Subject average wake up hour
   earliest = min(averWakeup);         % Earliest average wake up hour
   earliestSub = find(averWakeup == earliest); % Pick earliest subjects
   fprintf('Earliest average wakeup time: %g\nEarliest subject(s):', earliest);
   fprintf(' %g', earliestSub);        % Separate print in case more than 1
   fprintf('\n');                      % Start a new line

Questions Answers
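What good is find? The find function returns the positions of the non-zero (true) elements of its argument. It is useful when you need to know the actual positions of the items being selected, such as subject numbers, rather than a logical array.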

Earliest average wakeup time: 5.50243
Earliest subject(s): 91
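
The following sketch shows why the example prints the subjects in a separate fprintf: with ties, find returns more than one position. The four averages are made up, with a deliberate tie.

```matlab
averWakeup = [7.2 5.5 8.1 5.5];          % hypothetical per-subject averages
earliest = min(averWakeup);              % 5.5
isEarliest = (averWakeup == earliest);   % logical array: [0 1 0 1]
earliestSub = find(isEarliest);          % positions of the trues: [2 4]

[~, firstSub] = min(averWakeup);         % 2: second output of min gives only
                                         % the first position of the minimum
fprintf('Earliest subject(s):');
fprintf(' %g', earliestSub);             % format is reused for each element
fprintf('\n');
```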


EXAMPLE 13: Find number of bedtimes between 10:30 pm and 2:30 am (relative date)

   bed = (bedTimes - floor(wakeTimes))*24;   % Hours relative to 0:00 of wake-up day
   bedBetween = (-1.5 <= bed) & (bed <= 2.5);      % & means both are true
   percentBetween = 100*sum(bedBetween(:))/totalEntries;   % Percentage of total entries
   fprintf('%g%% of the bedtimes were between 10:30 pm and 2:30 am\n', ...
       percentBetween);

57.3743% of the bedtimes were between 10:30 pm and 2:30 am

Questions Answers
Are the values of bed always positive? No, a negative value represents a bedtime before midnight, measured in hours before 0:00 of the wake-up day.
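Why is bed calculated relative to the wakeTimes rather than bedTimes?

Measured from 0:00 of the wake-up day, the bedtimes between 10:30 pm and 2:30 am form a single interval from -1.5 to 2.5 hours, so one & condition selects them. If you used bedTimes - floor(bedTimes) instead of bedTimes - floor(wakeTimes), the values of bed would lie between 0 and 24 and the interval would wrap around midnight. The bedtimes between 10:30 pm and 2:30 am would then be expressed by the condition:

   bedBetween = (22.5 <= bed) | (bed <= 2.5)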
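
A sketch comparing the two reference points on three hypothetical nights. It assumes that the whole part of each value counts days and that a bedtime before midnight falls on the day before the wake-up. The fractions were chosen so that the arithmetic is exact.

```matlab
% Hypothetical bedtimes: 10:30 pm, 1:30 am and 9:00 pm
bedTimes  = [733999.9375, 734001.0625, 734001.875];
% Hypothetical wake-up times on the following morning of each night
wakeTimes = [734000.3125, 734001.3125, 734002.25];

% Relative to 0:00 of the wake-up day: one interval, one & condition
bed = (bedTimes - floor(wakeTimes))*24;          % [-1.5 1.5 -3]
bedBetween = (-1.5 <= bed) & (bed <= 2.5);       % [1 1 0]

% Relative to 0:00 of the bedtime day: the interval wraps around midnight
bedSameDay = (bedTimes - floor(bedTimes))*24;    % [22.5 1.5 21]
bedBetweenWrapped = (22.5 <= bedSameDay) | (bedSameDay <= 2.5);   % [1 1 0]

sameAnswer = isequal(bedBetween, bedBetweenWrapped);   % true for these values
```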


This lesson was written by Kay A. Robbins of the University of Texas at San Antonio and last modified on 31-Dec-2010. Please contact krobbins@cs.utsa.edu with comments or suggestions. The image is a photograph of a nocturnal instrument photographed by Michael Daly on 8/22/2009. The image is available on Wikipedia as http://en.wikipedia.org/wiki/Nocturnal_%28instrument%29.