HW3.1 Gene Expression Data Clustering and Gene Ontology Analysis


Part 1: load data into matlab and basic visulization

% read numerical data. skip first row and first column
rawData = dlmread('cellcycle.txt', '\t', 1, 1);

% reads first column of each row as string, skip rest of data from each row.
geneNames = textread('cellcycle.txt', '%s%*[^\n]', 'headerlines', 1);

[r, c] = size(rawData);

disp(['size of the data set is: ' int2str(r) ' x ' int2str(c)]);

% boxplot of raw and normalized data

set(gca, 'xtick', 5:5:c, 'xticklabel', 5:5:c);
xlabel('sample ID');
ylabel('Expression level');
title('Raw data');

% quantile normalize data
qNormData = quantilenorm(rawData);
% This line below normalizes each gene (row) to have mean = 0 and std = 1.
% This is needed for cell cycle gene analysis because we care more about
% the cyclic trend than the magnitude of  the change
normData = zscore(qNormData, 0, 2);

set(gca, 'xtick', 5:5:c, 'xticklabel', 5:5:c);
xlabel('sample ID');
ylabel('Expression level');
title('Normalized data');

% heatmap of normalized data.
size of the data set is: 3000 x 73

Part II: data clustering and visulization of clusters

% this clusters the genes into 10 clusters. Remember kmeans is like an EM algorithm and uses random
% starting points as cluster centers. Repeat 20 times with different random starting points to have
% a better chance of finding the optimal solution.
% idx - cluster ID for each gene.
% center: average expression for each gene cluster.

nClusters = 10;
[idx, center] = kmeans(normData, nClusters, 'Replicates', 20);

% convert idx to a cell structure used by showRowClusters
clust = cell(1, nClusters);
for i = 1:nClusters
    clust{i} = find(idx' == i);

% display the clusters (heat map) in Figure 4
showRowClusters(clust, normData, 4);

% display the average expression level of each cluster in figure 5.

for i = 1:nClusters
    subplot(ceil(nClusters/2), 2, i);
    plot(center(i, :));
    title(['cluster id: ', int2str(i)]);

% based on figure 4, pick out 4 clusters that shows the best cyclic pattern
% the cluster indices you see are usually different from the numbers below.
% uncomment the statement below and change to the correct numbers

%cyclic = [1 5 9 10];

% create a figure to visualize and compare the 4 selected clusters in more detail.


timePoints = 5:40;
plot(center(cyclic, timePoints)');
legend([repmat('clustID = ', 4, 1) int2str(cyclic')])

Part III. Get genes in different clusters and perform Gene Ontology analysis.

% reorder the clusters in the order they peaked their expression
% yours might be different

phases = [2 3 1 4];

for i = 1:4
    fileName = strcat('phase', int2str(i), 'genes.txt');
    genes = geneNames(clust{cyclic(phases(i))});
    dlmwrite(fileName, char(genes), 'delimiter', '', 'newline', 'pc');