DAUER Example of gene expression profile analysis
Contents
Introduction
It is known that gene expression in eukaryotes is regulated by transcription factor (TF) through binding to a short piece of DNA in the upstream region. With the emerging of large-scale gene expression and genomic sequence studies, one could identify by which transcription factors a certain gene is regulated and predict how the gene will act subjected to change of environment, based on the presence of TF binding sites in its upstream sequence. With the gene expression data, genes can be clustered on the basis of the similarity of their expression profiles and these clusters are likely to contain genes that are regulated by the same transcription factors. Searches for cis-regulatory elements can then be undertaken in the upstream regions of the clustered genes. In this example we identify the cis-regulatory elements present in the genes responsible to the dauer exit process in Caenorhabditis elegans. C. elegans is a small soil nematode found in temperate regions. The dauer is an important developmental transition in C. elegans that exhibits increased longevity, stress resistance, altered metabolism compared with normal worms. The transition from the dauer state to the non-dauer state is performed here by detecting significant change of expression profiles of genes between the two states. The temporal expression of genes are measured by microarray at 10 to 12 time points every one or two hours. In detail we have the expression profile of 1984 genes within 12 hours, sampled approximately once an hour. The data set can be obtained from Stanford Microarray Database:
web('http://cmgm.stanford.edu/~kimlab/dauer/ExtraData.htm');
You can load the data directly into the MATLAB workspace
load celegansdata.mat
You can verify the real number of genes in the data set. The cell array ‘genes’ contains the name of all genes.
numel(genes)
ans = 1984
The information of one gene can be found by accessing the WormBase. Here you can also find the 1000bp upstream sequence corresponding to 1206 differentially expressed genes.
web('http://www.wormbase.org');
Let us analyze for example the gene F38E11.2 corresponding to the Heat Shock Protein.
genes{731}
web('http://www.wormbase.org/db/gene/gene?name=WBGene00002013;class=Gene');
ans = F38E11.2
A plot shows the expression profile for this ORF
figure(1) hold on for i = 1:4 plot(dauertime(dauerRep==i), degene(731, dauerRep==i)) end hold off xlabel('Time (Hours)') ylabel('Log2 Relative Expression Level');
The gene associated with this ORF, hsp-12.6, appears to be up-regulated during the dauer exit.
Missing data
As frequently happens in this type of analysis, the original matrices used in that type of experiments have various entries with missing values. It is due to error of measurement or in the construction of the array. A possible approach to deal with this problem is to find the missing values using the MATLAB function isnan and than to replace those with the row medians.
missingVals = isnan(degene); rowMedians = nanmedian(degene,2); rowMed = repmat(rowMedians,1,size(degene,2)); degene(missingVals) = rowMed(missingVals);
Cluster analysis
The following in such analysis is the clustering of genes with respect to their temporal expression profiles. We decide to use the K-means clustering algorithm. As distance measure between the data points we consider d=1-corr, where corr indicates the sample correlation between two data points. The number of clusters must be set before executing the algorithm. To get the proper number of clusters we can perform hierarchical clustering and then visually inspect the heatmap. The MATLAB function clustergram perform hierarchical clustering and generate a heat map and dendrogram of the data. Here we adopt the simplest form of clustergram that clusters the rows of a data set using correlation as the distance metric and average linkage.
totalcluster = 10; clustergram(degene,'colormap',jet,'symmetricrange',false,... 'dendrogram',{'color',totalcluster});
From the graphical display, six major groups can be seen. Based on this observation, we perform the K-means clustering by setting K = 6.
totalcluster = 6; [cidx, ctrs] = kmeans(degene, totalcluster, 'dist','corr', 'rep',5,... 'disp','final');
34 iterations, total sum of distances = 409.21 21 iterations, total sum of distances = 409.21 37 iterations, total sum of distances = 409.123 20 iterations, total sum of distances = 412.756 19 iterations, total sum of distances = 409.391
Here in each plot the gene profiles in the same cluster are plotted together.
figure for c = 1:totalcluster subplot(2,3,c); hold on for i = 1:4 plot(dauertime(dauerRep==i),degene((cidx == c),dauerRep == i)'); end hold off axis tight end suptitle('K-Means Clustering of Profiles');
Then only the centroid profiles are plotted for each cluster so that the expression pattern can be examined more clearly.
figure for c = 1:totalcluster subplot(2,3,c); hold on for i = 1:4 plot(dauertime(dauerRep==i),ctrs(c, dauerRep==i)'); end hold off axis tight end suptitle('K-Means Clustering of Centroid Profiles');
Cluster 6 represents the genes that are up-regulated at late stage during the process. Genes in cluster 3 have sharp increase at the beginning but slowly decrease toward the end of the process. Cluster 1 shows the pattern of slightly increase at the beginning followed by gradual drop down. Cluster 2 contains genes with steady increase during the whole process. Cluster 4 shows sharp decrease at the beginning, followed by slowly decrease. Genes in Cluster 5 increase sharply at the beginning, and then reach the plateau.
The motifs of genes with similar expression pattern are then searched within each cluster. The Expectation Maximization algorithm has been used for that aim.
addpath ./motif/; [names, seqs, priors] = load_seqs('seq1984.fasta', 1); shortnames = strtok(names, ' ('); shortcidx = []; for i = 1:length(shortnames) x = find(strcmp(genes, shortnames{i})); if length(x)~=0, shortcidx(i) = cidx(x); end; end for c = 1:totalcluster subseqs = seqs((shortcidx==c),:); winsize = 10; bg_model = build_uniform_motif_model(1); motif_model = build_uniform_motif_model(winsize); pcounts = repmat(0.25, 4, winsize); num_motifs = 5; [final_matrices, final_bgs] = em(subseqs, priors, bg_model, motif_model, ... pcounts, num_motifs, 1); end
Motif 1: ==> TTTTTTTTTC Motif 2: ==> AAAAAAAAAA Motif 3: ==> CCGCCTCCCC Motif 4: ==> GAAAAATTGG Motif 5: ==> TTCATTTTTC Motif 1: ==> TTTTTTTTTC Motif 2: ==> GAAAAAAAAA Motif 3: ==> CCTTTTTTCC Motif 4: ==> GAAAAAAAAA Motif 5: ==> CAATTTTTTG Motif 1: ==> TTTTTTTTTC Motif 2: ==> CAAAAAAAAA Motif 3:==> CCCCTCCCCC Motif 4: ==> GAAAAATTGG Motif 5: ==> TTTTTTTTTT Motif 1: ==> TTTTTTTTTT Motif 2: ==> GAAAAAAAAA Motif 3: ==> CCCCCCCCCC Motif 4: ==> GAAAAATTGG Motif 5: ==> CAATTTTTTC Motif 1: ==> TTTTTTTTTC Motif 2: ==> AAAAAAAAAA Motif 3: ==> CTTCCCCCCC Motif 4: ==> GAAAAATTTG Motif 5: ==> TTTTTTTTTC Motif 1: ==> TTTTTTTTTC Motif 2: ==> AAAAAAAAAA Motif 3: ==> CCCCCCCCCC Motif 4: ==> AAAAATTTTT Motif 5: ==> GAAAAAGAAA
Some of the motifs are present in almost all clusters, for example, TTTTTTTTTC and AAAAAAAAAA, and some motifs are present multiple times in the same cluster. These may due to the redundant sequences in the genome. They are often found out first probably because their frequent appearance in the genome. Therefore, the redundant sequence should be masked for the subsequent search. The rest are unique to the cluster.
Principal Component Analysis
Timing of up or down regulation of genes accurately is important to the developmental process. Principal component analysis is carried out to examine the temporal variation.
[pc, zscores, pcvars] = princomp(degene); pcvars./sum(pcvars) * 100; cumsum(pcvars./sum(pcvars) * 100)
ans = 47.2234 72.4687 81.7361 85.7149 88.2771 90.1579 91.3838 92.3931 93.2571 94.0212 94.6358 95.1224 95.5685 95.9585 96.3131 96.6203 96.8939 97.1436 97.3867 97.6112 97.8225 98.0243 98.2035 98.3699 98.5274 98.6640 98.7981 98.9248 99.0394 99.1495 99.2499 99.3447 99.4315 99.5127 99.5905 99.6659 99.7359 99.7982 99.8570 99.9074 99.9558 100.0000
The first two components take into account more than 70% of the total variation. The MATLAB function gscatter creates a grouped scatter plot, where the group is created by clusterdata function. The genes fo each group are identified also by names.
figure pcclusters = clusterdata(zscores(:,1:2),'maxclust',6,'linkage','av'); gscatter(zscores(:,1),zscores(:,2),pcclusters) xlabel('First Principal Component'); ylabel('Second Principal Component'); title('Principal Component Scatter Plot with Colored Clusters');
The first principal component represents variation approximately along the slope of increase, while the second is rather flat probably representing the variation of intercepts
References
J. Wang, S. K. Kim. Global Analysis of Gene Expression in the Dauer Larvae of Caenorhabditis elegans. Development 130:
1621-1634, 2003