EVENINGELEMENT Example of identification of regulatory sequences
Contents
Introduction
Arabidopsis thaliana (also known as mouse-ear cress) is a small, weedy organism that has become the model genetic and genomic study system for plants. Its genome is approximately 120 Mb long, with five chromosomes and 29K genes. In this example we study in detail the circadian clock (i.e. the 24-hour cycle of the physiological processes) of this plant. It is known that in plants (but also in humans and other animals) the circadian clock synchronize itself with the external day-night cycle. In particular in Arabidopsis each cell keeps track of the day-night cycle indipendently of all other cells. This is due to the activity of a few key proteins that allow the plants to turn on and off large groups of genes needed at different times of day and night. They do so by binding to specific regulatory sequences called trascription factor binding sites. Regulatory DNA is the sequence surrounding a gene that specify proper trascription. It is a mosaic of short sequence motifs and semirandom DNA. These short motifs are usually found upstream of coding regions but they can also be found downstream. It is extremely difficult to identify these due to their short length, to a certain pecentage of variability in the sequence and because one does not know the motifs and much less the location. In this example we investigate some algorithms to find regulatory sequences for clock-regulated genes in Arabidopsis activated in the evening. Here we make the simplifying assumption that the motifs we are looking for is identical in every istance where it is found.
Load data in the MATLAB workspace
Therefore we can consider only the clock-regulated genes. They are 437 (6% of the genes on the chip). They have been clustered in three groups according to the temporal axis in which the experiment was done: the different time intervals are labeled as 0, 4, 8, 12, 16 and 20 and correspond to every 4 hours in a a 24 hour day/night cycle starting from 8 AM. Cluster 1 correspond to genes whose expression peaks in phases 0 and 4, cluster 2 to phases 8, 12 and cluster 3 to phases 16, 20. You can load the associated sequence directly in the MATLAB workspace, distinguishing between the clusters, and concatenate them. The data from which we look for binding sites are those genes of cluster 2 (191), that is genes highly expressed in the evening. The genes of the others clusters are used as background.
load clockregulated.mat seq1=''; seq2=''; for ind = 1:191 % cluster2 seq1=[seq1 '*' up437_set2(ind).Seq{:}]; end sr1=seqrcomplement(seq1); seq1=[seq1 '*' sr1]; for ind = 1:246 %cluster 1 and 3 seq2=[seq2 '*' up437_others(ind).Seq{:}]; end sr2=seqrcomplement(seq2); seq2=[seq2 '*' sr2];
Identifying motifs
For simplicity we decide to look only at motifs of fixed length (9): we consider all word of length 9 whose frequency in the evening cluster is very different from its frequency in the rest of the data. Therefore we examine all the words of length 9 in both sequences seq1 and seq2 (considering also the reverse complement). Motifs found are scored and sorted in descending order by margin (the difference between their frequency in cluster 2 and that in cluster 1-3). The top 10 9-mers are computed and shown.
L=9; %Motif length numMotifs=10*2; %10 highest scores elements and their reverse complements [nmers,buckets,buc1,buc2] = diff_nmers_sign(seq1,seq2,L,numMotifs)
nmers = AAAAAAAAA TTTTTTTTT AAAATATCT AGATATTTT CTCTCTCTC GAGAGAGAG AGAGAGAGA TCTCTCTCT AAAAAAAAC GTTTTTTTT AAATATCTT AAGATATTT AAAAATATC GATATTTTT AAATAAAAT ATTTTATTT AAAATATAA TTATATTTT TAAAAAAAA TTTTTTTTA buckets = 1.0e-003 * -0.2239 -0.2239 0.1439 0.1439 0.1140 0.1140 0.1074 0.1074 -0.0726 -0.0726 0.0713 0.0713 0.0695 0.0695 0.0656 0.0656 -0.0649 -0.0649 -0.0606 -0.0606 buc1 = 0.0012 0.0012 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0002 0.0002 0.0001 0.0001 0.0002 0.0002 buc2 = 0.0014 0.0014 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0002 0.0002 0.0000 0.0000 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0002 0.0002
The obtained set of motifs contains a lot of repeats (either of single letters or of 2-mers). They likely have no biological significance and they must be filtered out.
[anmers,abuckets,abuc1,abuc2]=repeat_filter(nmers,buckets,buc1,buc2)
anmers = AAAATATCT AGATATTTT AAATATCTT AAGATATTT AAAAATATC GATATTTTT AAATAAAAT ATTTTATTT AAAATATAA TTATATTTT abuckets = 1.0e-003 * 0.1439 0.1439 0.0713 0.0713 0.0695 0.0695 0.0656 0.0656 -0.0649 -0.0649 abuc1 = 1.0e-003 * 0.1988 0.1988 0.1098 0.1098 0.1203 0.1203 0.2118 0.2118 0.0549 0.0549 abuc2 = 1.0e-003 * 0.0548 0.0548 0.0386 0.0386 0.0508 0.0508 0.1462 0.1462 0.1198 0.1198
After eliminating the repeating element, we can observe that the most significant element is the motif AAAATATCT. We known from the study of Harmer et al. that it corresponds to the evening element (word of 9 bases found upstream of genes turned on un the evening). Its margin is 0.00014. We notice that 2 of the other 3 top motifs are simply variants of the evening element (AAATATCTT and AAAAATATC). To assess the significance of the value found for the margin of the evening element we perform 100 random splits of the data and measure the margin of the highest-scoring element.
num_trial=100; margin=zeros(1,num_trial); for i=1:num_trial seq1r=''; seq2r=''; indran=randperm(437); for ind = 1:191 % cluster2 seq1r=[seq1r '*' up437(indran(ind)).Seq{:}]; end sr1=seqrcomplement(seq1r); seq1r=[seq1r '*' sr1]; for ind = 192:437 % cluster2 seq2r=[seq2r '*' up437(indran(ind)).Seq{:}]; end sr2=seqrcomplement(seq2r); seq2r=[seq2r '*' sr2]; [nmers,b,b1,b2] = diff_nmers_sign(seq1r,seq2r,L,numMotifs); [anmers,ab,ab1,ab2]=repeat_filter(nmers,b,b1,b2); if ab margin(i)=ab(1); end end find(margin>abuckets(1))
ans = Empty matrix: 1-by-0
In 100 trials we never observe a margin larger than 0.000147462.
We can look in detail at the frequency of the evening element among all the clock regulated genes. The arrays EEcount and Ngenes show that not all the genes of the second cluster have the evening element, nor this motif is limited only to these genes.
CircadianTime=0:4:20; EEcount=zeros(length(CircadianTime),1); Ngenes=zeros(length(CircadianTime),1); for ind=1:437 [indices]= strfind(up437(ind).Seq{:}, 'AAAATATCT'); [indices1]= strfind(seqrcomplement(up437(ind).Seq{:}), 'AAAATATCT'); ind1=(up437(ind).clusNum)/4+1; Ngenes(ind1)=Ngenes(ind1)+1; EEcount(ind1)=EEcount(ind1)+length(indices)+length(indices1); end Ngenes EEcount
Ngenes = 78 45 124 67 30 93 EEcount = 5 6 49 27 8 8
References
S. L. Harmer, J. B. Hogenesch, M. Straume, C. H.S. Chang, B. Han, Z.Tong X. Wang, J. A. Kreps, and S. A. Kay, Orchestrated Transcription of Key Pathways in Arabidopsis by the Circadian Clock, Science, Vol. 290. no. 5499, pp. 2110 – 2113, Dec 2000.