EVENINGELEMENT Example of identification of regulatory sequences

Introduction
Load data in the MATLAB workspace
Identifying motifs
References

Introduction

Arabidopsis thaliana (also known as mouse-ear cress) is a small, weedy organism that has become the model genetic and genomic study system for plants. Its genome is approximately 120 Mb long, with five chromosomes and 29K genes. In this example we study in detail the circadian clock (i.e. the 24-hour cycle of the physiological processes) of this plant. It is known that in plants (but also in humans and other animals) the circadian clock synchronize itself with the external day-night cycle. In particular in Arabidopsis each cell keeps track of the day-night cycle indipendently of all other cells. This is due to the activity of a few key proteins that allow the plants to turn on and off large groups of genes needed at different times of day and night. They do so by binding to specific regulatory sequences called trascription factor binding sites. Regulatory DNA is the sequence surrounding a gene that specify proper trascription. It is a mosaic of short sequence motifs and semirandom DNA. These short motifs are usually found upstream of coding regions but they can also be found downstream. It is extremely difficult to identify these due to their short length, to a certain pecentage of variability in the sequence and because one does not know the motifs and much less the location. In this example we investigate some algorithms to find regulatory sequences for clock-regulated genes in Arabidopsis activated in the evening. Here we make the simplifying assumption that the motifs we are looking for is identical in every istance where it is found.

Load data in the MATLAB workspace

Therefore we can consider only the clock-regulated genes. They are 437 (6% of the genes on the chip). They have been clustered in three groups according to the temporal axis in which the experiment was done: the different time intervals are labeled as 0, 4, 8, 12, 16 and 20 and correspond to every 4 hours in a a 24 hour day/night cycle starting from 8 AM. Cluster 1 correspond to genes whose expression peaks in phases 0 and 4, cluster 2 to phases 8, 12 and cluster 3 to phases 16, 20. You can load the associated sequence directly in the MATLAB workspace, distinguishing between the clusters, and concatenate them. The data from which we look for binding sites are those genes of cluster 2 (191), that is genes highly expressed in the evening. The genes of the others clusters are used as background.

load clockregulated.mat

seq1='';
seq2='';
for ind = 1:191 % cluster2
    seq1=[seq1 '*' up437_set2(ind).Seq{:}];
end
sr1=seqrcomplement(seq1);
seq1=[seq1 '*' sr1];

for ind = 1:246 %cluster 1 and 3
    seq2=[seq2 '*' up437_others(ind).Seq{:}];
end
sr2=seqrcomplement(seq2);
seq2=[seq2 '*' sr2];

Identifying motifs

For simplicity we decide to look only at motifs of fixed length (9): we consider all word of length 9 whose frequency in the evening cluster is very different from its frequency in the rest of the data. Therefore we examine all the words of length 9 in both sequences seq1 and seq2 (considering also the reverse complement). Motifs found are scored and sorted in descending order by margin (the difference between their frequency in cluster 2 and that in cluster 1-3). The top 10 9-mers are computed and shown.

L=9;  %Motif length
numMotifs=10*2; %10 highest scores elements and their reverse complements

[nmers,buckets,buc1,buc2] = diff_nmers_sign(seq1,seq2,L,numMotifs)

nmers =

AAAAAAAAA
TTTTTTTTT
AAAATATCT
AGATATTTT
CTCTCTCTC
GAGAGAGAG
AGAGAGAGA
TCTCTCTCT
AAAAAAAAC
GTTTTTTTT
AAATATCTT
AAGATATTT
AAAAATATC
GATATTTTT
AAATAAAAT
ATTTTATTT
AAAATATAA
TTATATTTT
TAAAAAAAA
TTTTTTTTA


buckets =

  1.0e-003 *

   -0.2239
   -0.2239
    0.1439
    0.1439
    0.1140
    0.1140
    0.1074
    0.1074
   -0.0726
   -0.0726
    0.0713
    0.0713
    0.0695
    0.0695
    0.0656
    0.0656
   -0.0649
   -0.0649
   -0.0606
   -0.0606


buc1 =

    0.0012
    0.0012
    0.0002
    0.0002
    0.0002
    0.0002
    0.0002
    0.0002
    0.0001
    0.0001
    0.0001
    0.0001
    0.0001
    0.0001
    0.0002
    0.0002
    0.0001
    0.0001
    0.0002
    0.0002


buc2 =

    0.0014
    0.0014
    0.0001
    0.0001
    0.0001
    0.0001
    0.0001
    0.0001
    0.0002
    0.0002
    0.0000
    0.0000
    0.0001
    0.0001
    0.0001
    0.0001
    0.0001
    0.0001
    0.0002
    0.0002

The obtained set of motifs contains a lot of repeats (either of single letters or of 2-mers). They likely have no biological significance and they must be filtered out.

[anmers,abuckets,abuc1,abuc2]=repeat_filter(nmers,buckets,buc1,buc2)

anmers =

AAAATATCT
AGATATTTT
AAATATCTT
AAGATATTT
AAAAATATC
GATATTTTT
AAATAAAAT
ATTTTATTT
AAAATATAA
TTATATTTT


abuckets =

  1.0e-003 *

    0.1439
    0.1439
    0.0713
    0.0713
    0.0695
    0.0695
    0.0656
    0.0656
   -0.0649
   -0.0649


abuc1 =

  1.0e-003 *

    0.1988
    0.1988
    0.1098
    0.1098
    0.1203
    0.1203
    0.2118
    0.2118
    0.0549
    0.0549


abuc2 =

  1.0e-003 *

    0.0548
    0.0548
    0.0386
    0.0386
    0.0508
    0.0508
    0.1462
    0.1462
    0.1198
    0.1198

After eliminating the repeating element, we can observe that the most significant element is the motif AAAATATCT. We known from the study of Harmer et al. that it corresponds to the evening element (word of 9 bases found upstream of genes turned on un the evening). Its margin is 0.00014. We notice that 2 of the other 3 top motifs are simply variants of the evening element (AAATATCTT and AAAAATATC). To assess the significance of the value found for the margin of the evening element we perform 100 random splits of the data and measure the margin of the highest-scoring element.

num_trial=100;
margin=zeros(1,num_trial);
for i=1:num_trial
    seq1r='';
    seq2r='';
    indran=randperm(437);

    for ind = 1:191 % cluster2
        seq1r=[seq1r '*' up437(indran(ind)).Seq{:}];
    end
    sr1=seqrcomplement(seq1r);
    seq1r=[seq1r '*' sr1];

    for ind = 192:437 % cluster2
        seq2r=[seq2r '*' up437(indran(ind)).Seq{:}];
    end
    sr2=seqrcomplement(seq2r);
    seq2r=[seq2r '*' sr2];

    [nmers,b,b1,b2] = diff_nmers_sign(seq1r,seq2r,L,numMotifs);

    [anmers,ab,ab1,ab2]=repeat_filter(nmers,b,b1,b2);

    if ab
       margin(i)=ab(1);
    end

end

find(margin>abuckets(1))

ans =

   Empty matrix: 1-by-0

In 100 trials we never observe a margin larger than 0.000147462.

We can look in detail at the frequency of the evening element among all the clock regulated genes. The arrays EEcount and Ngenes show that not all the genes of the second cluster have the evening element, nor this motif is limited only to these genes.

CircadianTime=0:4:20;
EEcount=zeros(length(CircadianTime),1);
Ngenes=zeros(length(CircadianTime),1);
for ind=1:437
    [indices]= strfind(up437(ind).Seq{:}, 'AAAATATCT');
    [indices1]= strfind(seqrcomplement(up437(ind).Seq{:}), 'AAAATATCT');
    ind1=(up437(ind).clusNum)/4+1;
    Ngenes(ind1)=Ngenes(ind1)+1;
    EEcount(ind1)=EEcount(ind1)+length(indices)+length(indices1);
end

Ngenes
EEcount

References

S. L. Harmer, J. B. Hogenesch, M. Straume, C. H.S. Chang, B. Han, Z.Tong X. Wang, J. A. Kreps, and S. A. Kay, Orchestrated Transcription of Key Pathways in Arabidopsis by the Circadian Clock, Science, Vol. 290. no. 5499, pp. 2110 – 2113, Dec 2000.

EVENINGELEMENT Example of identification of regulatory sequences

Contents

Introduction

Load data in the MATLAB workspace

Identifying motifs

References