LAMBDAPHAGE Example of sequence statistics and segmentation with MATLAB
This demonstration looks at some statistics about the DNA content of the Lambda Phage and shows an example of segmentation of a sequence.
Introduction
Phages are viruses that infect bacteria. Bacteriophage lambda infects the bacterium Escherichia coli, a very well studied model system. Bacteriophage lambda was one of the first viral genomes to be completely sequenced (1982). Its genome is 48502 bases long. The Genome repository at the NCBI contains more information about it.
web('http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genome&cmd=Retrieve&dopt=Overview&list_uids=10119');
The DNA sequence can be obtained from the GenBank database with the accession number NC_001416. Calling the getgenbank function with the 'SequenceOnly' flag loads just the nucleotide sequence into the MATLAB workspace.
BLambda = getgenbank('NC_001416','SequenceOnly',true);
If you don’t have a live web connection, you can load the data from a MAT-file using the command
% load BLambda % <== Uncomment this if no internet connection
The MATLAB whos command gives information about the size of the sequence.
whos BLambda
  Name         Size        Bytes  Class
  BLambda      1x48502     97004  char array

Grand total is 48502 elements using 97004 bytes
%The total length of the Bacteriophage Lambda genome is 48502 bp.
Change-point analysis
Local fluctuations in nucleotide frequencies carry useful information about a genome. You can measure the local base composition with a sliding window of adjustable size. The following calls use window sizes of 2000 bp, 3000 bp, and 4000 bp, respectively.
ntdensity(BLambda,'window',2000)
ntdensity(BLambda,'window',3000)
ntdensity(BLambda,'window',4000)
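To make the sliding-window idea concrete, here is a minimal sketch that computes the windowed GC fraction as a plain moving average. The toy sequence, the window size of 4, and the variable names are illustrative assumptions, and ntdensity may treat the window edges differently.

```matlab
% Toy sequence (not the phage genome): a GC-rich half followed by an AT-rich half
toySeq = upper('GGCCGGCCATATATAT');
win = 4;                                        % window size in bases (illustrative)

% Indicator vector: 1 where the base is G or C, 0 otherwise
isGC = double(toySeq == 'G' | toySeq == 'C');

% Moving average over full windows only ('valid' drops partial windows at the ends)
gcFrac = conv(isGC, ones(1, win) / win, 'valid');

% gcFrac(k) is the GC fraction of toySeq(k : k+win-1)
disp(gcFrac)
```

On the real data you could replace toySeq with upper(BLambda) and win with 2000. The drop in gcFrac from the first half to the second is the change point discussed in this section.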
The plots show that the phage genome is composed of two halves with markedly different GC content: the first is GC rich, the second AT rich. This is an example of a change point in a genome.
Segmentation with Hidden Markov Model
You can use a hidden Markov model (HMM) to segment the Lambda Phage genome into blocks of these two states. Start by generating random transition and emission matrices. These serve as initial guesses for the Expectation Maximization (EM) algorithm, which refines the parameter estimates.
T=rand(2,2);
E=rand(2,4);
% Normalize matrices
T(1,:) = T(1,:) ./ (norm(T(1,:),1));
T(2,:) = T(2,:) ./ (norm(T(2,:),1));
E(1,:) = E(1,:) ./ (norm(E(1,:),1));
E(2,:) = E(2,:) ./ (norm(E(2,:),1));
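The row-by-row normalization above can also be written in vectorized form. This is a sketch that assumes MATLAB R2016b or later, where implicit expansion lets a matrix be divided by a column vector. On older releases, bsxfun(@rdivide, T, sum(T,2)) does the same job.

```matlab
T = rand(2,2);            % random initial transition matrix
E = rand(2,4);            % random initial emission matrix

% Divide every row by its own sum so that each row is a probability distribution
T = T ./ sum(T, 2);
E = E ./ sum(E, 2);

% Sanity check: each row should now sum to 1 (up to rounding)
rowSumsT = sum(T, 2);
rowSumsE = sum(E, 2);
disp([rowSumsT rowSumsE])
```

Because rand returns positive entries, the row sum equals the 1-norm used in the original code, so both versions produce the same matrices.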
The nucleotides 'A', 'C', 'G', and 'T' are encoded by 1, 2, 3, and 4, respectively.
seq = nt2int(BLambda);
[estT, estE] = hmmtrain(seq,T,E);
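The integer encoding described above can be illustrated without the toolbox. The sketch below uses ismember on a toy string. It only shows the A/C/G/T mapping and is not how nt2int is implemented; nt2int also handles ambiguous symbols, which this sketch ignores.

```matlab
toySeq = 'gattaca';                        % toy sequence (illustrative)
alphabet = 'ACGT';                         % position in this string = integer code

% The second output of ismember is the index of each base in the alphabet
[found, code] = ismember(upper(toySeq), alphabet);

% Any base outside A/C/G/T would get code 0; stop early if that happens
assert(all(found), 'Sequence contains symbols other than A, C, G, T');
disp(code)                                 % 3 1 4 4 1 2 1
```

The integer codes double as column indices into the emission matrix E, which is why E has four columns.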
With the Viterbi algorithm and the estimated matrices, you can segment the sequence into its most likely series of hidden states.
estimatedStates = hmmviterbi(seq,estT,estE);
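To show what the Viterbi step computes, here is a minimal log-space sketch on a toy model. The matrices and the observation sequence are made up for illustration. The sketch assumes a uniform distribution over the initial state, which is a simplification and may differ from the convention hmmviterbi uses.

```matlab
% Toy two-state model (hypothetical numbers, not estimated from the phage)
T = [0.9 0.1; 0.1 0.9];                 % sticky transitions
E = [0.1 0.4 0.4 0.1;                   % state 1: GC rich (columns A C G T)
     0.4 0.1 0.1 0.4];                  % state 2: AT rich
obs = [2 3 3 2 3 1 4 1 4 4];            % C G G C G A T A T T

nStates = size(T, 1);
L = numel(obs);
logT = log(T);
logE = log(E);
score = -inf(nStates, L);               % best log-probability of a path ending in each state
back = zeros(nStates, L);               % back-pointers to the best previous state

score(:, 1) = log(1 / nStates) + logE(:, obs(1));   % assumed uniform start
for t = 2:L
    for j = 1:nStates
        [best, arg] = max(score(:, t-1) + logT(:, j));
        score(j, t) = best + logE(j, obs(t));
        back(j, t) = arg;
    end
end

% Trace back from the best final state
path = zeros(1, L);
[~, path(L)] = max(score(:, L));
for t = L-1:-1:1
    path(t) = back(path(t+1), t+1);
end
disp(path)                              % 1 1 1 1 1 2 2 2 2 2
```

Working in log space avoids numerical underflow, which matters for a sequence as long as the phage genome.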
You can plot the nucleotide density and the change points together.
ntdensity(BLambda);
hold on
plot(estimatedStates-1,'k--') % for visualization the states 1/2 are shifted to 0/1
hold off
Now you can compare this with the segmentation obtained from the initial random guesses for the matrices.
BADestimatedStates = hmmviterbi(seq,T,E);
figure
ntdensity(BLambda);
hold on
plot(BADestimatedStates-1,'k--') % for visualization the states 1/2 are shifted to 0/1
hold off
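Besides plotting, a state path such as the one returned by hmmviterbi can be turned into explicit segment boundaries. The sketch below uses a short made-up state vector; the variable names are illustrative.

```matlab
states = [1 1 1 2 2 2 2 1 1];               % toy state path (illustrative)

% A change point is any position whose state differs from the previous one
changeIdx = find(diff(states) ~= 0) + 1;    % first position of each new segment

segStart = [1, changeIdx];                  % start index of every segment
segEnd = [changeIdx - 1, numel(states)];    % end index of every segment
segState = states(segStart);                % state label of every segment

disp([segStart; segEnd; segState])
```

Applied to estimatedStates, a single entry in changeIdx near the middle of the genome would match the two-halves picture seen in the density plots. Many short segments would suggest a poorly fitted model.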