Long-range correlation in protein sequences and its implication
Kazuhito Shida1, Makoto Ikeda2, Atsuo Kasuya
1shida@cir.tohoku.ac.jp, CIR Tohoku University; 2ikeda@imr.edu, CIR Tohoku University
It is well known that natural amino-acid sequences
contain a certain level of long-range correlation.
One example is the dipeptide substitution matrix proposed by Gonnet et al. (1994),
which assumes that
such correlations are at least partially preserved
during evolution.
We would like to extend this idea to
longer ranges.
We scanned the amino-acid (AA) sequences in GenBank and
observed weak correlations between
pairs of letters more than one position apart.
In some cases, clear long-range patterns were found.
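The kind of scan described above can be sketched as follows. This is a minimal illustration, not the procedure actually used on GenBank: the function names `pair_log_odds` and `mutual_information`, and the choice of a log-odds ratio and mutual information as correlation measures, are assumptions made for the example. For a separation `d`, the observed frequency of each letter pair is compared with the frequency expected if the two positions were independent.

```python
from collections import Counter
from math import log2


def _pair_counts(sequences, d):
    """Count letter pairs (seq[i], seq[i + d]) and their marginals over all sequences."""
    pairs, left, right = Counter(), Counter(), Counter()
    for seq in sequences:
        for i in range(len(seq) - d):
            a, b = seq[i], seq[i + d]
            pairs[(a, b)] += 1
            left[a] += 1
            right[b] += 1
    return pairs, left, right


def pair_log_odds(sequences, d):
    """Log2 of observed / expected count for each letter pair at separation d.

    The expected count assumes the two positions are independent.
    A value near 0 means no correlation for that pair.
    """
    pairs, left, right = _pair_counts(sequences, d)
    total = sum(pairs.values())
    return {
        (a, b): log2(n * total / (left[a] * right[b]))
        for (a, b), n in pairs.items()
    }


def mutual_information(sequences, d):
    """Mutual information (in bits) between letters at separation d."""
    pairs, left, right = _pair_counts(sequences, d)
    total = sum(pairs.values())
    return sum(
        (n / total) * log2(n * total / (left[a] * right[b]))
        for (a, b), n in pairs.items()
    )


# Toy input: a strictly alternating sequence is fully correlated at d = 2.
toy = ["ABABABAB"]
scores = pair_log_odds(toy, 2)
mi = mutual_information(toy, 2)
```

On real data the mutual information at large `d` would be expected to be small, and finite-sample bias has to be taken into account before a weak signal is interpreted as a correlation.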
Such correlation may affect
the similarity assessment of distant homologues,
because this assessment is usually performed by
taking randomized sequences as the null hypothesis.
The distribution of the score
under the null hypothesis may change
when the correct correlation is introduced into
the randomization.
The same effect is expected
in phylogenetic analysis.
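The effect of the randomization scheme on the null distribution can be illustrated with a small sketch. Everything here is an assumption made for the example: a first-order Markov chain is used as a simple stand-in for a correlation-preserving randomization (it keeps only adjacent-letter statistics, not the longer-range correlations discussed above), and `identity_score` is a toy ungapped score rather than a real alignment score.

```python
import random
from collections import defaultdict


def shuffled_null(seq, rng):
    """Plain shuffle: keeps the composition, destroys all correlations."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


def markov_null(seq, rng):
    """Sample a same-length sequence from a first-order Markov chain fitted to seq.

    Adjacent-letter statistics of seq are approximately preserved.
    """
    successors = defaultdict(list)
    for a, b in zip(seq, seq[1:]):
        successors[a].append(b)
    out = [seq[0]]
    while len(out) < len(seq):
        choices = successors.get(out[-1])
        # Fall back to the overall composition if no successor was observed.
        out.append(rng.choice(choices if choices else seq))
    return "".join(out)


def identity_score(x, y):
    """Toy score: number of identical positions in an ungapped comparison."""
    return sum(a == b for a, b in zip(x, y))


def null_scores(query, target, randomize, n=1000, seed=0):
    """Scores of the query against n randomized versions of the target."""
    rng = random.Random(seed)
    return [identity_score(query, randomize(target, rng)) for _ in range(n)]


# Compare the two null distributions for the same query and target.
query, target = "ABABABAB", "ABABABAB"
plain = null_scores(query, target, shuffled_null, n=200)
markov = null_scores(query, target, markov_null, n=200)
```

In this toy case the Markov null reproduces the alternating pattern every time, so its scores are far higher than those of the plain shuffle. A score that looks significant against shuffled sequences may therefore look ordinary once correlations are kept in the null model.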
Also, database searches based on gapped n-grams,
for example PatternHunter by Ma et al. (2002),
can be improved by using the long-range correlation data.
The expected abundance can be used to define
a weight for n-grams, which enables more efficient use of the index structure.
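One possible form of such a weight is sketched below. The details are assumptions made for illustration and are not taken from PatternHunter or from the abstract: the mask notation (`1` for a sampled position, `0` for a gap), the use of a background sequence set to estimate abundance, and the choice of weight as the negative log2 of the estimated frequency, so that rarer gapped n-grams receive larger weights.

```python
from collections import Counter
from math import log2


def gapped_ngrams(seq, mask):
    """Yield the letters of seq at the '1' positions of mask, for every window."""
    keep = [i for i, m in enumerate(mask) if m == "1"]
    for start in range(len(seq) - len(mask) + 1):
        yield "".join(seq[start + i] for i in keep)


def ngram_weights(background, mask, pseudocount=1.0):
    """Weight each gapped n-gram by -log2 of its estimated abundance.

    Returns (weights, unseen_weight); unseen_weight applies to any
    n-gram that never occurs in the background set.
    """
    counts = Counter()
    for seq in background:
        counts.update(gapped_ngrams(seq, mask))
    # One extra pseudocount bucket is reserved for all unseen n-grams.
    total = sum(counts.values()) + pseudocount * (len(counts) + 1)
    weights = {g: -log2((n + pseudocount) / total) for g, n in counts.items()}
    unseen_weight = -log2(pseudocount / total)
    return weights, unseen_weight


# Toy background and a hypothetical mask with one gap position.
grams = list(gapped_ngrams("ABCDE", "101"))
weights, unseen_weight = ngram_weights(["AAAAB"], "101")
```

With weights of this kind, index lookups could be ordered or filtered so that hits on rare, informative n-grams are examined first, while hits on abundant n-grams are given lower priority.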