Long-range correlation in protein sequences and its implication

Kazuhito Shida¹, Makoto Ikeda², Atsuo Kasuya
¹shida@cir.tohoku.ac.jp, CIR Tohoku University; ²ikeda@imr.edu, CIR Tohoku University

Natural amino-acid sequences are well known to carry some degree of long-range correlation. One example is the dipeptide substitution matrix proposed by Gonnet et al. (1994), which assumes that such correlations are at least partially preserved in the course of evolution. We extend this idea to longer ranges. We scanned the amino-acid sequences in GenBank and observed weak correlations between two residues more than one position apart. In some cases, clear long-range patterns were found. Such correlations may affect the similarity assessment of distant homologues, because that assessment usually takes randomized sequences as the null hypothesis. The score distribution under the null hypothesis may change when the correct correlation is built into the randomization. The same effect is expected for phylogenetic analysis. In addition, database searches based on gapped n-grams, for example PatternHunter by Ma et al. (2002), could be improved by using the long-range correlation data. The expected abundance of each n-gram can be used to define its weight, which allows more efficient use of the index structure.
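
One common way to quantify a correlation between two residues at a fixed separation is the log-odds ratio of the observed pair frequency to the frequency expected if the two positions were independent. The sketch below illustrates that measure only; it is not necessarily the statistic used in this work. The function name `gapped_pair_log_odds` and the parameter `gap` (the offset between the two positions, so `gap=1` means adjacent residues) are hypothetical names introduced for illustration.

```python
from collections import Counter
from math import log2


def gapped_pair_log_odds(sequences, gap):
    """Return log2(observed / expected) for each residue pair (a, b)
    found at positions i and i + gap.

    Assumption: "expected" is the product of the marginal frequencies of
    a (at the left position) and b (at the right position), i.e. the
    independence model. Pairs that never occur are simply absent from
    the result.
    """
    pair_counts = Counter()   # counts of (a, b) at offset `gap`
    left_counts = Counter()   # counts of a at the left position
    right_counts = Counter()  # counts of b at the right position
    total = 0

    for seq in sequences:
        for i in range(len(seq) - gap):
            a, b = seq[i], seq[i + gap]
            pair_counts[(a, b)] += 1
            left_counts[a] += 1
            right_counts[b] += 1
            total += 1

    scores = {}
    for (a, b), n_ab in pair_counts.items():
        expected = left_counts[a] * right_counts[b] / total
        scores[(a, b)] = log2(n_ab / expected)
    return scores
```

A positive score marks a pair that is over-represented at that separation, a negative score an under-represented one. Under the same assumption, the expected count `left_counts[a] * right_counts[b] / total` is one possible starting point for the n-gram weights mentioned above.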