During evolution, selective forces impose different rates of
conservation on different functional sites in DNA, thereby producing
distinctive signatures for different genomic regions.
Because the pattern of conservation in coding regions differs from that
in non-coding regions, comparative computational analysis can, in
principle, improve the identification of genes in one species by
comparing its genome with that of an evolutionarily related
species. Many comparative approaches, ranging from visual inspection of
sequence alignments to the fully automated HMM-based TWINSCAN [Korf 2001]
and SLAM [Pachter 2002], do so by relying on a given pair of organisms, such
as human and mouse. More precisely, they rely on the ad hoc rule that
the approach fails if the two organisms are too closely or too distantly
related, because the degree of similarity then becomes uniform
throughout the pair of genomes.
We propose a formal way to select an optimal pair of genomes/genomic
regions. We start by assuming a general Markov model of evolution that
gives a probabilistic interpretation of the evolutionary forces in
conserved and non-conserved genomic regions. We combine this model with
an HMM-based model of a comparative gene finder. As a key observation,
we relate the task of selecting the "best" pair of genomes to that of
minimizing the gene detection error in the combined HMM-Markov
evolutionary model as a function of the evolutionary distance between
genomic regions. We study error analysis in HMMs, an
infrequently visited topic, and from it derive analytical solutions
to the problem of accuracy maximization for a simplified comparative
gene-finding model. When using a more realistic gene finder model [Zhang
2003], our simulation studies indicate that genomes across a wide range
of evolutionary distances appear to deliver reasonable predictions of
human genes. The evolutionary time between human and mouse generally
falls in this range; however, better accuracy might be achieved with a
reference species other than mouse.
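
The trade-off behind the ad hoc rule can be illustrated with a minimal
sketch. It is not the general Markov model or the HMM-based gene finder
described above. It assumes, purely for illustration, a Jukes-Cantor
substitution model in which conserved sites evolve at a fraction rho of
the neutral rate; the names identity_prob, identity_gap and rho are
hypothetical. Under these assumptions the gap in expected sequence
identity between conserved and non-conserved sites vanishes for very
close and very distant pairs and peaks at an intermediate distance.

```python
import math


def identity_prob(d):
    """Jukes-Cantor probability that two aligned bases are identical,
    given d expected substitutions per site between the two sequences.
    (Assumed toy model, not the model used in the paper.)"""
    return 0.25 + 0.75 * math.exp(-4.0 * d / 3.0)


def identity_gap(d, rho):
    """Difference in expected identity between conserved sites, which
    evolve at a fraction rho (0 < rho < 1) of the neutral rate, and
    non-conserved sites at neutral distance d."""
    return identity_prob(rho * d) - identity_prob(d)


def best_distance_grid(rho, d_max=10.0, steps=10000):
    """Grid search for the neutral distance that maximizes the gap."""
    grid = [d_max * i / steps for i in range(1, steps + 1)]
    return max(grid, key=lambda d: identity_gap(d, rho))


def best_distance_closed_form(rho):
    """Stationary point of identity_gap in d, obtained by setting its
    derivative to zero: d* = -3 ln(rho) / (4 (1 - rho))."""
    return -3.0 * math.log(rho) / (4.0 * (1.0 - rho))


if __name__ == "__main__":
    rho = 0.25  # hypothetical relative rate of conserved sites
    print(best_distance_grid(rho), best_distance_closed_form(rho))
```

With the hypothetical value rho = 0.25, the closed form gives d* = ln 4,
about 1.39 substitutions per site. This number only illustrates the
shape of the trade-off under the toy model; it is not a result of the
analysis reported here.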
Korf, I., Flicek, P., Duan, D. & Brent, M. (2001), ‘Integrating genomic
homology into gene structure prediction’, Bioinformatics 17, S140–S148.
Pachter, L., Alexandersson, M. & Cawley, S. (2002), ‘Applications of
generalized pair hidden Markov models to alignment and gene finding
problems’, Journal of Computational Biology 9, 389–400.
Zhang, L., Pavlovic, V., Cantor, C.R. & Kasif, S. (2003), ‘Human-mouse
gene identification by comparative evidence integration and
evolutionary analysis’, Genome Research, to appear.