We describe an algorithm that identifies families of genes with similar nucleotide patterns and hopefully similar function. The procedure assumes a hidden Markov model (HMM) for the evolution of the DNA sequence. Typically, HMM segmentation methods assume that the observed process - here the DNA sequence - evolves independently given the unobserved Markov chain which locates the position of the family members. However, this assumption does not properly account for the additional short-range structure that is often evident in DNA sequences. Additionally, HMM-based classification methods attempt to allocate regions of DNA to a known number of possible families. Our algorithm is more general in that it determines the number of statistically significant families and their distinctive (complex) nucleotide patterns.
In this work, we adopt a Bayesian approach to inference which allows us to take full account of the uncertainty in the locations and the composition of the various gene families. It also permits the incorporation of prior knowledge about these unknowns and provides a coherent framework for model comparison/selection. The complex structure of this model precludes a fully analytic treatment and we therefore use modern computationally intensive statistical techniques. Markov chain Monte Carlo (MCMC) algorithms are ideally suited to such problems and, in particular, we use trans-dimensional MCMC to explore both the parameter and model space.
We illustrate the general method by investigating the structure of the bacteriophage lambda genome, a common benchmark sequence used for the comparison of statistical segmentation algorithms.