Freeing Phylogenies from Alignments

Michael Höhl (1), Isidore Rigoutsos (2), Mark Ragan
(1) m.hoehl@imb.uq.edu.au, Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD 4072, Australia
(2) rigoutso@us.ibm.com, Computational Biology Center, IBM Thomas J Watson Research Center

Aligning sequences is usually the first step in reconstructing their phylogeny. Multiple sequence alignment is a hard problem: for most data, the space of alignment parameters (for example, substitution matrices and gap penalties) is not explored. Furthermore, it is impossible to achieve a satisfying overall alignment of sequences with shuffled domains.

To free phylogenies from alignments, we present two novel approaches based on exhaustive, automated pattern discovery using TEIRESIAS (available at http://cbcsrv.watson.ibm.com/Tspd.html):

The first approach computes pairwise distances from patterns, so that all distance-based tree inference methods become applicable. The distance is defined analogously to distances on alignments, with one improvement: it takes into account the occurrence of patterns in multiple sequences.
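As a rough illustration of the first approach, the sketch below derives a pairwise distance from pattern occurrences. The abstract does not give the actual distance definition, so everything here is an assumption: patterns are written as regular-expression-compatible strings with `.` as a wildcard, the distance is a weighted Jaccard-style measure, and the weighting by `1 / (number of sequences containing the pattern)` is only one hypothetical way of accounting for patterns that occur in multiple sequences. The names `occurrences` and `pattern_distance` are invented for this example.

```python
import re


def occurrences(patterns, sequences):
    """Map each pattern to the set of sequence names in which it occurs.

    Assumption: patterns are regex-compatible strings ('.' = any residue).
    """
    return {
        p: {name for name, seq in sequences.items() if re.search(p, seq)}
        for p in patterns
    }


def pattern_distance(a, b, occ):
    """Hypothetical weighted Jaccard-style distance between sequences a and b.

    Each pattern is weighted by 1 / (number of sequences containing it),
    so a pattern found in many sequences contributes less than a rare one.
    This weighting is an illustrative assumption, not the authors' formula.
    """
    shared = 0.0
    total = 0.0
    for names in occ.values():
        in_a, in_b = a in names, b in names
        if not (in_a or in_b):
            continue  # pattern is uninformative for this pair
        weight = 1.0 / len(names)
        total += weight
        if in_a and in_b:
            shared += weight
    # Sequences that share no pattern at all get the maximum distance.
    return 1.0 - shared / total if total else 1.0


def distance_matrix(patterns, sequences):
    """Return {(name_a, name_b): distance} for all ordered pairs."""
    occ = occurrences(patterns, sequences)
    return {
        (a, b): 0.0 if a == b else pattern_distance(a, b, occ)
        for a in sequences
        for b in sequences
    }
```

A matrix built this way could be passed to any distance-based tree inference method, such as neighbor joining.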

The second approach transforms patterns into character data, which enables us to apply character-based methods; we chose the statistically sound Bayesian tree sampling tool MrBayes. The strength of character data (relevant properties need not be explicitly extracted and forced into a single number) is balanced by computational expense.
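One way to picture the second approach is a presence/absence coding: each pattern becomes one binary character, scored 1 for a sequence that contains it and 0 otherwise. The abstract does not specify the actual transformation, so this coding is an assumption, as are the helper names `presence_matrix` and `to_nexus` and the regex-style pattern syntax. The output is a minimal NEXUS data block using `datatype=standard`, the data type used for discrete non-molecular characters.

```python
import re


def presence_matrix(patterns, sequences):
    """Code each pattern as one binary character per sequence.

    Assumption: '1' if the (regex-compatible) pattern occurs in the
    sequence, '0' otherwise. The authors' coding may differ.
    """
    return {
        name: "".join("1" if re.search(p, seq) else "0" for p in patterns)
        for name, seq in sequences.items()
    }


def to_nexus(matrix):
    """Render a {taxon: character string} mapping as a NEXUS data block."""
    ntax = len(matrix)
    nchar = len(next(iter(matrix.values()))) if matrix else 0
    lines = [
        "#NEXUS",
        "begin data;",
        f"  dimensions ntax={ntax} nchar={nchar};",
        "  format datatype=standard missing=? gap=-;",
        "  matrix",
    ]
    lines += [f"    {name}  {row}" for name, row in matrix.items()]
    lines += ["  ;", "end;"]
    return "\n".join(lines)
```

Unlike a distance, this representation keeps every pattern as a separate column, which is why the information is not collapsed into one number per sequence pair.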

First results on biological and artificial datasets indicate that both approaches are viable and may offer an alternative to the conventional alignment step, especially when the parameter space cannot be explored.