Generation and Clustering of Phylogenetic Profiles for Automatic Functional Annotation of Proteins

Yen-Chen Steven Huang1, Vic Arcus, Ted Baker, Shaun Lott, Patricia Riddle, Chris Triggs
1yc.huang@auckland.ac.nz, The Centre for Molecular Biodiscovery, University of Auckland

Phylogenetic profiling assigns functional clues to proteins based on their patterns of inheritance across multiple proteomes, independently of their amino acid sequence similarity to proteins of known function. It thereby facilitates the study of uncharacterised proteins, and it also has the potential to assign novel functional clues to proteins of known function. Another important characteristic of phylogenetic profile analysis is that its coverage and accuracy increase with the number of whole genome sequences, so the analysis will benefit immensely from the rapidly growing body of genome sequence information. However, it was observed that adding proteomes also reduces the size of clusters, eventually leaving most clusters with only one protein member, from which no functional linkages can be deduced. Here we present an improved algorithm that constructs phylogenetic profiles of proteins, using the Smith-Waterman optimal local alignment algorithm to determine inheritance patterns across multiple proteomes. A post-alignment low-complexity filtering step removes matches caused by low-complexity regions. Phylogenetic profiles of the proteins from 81 complete proteomes were constructed with this algorithm, and the proteins were clustered by the similarity of their profiles. To overcome the problem of singleton clusters, i.e. clusters with only one protein member, a hierarchical cluster tree was built that allows vertical traversal to clustering levels with different profile similarity requirements. The resultant clusters were compared to the MetaCyc metabolic pathway database as well as to a randomized version of the MetaCyc database.
The comparison showed that proteins in the same MetaCyc metabolic pathway have significantly more similar phylogenetic profiles than randomly selected proteins, which provides strong experimental support for the algorithm.
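
To make the idea of profile construction and multi-level clustering concrete, the following is a minimal Python sketch, not the implementation described above. It assumes that each protein already has a best alignment score against every proteome (for example from a Smith-Waterman search, after low-complexity filtering), and that a score at or above a cut-off counts as "present". The cut-off value, the use of Hamming distance as the profile similarity measure, the single-linkage grouping, the function names and the toy scores are all illustrative assumptions; the abstract does not specify any of them.

```python
from itertools import combinations


def build_profile(best_scores, threshold):
    """Turn per-proteome best alignment scores into a binary presence/absence profile.

    `threshold` is a hypothetical score cut-off, not a value from the abstract.
    """
    return tuple(1 if score >= threshold else 0 for score in best_scores)


def hamming(p, q):
    """Number of proteomes in which two profiles disagree (assumed distance measure)."""
    return sum(a != b for a, b in zip(p, q))


def cluster_at_level(profiles, max_mismatches):
    """Group proteins whose profiles differ in at most `max_mismatches` positions.

    Uses single linkage via union-find: relaxing `max_mismatches` corresponds to
    moving up a hierarchical cluster tree, which merges singleton clusters.
    """
    names = list(profiles)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if hamming(profiles[a], profiles[b]) <= max_mismatches:
            parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Toy data: best alignment score of each protein against four proteomes.
toy_scores = {
    "protA": [120, 15, 98, 10],
    "protB": [110, 12, 101, 9],
    "protC": [130, 14, 20, 11],
    "protD": [8, 95, 7, 140],
}
toy_profiles = {name: build_profile(s, threshold=50) for name, s in toy_scores.items()}

strict = cluster_at_level(toy_profiles, max_mismatches=0)   # identical profiles only
relaxed = cluster_at_level(toy_profiles, max_mismatches=1)  # one mismatch tolerated
```

With these toy scores, requiring identical profiles leaves protC and protD as singleton clusters, while tolerating one mismatch merges protC into the cluster of protA and protB. This illustrates why relaxing the similarity requirement at a higher level of the tree can recover functional linkages that the strictest level misses.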