Partially supervised clustering of gene expression time course data
Alexander Schoenhuth1, Alexander Schliep2, Christine Steinhoff
1aschoen@zpr.uni-koeln.de, Center for Applied Computer Science, University Cologne; 2schliep@molgen.mpg.de, Max Planck Institute for Molecular Genetics, Berlin
Performing microarray experiments consecutively in time produces a time course of gene
expression profiles. Model-based clustering methods are a recent approach to classifying
these time courses: statistical models represent the clusters, and each data point is assigned
to the cluster whose model maximizes its likelihood.
Model-based clustering accounts for horizontal dependencies between expression levels at
different time points and is therefore better suited to classifying time courses than
conventional, usually distance-based, methods.
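The likelihood-based assignment described above can be illustrated with a minimal sketch. It assumes hidden Markov models with Gaussian emissions and evaluates log P(series | model) with the scaled forward algorithm; the two three-state left-to-right models, their parameters, and the example series are hypothetical values chosen for illustration, not taken from the poster.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def log_likelihood(series, hmm):
    """Scaled forward algorithm: returns log P(series | hmm)."""
    start, trans, emit = hmm["start"], hmm["trans"], hmm["emit"]
    n = len(start)
    # Initialisation: start probability times emission density of the first value.
    alpha = [start[i] * gaussian_pdf(series[0], *emit[i]) for i in range(n)]
    log_lik = 0.0
    for t in range(len(series)):
        if t > 0:
            # Recursion: sum over predecessor states, then emit the value at time t.
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n))
                * gaussian_pdf(series[t], *emit[j])
                for j in range(n)
            ]
        # Rescale to avoid underflow; the log of the scale factors sums to log P.
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik


def assign(series, models):
    """Return the name of the model under which the series is most likely."""
    return max(models, key=lambda name: log_likelihood(series, models[name]))


# Hypothetical models: each state emits around a (mean, std) expression level.
LEFT_TO_RIGHT = [[0.6, 0.4, 0.0], [0.0, 0.6, 0.4], [0.0, 0.0, 1.0]]
models = {
    "up": {"start": [1.0, 0.0, 0.0], "trans": LEFT_TO_RIGHT,
           "emit": [(-1.0, 0.5), (0.0, 0.5), (1.0, 0.5)]},
    "down": {"start": [1.0, 0.0, 0.0], "trans": LEFT_TO_RIGHT,
             "emit": [(1.0, 0.5), (0.0, 0.5), (-1.0, 0.5)]},
}

rising = [-1.1, -0.8, -0.1, 0.2, 0.9, 1.2]
falling = [1.0, 0.9, 0.1, -0.2, -0.9, -1.1]
print(assign(rising, models), assign(falling, models))
```

Because the state sequence links consecutive time points, the likelihood depends on the order of the measurements, which a purely distance-based comparison of unordered values would ignore.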
As the number of genes with known function grows, there is a need for
classification methods that can use prior knowledge. This can be realized by
partially supervised clustering: models that represent, and are learnt from, labeled sets of
genes with known function are added to the collection of clusters. In the iteration
steps of the clustering algorithm, labeled genes may not be reassigned to other
clusters.
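The constraint on labeled data can be sketched as a single reassignment step. The sketch assumes the log-likelihood of every gene under every cluster model has already been computed; the function name `reassign`, the gene and cluster names, and all numbers are hypothetical.

```python
def reassign(log_lik, assignment, labeled):
    """One reassignment step of partially supervised clustering.

    log_lik:    gene -> {cluster: log-likelihood of the gene under that cluster's model}
    assignment: gene -> current cluster
    labeled:    set of genes whose cluster is fixed by prior knowledge
    """
    new_assignment = {}
    for gene, scores in log_lik.items():
        if gene in labeled:
            # Labeled genes keep their cluster, even if another model scores higher.
            new_assignment[gene] = assignment[gene]
        else:
            # Unlabeled genes move to the cluster maximizing their likelihood.
            new_assignment[gene] = max(scores, key=scores.get)
    return new_assignment


# Hypothetical log-likelihoods for three genes and three clusters.
log_lik = {
    "geneA": {"cell_cycle": -12.0, "up": -3.5, "down": -20.1},
    "geneB": {"cell_cycle": -4.2, "up": -9.8, "down": -15.0},
    "geneC": {"cell_cycle": -7.7, "up": -6.9, "down": -2.3},
}
assignment = {"geneA": "cell_cycle", "geneB": "up", "geneC": "up"}
labeled = {"geneA"}

updated = reassign(log_lik, assignment, labeled)
print(updated)
```

In this example geneA stays in its labeled cluster although the "up" model gives it a higher likelihood, while the two unlabeled genes move freely.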
In our case clusters are represented by hidden Markov models (HMMs). Besides
their use in biological sequence analysis, HMMs have been applied successfully to the analysis
of time course data in a wide range of problem domains. An initial collection of models is
chosen to cover typical qualitative behavior such as up- or down-regulation. Models learnt
from labeled genes are added. In the iteration steps new model parameters are computed using
Baum-Welch training (an expectation-maximization algorithm). Genes which have no labels are
then reassigned to the models maximizing their likelihood. This iterative procedure is carried
out until the assignment converges.
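The overall iteration can be sketched as follows. To keep the example short and self-contained, the HMMs and Baum-Welch training are replaced by a much simpler stand-in, which is an assumption of this sketch and not the method of the poster: each cluster is a mean expression profile, and the likelihood treats time points as independent unit-variance Gaussians. The names `fit_profile`, `cluster`, and the data are hypothetical; only the control flow (add models learnt from labeled genes, re-estimate, reassign unlabeled genes, stop when the assignment no longer changes) follows the description above.

```python
import math


def fit_profile(series_list):
    """Stand-in for Baum-Welch training: mean expression at each time point."""
    n = len(series_list)
    return [sum(values) / n for values in zip(*series_list)]


def log_likelihood(series, profile):
    """Stand-in for the HMM likelihood: independent unit-variance Gaussians."""
    return sum(-0.5 * (x - m) ** 2 - 0.5 * math.log(2.0 * math.pi)
               for x, m in zip(series, profile))


def cluster(data, initial_models, labels, max_iter=100):
    """data: gene -> series; initial_models: cluster -> profile;
    labels: gene -> cluster for genes with known function (never reassigned)."""
    models = dict(initial_models)
    # Add one model per label, learnt from the genes carrying that label.
    for name in sorted(set(labels.values())):
        models[name] = fit_profile([data[g] for g, c in labels.items() if c == name])
    assignment = dict(labels)
    for _ in range(max_iter):
        # Reassign unlabeled genes to the model maximizing their likelihood.
        new_assignment = dict(labels)
        for gene, series in data.items():
            if gene not in labels:
                new_assignment[gene] = max(
                    models, key=lambda c: log_likelihood(series, models[c]))
        if new_assignment == assignment:
            break  # the assignment has converged
        assignment = new_assignment
        # Re-estimate every non-empty model from its current members.
        for name in models:
            members = [data[g] for g, c in assignment.items() if c == name]
            if members:
                models[name] = fit_profile(members)
    return assignment, models


# Hypothetical data: two labeled genes and three unlabeled ones.
data = {
    "g1": [0.0, 1.0, 0.0, 1.0],
    "g2": [0.1, 0.9, 0.1, 1.1],
    "g3": [-1.0, -0.4, 0.5, 1.1],
    "g4": [1.2, 0.3, -0.6, -1.0],
    "g5": [0.0, 1.1, -0.1, 0.9],
}
initial_models = {"up": [-1.0, -0.3, 0.3, 1.0], "down": [1.0, 0.3, -0.3, -1.0]}
labels = {"g1": "oscillating", "g2": "oscillating"}

assignment, models = cluster(data, initial_models, labels)
print(assignment)
```

With HMMs in place of the stand-in, `fit_profile` would become a Baum-Welch re-estimation and `log_likelihood` a forward-algorithm evaluation; the loop itself would be unchanged.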
We apply the method to simulated data and to various published data sets and compare the
results with those of purely unsupervised and purely supervised methods.
This poster refers to the paper 'Using Hidden Markov Models for analyzing gene expression
time course data' by the same authors.