Extraction of Pathways Involved in Microarray Time Course Experiments

Christine Steinhoff¹, Tobias Mueller², Hannes Luz, Martin Vingron
¹christine.steinhoff@molgen.mpg.de, Max Planck Institute; ²Tobias.Mueller@biozentrum.uni-wuerzburg.de, Biocenter, University Würzburg

Motivation.

Using microarray technology, cell response, differentiation and other cellular processes can be monitored for thousands of genes simultaneously along a time course. In most cases, cellular responses initially lead to only very small changes in expression. Since microarray experiments are known to be very noisy, it is difficult, if not impossible, to recover whole pathways based exclusively on such data. Therefore, it is natural and important to improve the signal-to-noise ratio by including additional sources of information in the analysis. Furthermore, many time course analysis tools are suboptimally designed; for example, they typically do not take into account the time dependencies of the underlying microarray data. Thus, the analysis of microarray time courses should involve algorithms that take into account the dependencies of gene profiles in time and are able to extract the underlying pathways reliably.

Results.

The main goal of our approach is the interpretation of gene expression time course profiles resulting from microarray experiments. We aim to achieve this by jointly extracting common transcription factor binding patterns and the pathways potentially involved.

In order to analyze gene expression and at the same time extract the regulatory pathways involved in the underlying experiment, we propose an approach that integrates four different sources of biological knowledge: 1) the microarray experiment, 2) transcription factor binding site (TFBS) patterns in the genes' upstream regions, 3) annotation of functional categories and 4) the KEGG database of biologically verified pathways. This method allows us to use prior biological knowledge to explain an expression profile time series.

The proposed clustering of gene expression time courses is based on a k-means approach. We introduce a new clustering algorithm whose distance measure is a mixture of two components: the distance between the expression measurements, as used in k-means clustering, and the distance between the first derivatives at each time point. We demonstrate that, although this method is very simple and fast, applying this more adequate distance measure to gene expression time courses gives good results. By choosing the mixing parameter, we can focus either on the shape (derivative) of the time course or on the differences between the individual measurements, and we will discuss optimization strategies for this parameter. Using this method, we obtain an initial grouping of genes with expression patterns that are similar in both shape and distance. This grouping detects cyclic as well as non-cyclic patterns and collects genes without prevalent regulatory changes in specific clusters. Thus, we do not have to exclude genes at the outset. Furthermore, the algorithm is applicable to cyclic data as well as to non-cyclic data, such as data from differentiation processes. Even for mainly cyclic data, one expects to see cell culture specific events or apoptotic events, which can easily be detected in specific clusters and analyzed further. We demonstrate this using a published yeast cell cycle dataset (Spellman et al., Mol Biol Cell 9, 1998) and a non-cyclic dataset of fibroblasts (Iyer et al., Science 283, 1999). We further analyze cyclic as well as non-cyclic clusters.

In the next step, we look for common patterns of transcription factor binding sites. These are modeled by profiles taken from the TRANSFAC database and are resolved accurately and efficiently by a method we developed earlier (Rahmann et al., submitted).
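The mixed distance used in the clustering step can be illustrated with a minimal sketch. It is not the authors' implementation: the squared Euclidean form of both terms, the use of first differences as a stand-in for the first derivative, the convex weighting by `alpha`, and the names `mixed_sq_dist`, `mixed_kmeans` and `init` are all assumptions made for illustration.

```python
import numpy as np


def mixed_sq_dist(profiles, centroids, alpha):
    """Pairwise mixed distance between profiles (n x T) and centroids (k x T).

    Assumed form: (1 - alpha) * squared distance between the measurements
                  + alpha * squared distance between the first differences.
    alpha = 0 compares measurements only, alpha = 1 compares shape only.
    """
    X = np.asarray(profiles, dtype=float)
    C = np.asarray(centroids, dtype=float)
    value = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    dX = np.diff(X, axis=1)  # first differences approximate the derivative
    dC = np.diff(C, axis=1)
    slope = ((dX[:, None, :] - dC[None, :, :]) ** 2).sum(axis=2)
    return (1.0 - alpha) * value + alpha * slope


def mixed_kmeans(profiles, k, alpha=0.5, n_iter=100, init=None, seed=0):
    """Lloyd-style k-means iteration using the mixed distance above.

    init: optional list of k row indices used as starting centroids
          (hypothetical convenience parameter); otherwise k random rows.
    """
    X = np.asarray(profiles, dtype=float)
    if init is None:
        rng = np.random.default_rng(seed)
        init = rng.choice(len(X), size=k, replace=False)
    centroids = X[np.asarray(init)].copy()
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        new_labels = mixed_sq_dist(X, centroids, alpha).argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                # The cluster mean minimizes the summed mixed distance,
                # because both terms are quadratic in the centroid.
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```

With `alpha` close to 1, two profiles that differ only by a constant offset fall into the same cluster; with `alpha` close to 0, the procedure reduces to ordinary k-means on the measurements.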

Putting all information together, we have at hand for each gene its expression profile series, its pattern of transcription factor binding sites, its functional categories and the associated pathway(s). Genes that occur in the same gene expression cluster and have similar TFBS patterns and functional categories are therefore very likely to be coregulated. This finding could be confirmed using the KEGG database.
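The integration step can be sketched as a simple filter over gene pairs. This is an illustration only: the abstract does not specify how similarity of TFBS patterns is measured, so the Jaccard index, the threshold `min_tfbs_similarity`, the record layout and the function name `coregulation_candidates` are assumptions, and the gene, factor, category and pathway labels in the usage note are placeholders rather than real annotations.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def coregulation_candidates(genes, min_tfbs_similarity=0.5):
    """Return gene pairs supported by all three evidence types.

    genes: dict mapping a gene name to a record with the keys
           'cluster', 'tfbs', 'categories' and 'pathways' (assumed layout).
    A pair is kept if both genes lie in the same expression cluster,
    their TFBS sets are similar enough, and they share a functional
    category. Shared pathway annotations are reported alongside each
    pair so that it can be checked against pathway knowledge.
    """
    result = []
    for g1, g2 in combinations(sorted(genes), 2):
        a, b = genes[g1], genes[g2]
        if a["cluster"] != b["cluster"]:
            continue
        if jaccard(a["tfbs"], b["tfbs"]) < min_tfbs_similarity:
            continue
        if not set(a["categories"]) & set(b["categories"]):
            continue
        shared = sorted(set(a["pathways"]) & set(b["pathways"]))
        result.append((g1, g2, shared))
    return result
```

For example, two placeholder genes in the same cluster with TFBS sets `{"TF1", "TF2"}` and `{"TF1", "TF2", "TF3"}` and a common category are reported as a candidate pair, together with any pathway labels they share.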

Finally, we examine whether the results extracted from the different biological databases are clearly supported by the underlying microarray experiment, and we discard possible false positives resulting from the preceding steps.
In the future, we want to generalize this method so that the clustering is applied simultaneously to all data representing a gene.