To analyze gene expression and at the same time extract the regulatory pathways involved in the underlying experiment, we propose an approach that integrates four sources of biological knowledge: 1) microarray experiments, 2) transcription factor binding site (TFBS) patterns in the genes' upstream regions, 3) annotation of functional categories and 4) the KEGG database of biologically verified pathways. This method allows us to use prior biological knowledge to explain an expression profile time series.
The proposed clustering of gene expression time courses is based on k-means. We introduce a new clustering algorithm whose distance measure is a mixture of the standard k-means distance between measurements and a distance between the first derivatives at each time point. We demonstrate that this method, although very simple and fast, gives good results because the distance measure is better suited to gene expression time courses. By choosing the mixing parameter we can focus either on the shape (derivative) of the time course or on the differences between the individual measurements, and we discuss optimization strategies for this parameter. The method yields an initial grouping of genes with similar expression patterns, both in shape and in distance. This grouping detects cyclic as well as non-cyclic patterns and collects genes without prevalent regulatory changes in specific clusters, so we do not have to exclude genes at the outset. The algorithm is therefore applicable to cyclic data as well as to non-cyclic data, as arise for example in differentiation processes. Even in mainly cyclic data one expects cell-culture-specific or apoptotic events, which can easily be detected in specific clusters and analyzed further. We demonstrate this using a published yeast cell cycle dataset (Spellman et al., Mol Biol Cell 9, 1998) and a non-cyclic fibroblast dataset (Iyer et al., Science 283, 1999), and we further analyze both cyclic and non-cyclic clusters. In the next step we look for common patterns of transcription factor binding sites. These are modeled by profiles taken from the TRANSFAC database and are resolved accurately and efficiently by a method we developed earlier (Rahmann et al., submitted).
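The mixed distance measure used in the clustering step can be illustrated with a minimal sketch. It is not the implementation used in this work: it assumes equally spaced time points, approximates the first derivative by finite differences, uses squared Euclidean distances for both terms, and names the mixing parameter `lam`. The function names and the optional `init` argument are likewise hypothetical.

```python
import random


def first_differences(profile):
    """Finite-difference approximation of the first derivative (unit time steps assumed)."""
    return [b - a for a, b in zip(profile, profile[1:])]


def squared_euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mixed_distance(x, y, lam):
    """Mix of value distance and shape (derivative) distance.

    lam = 0 gives the ordinary k-means distance, lam = 1 compares shape only.
    """
    value_term = squared_euclidean(x, y)
    shape_term = squared_euclidean(first_differences(x), first_differences(y))
    return (1.0 - lam) * value_term + lam * shape_term


def mean_profile(profiles):
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def kmeans_mixed(profiles, k, lam, n_iter=50, seed=0, init=None):
    """Lloyd-style k-means using mixed_distance for the assignment step.

    Both terms are quadratic in the profile, so the cluster mean remains
    a valid centroid update. Returns (labels, centroids).
    """
    if init is None:
        rng = random.Random(seed)
        centroids = [list(p) for p in rng.sample(profiles, k)]
    else:
        centroids = [list(c) for c in init]
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda j: mixed_distance(p, centroids[j], lam))
            for p in profiles
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, label in zip(profiles, labels) if label == j]
            if members:  # keep the old centroid if a cluster runs empty
                centroids[j] = mean_profile(members)
    return labels, centroids
```

With `lam` close to 1, two time courses that differ only by a constant offset fall into the same cluster, whereas with `lam` close to 0 they are separated by expression level. This is the trade-off controlled by the mixing parameter.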
Putting all information together, we have for each gene its expression profile series, its pattern of transcription factor binding sites, its functional categories and the associated pathway(s) at hand. Genes that occur in the same gene expression cluster and have similar TFBS patterns and functional categories are therefore very likely to be coregulated. This finding can be verified using the KEGG database.
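The combination rule can be illustrated with a small sketch. The text does not specify how similarity of TFBS patterns or functional categories is measured, so the Jaccard index, the thresholds `min_tfbs` and `min_func`, the record layout and all function names below are assumptions made purely for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two collections, treated as sets (0.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def coregulation_candidates(genes, min_tfbs=0.5, min_func=0.5):
    """Return gene pairs that share a cluster and have similar annotations.

    genes maps a gene name to a dict with the keys
    'cluster' (expression cluster label), 'tfbs' (set of binding site
    profile names) and 'categories' (set of functional categories).
    """
    pairs = []
    for g1, g2 in combinations(sorted(genes), 2):
        a, b = genes[g1], genes[g2]
        if a["cluster"] != b["cluster"]:
            continue  # different expression clusters: not considered
        if (jaccard(a["tfbs"], b["tfbs"]) >= min_tfbs
                and jaccard(a["categories"], b["categories"]) >= min_func):
            pairs.append((g1, g2))
    return pairs
```

Candidate pairs obtained this way could then be compared against pathway membership, for example by checking whether both genes are annotated to the same KEGG pathway.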
Finally, we examine whether the results extracted from the different
biological databases are clearly supported by the underlying microarray experiment
and discard possible false positives resulting from the preceding steps.
In the future, we want to generalize this method so that
the clustering is applied simultaneously to all data representing a gene.