Hierarchical-Partitioning: A New Clustering Framework for Gene Expression Data Analysis
Alan Wee-Chung Liew1, Hong Yan2, Lap Keung Szeto
1itwcliew@cityu.edu.hk, Dept of Computer Engineering and Information Technology, City University of Hong Kong; 2ityan@cityu.edu.hk, Dept of Computer Engineering and Information Technology, City University of Hong Kong
Clustering has been shown to be a useful tool for the exploration of large-scale gene expression profiles. Traditional clustering techniques
can generally be classified into two categories, hierarchical and partitional. The former results in nested clusters, and the latter in non-nested
clusters. Hierarchical clustering algorithm transforms a pairwise dissimilarity matrix of patterns into a sequence of nested partitions, such as a
dendrogram. Partitional clustering, on the other hand, performs a partition of patterns into K clusters, such that patterns in a cluster are more
similar to each other than to patterns in different clusters. Both categories of clustering algorithms have their merits and weaknesses and both
have been used extensively in gene expression data study (see for example, the cluster analysis packages by Standford Microarray
Database: http://genome-www5.stanford.edu/MicroArray/SMD/restech.html). In our recent work, we attempt to combine the features of
both classes of algorithms using a novel hierarchical-partitioning framework which we called the binary hierarchical clustering. In essence,
our algorithm performs a successive binary subdivision of the data into smaller and smaller partitions in a hierarchical manner, until any further
splitting of a partition into two smaller partitions is insignificant anymore. The hierarchical structure is manifested in the binary tree structure
of the clustering result, where a parent node gives rise to two children nodes if the projection onto the optimal fisher discriminant axis satisfies
a certain threshold. The partitioning behavior of our algorithm is incorporated in the cluster splitting process, where the fuzzy C-means
clustering algorithm is used to split a parent cluster into two children clusters. The novel structure of our algorithm enables the cluster number
to be estimated based on a mathematically well-defined index. Moreover, the binary tree structure also allows a subjective judgment of the
number of cluster similar to that found in conventional hierarchical clustering. Meaningful visualization of the clustering result is important for
interpretation. Our clustering framework also leads to a tree structure representation, which display the nested relationship between
cluster partitions of finer and finer resolution. We believe that our algorithm will add to the set of valuable data analysis tools for the biologists.