Structural Classification in the Gene Ontology

Cliff Joslyn¹, Susan Mniszewski², Andy Fulmer, Gary Heaton
¹joslyn@lanl.gov, Los Alamos National Laboratory; ²smm@lanl.gov, Los Alamos National Laboratory

The use of ontological structures such as the Gene Ontology (GO) [1] is increasingly a standard part of a typical biologist's work day. We have been pursuing work in structural classification of the GO: given a list of genes of interest, how are they organized with respect to the GO? Are they centralized, dispersed, or grouped in one or more clusters? With respect to the biological functions which make up the GO, do the genes represent a collection of more general or more specific functions, a coherent collection of functions or distinct functions? Existing approaches to these questions [2,3] have relied on the statistics of how ontology nodes are generally populated, and/or on a distance based on the minimal path length between two nodes [4]. Our approach [5] is based on the following principles:

While we welcome the use of node statistics as supplementary information, it is also important to be able to provide answers when such information is lacking, that is, based only on the structural relations among the nodes in the ontology implicated by the genes of interest.
Ontologies such as the GO share with object-oriented hierarchies a common mathematical structure called a labeled poset: a collection of partially ordered sets (posets, equivalent to Directed Acyclic Graphs (DAGs)), each one representing a different semantic category. In the GO there are two posets, of is-a and has-part links respectively. Compared to more familiar mathematical structures such as trees, lattices, networks, or Euclidean spaces, we have fewer good intuitions about, and techniques for, posets.
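
As a rough illustration of the poset view, the sketch below stores a tiny hypothetical DAG as a child-to-parents map and tests whether two nodes are comparable, that is, whether one lies above the other. The node names and the helpers `ancestors` and `comparable` are invented for this example; they are not drawn from the GO or from our software.

```python
from typing import Dict, Set

# Hypothetical toy ontology: each node maps to its set of direct parents.
# Node "d" has two parents, which makes this a DAG and not a tree.
PARENTS: Dict[str, Set[str]] = {
    "root": set(),
    "a": {"root"},
    "b": {"root"},
    "c": {"a"},
    "d": {"a", "b"},
}


def ancestors(node: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return every strict ancestor of `node` (all nodes above it)."""
    seen: Set[str] = set()
    stack = list(parents[node])
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(parents[cur])
    return seen


def comparable(x: str, y: str, parents: Dict[str, Set[str]]) -> bool:
    """Two nodes are comparable if they are equal or one is above the other."""
    return x == y or x in ancestors(y, parents) or y in ancestors(x, parents)
```

In this toy structure `c` and `a` are comparable, while `c` and `d` are not, even though they share the parent `a`. In a tree every node has a single chain of ancestors; here `d` has two distinct upward paths.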

We present our approach to structural classification in the GO based on pseudo-distances in posets. Our system, the Gene Ontology Clusterer (GOC), uses pseudo-distances between comparable nodes only, in conjunction with scoring algorithms, to rank-order the GO nodes with respect to the requested genes. We will also present the lessons we've learned about working with the GO, in particular the following kinds of issues:

Classification in the GO involves two interrelated concepts: coverage is the idea that a given node should cover as many of the genes of interest as possible, while specificity is the idea that it should do so as precisely or specifically as possible. These ideas are conflicting: the top of the ontology always provides complete coverage but minimal specificity, while identifying any individual node containing a gene of interest provides complete specificity with minimal coverage. Identification of clusters is thus not unequivocal, but rather a user-dependent judgment about the tradeoff of specificity and coverage.
The GO is widely and legitimately championed as being superior to other systems in that it is DAG-based, and not a tree. But the consequences of this are not nearly as widely recognized. In particular, our intuitions about how ``levels'' work, and about the relations among nodes at different levels, can be quite deceptive; and statistical approaches which are quite clean in trees can become non-additive in DAGs.
Moreover, tree-based software dominates GO interfaces. Visualization of DAGs is especially important, for example to interpret the outputs of our GOC system.
Finally, there is a need for a broader mathematical and statistical analytical base of understanding of the GO, including such questions as the distributions of leaves and roots; the distribution of up-branching and down-branching; overall path length statistics; distribution of genes, both through the GO and with respect to multiple genes per node; and areas of tree-ish and lattice-like regions within the full GO.
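
The tension between coverage and specificity can be made concrete with a small sketch. It is not the GOC scoring algorithm: the toy DAG, the gene annotations, and the particular scores are all assumptions chosen for illustration. Coverage is taken as the fraction of requested genes annotated at or below a node, and specificity as the mean of 1/(1 + d), where d is the minimal number of links from a gene's node up to the candidate node, so only comparable nodes contribute.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple

# Hypothetical toy ontology (child -> direct parents) and gene annotations.
PARENTS: Dict[str, Set[str]] = {
    "root": set(),
    "a": {"root"},
    "b": {"root"},
    "c": {"a"},
    "d": {"a", "b"},
}
ANNOTATIONS: Dict[str, str] = {"g1": "c", "g2": "d", "g3": "d"}


def up_distances(node: str, parents: Dict[str, Set[str]]) -> Dict[str, int]:
    """Minimal link count from `node` up to itself (0) and each ancestor."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        cur = queue.popleft()
        for parent in parents[cur]:
            if parent not in dist:  # breadth-first, so first visit is minimal
                dist[parent] = dist[cur] + 1
                queue.append(parent)
    return dist


def score_nodes(
    genes: Iterable[str],
    annotations: Dict[str, str],
    parents: Dict[str, Set[str]],
) -> Dict[str, Tuple[float, float]]:
    """Return {node: (coverage, specificity)} under the assumed scores."""
    genes = list(genes)
    per_gene = [up_distances(annotations[g], parents) for g in genes]
    scores = {}
    for node in parents:
        # Distances from each covered gene's node up to this candidate node.
        dists = [d[node] for d in per_gene if node in d]
        coverage = len(dists) / len(genes)
        specificity = (
            sum(1.0 / (1 + x) for x in dists) / len(dists) if dists else 0.0
        )
        scores[node] = (coverage, specificity)
    return scores


scores = score_nodes(ANNOTATIONS, ANNOTATIONS, PARENTS)
```

Under these assumptions the root covers all three genes but with the lowest specificity, while `c` is fully specific yet covers only one gene; `a` and `d` sit in between, and which of them counts as the better cluster depends on how the user weighs the two scores.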

[1] Ashburner, M; Ball, CA; and Blake, JA et al.: (2000) ``Gene Ontology: Tool For the Unification of Biology'', Nature Genetics, 25:1, pp. 25-29
[2] Lord, Phillip; Stevens, Robert; and Brass, A et al.: (2002) ``Semantic Similarity Measures Across the Gene Ontology: Relating Sequence to Annotation'', in: Proc. Intelligent Systems for Molecular Biology (ISMB 02)
[3] Resnik, Philip: (1999) ``Semantic Similarity in a Taxonomy: An Information-Based Measure and Its Application to Problems in Ambiguity in Natural Language'', J. Artificial Intelligence Research, v. 11, pp. 95-130
[4] Rada, Roy; Mili, Hafedh; and Bicknell, E et al.: (1989) ``Development and Application of a Metric on Semantic Nets'', IEEE Trans. on Systems, Man and Cybernetics, 19:1, pp. 17-30
[5] Joslyn, Cliff; Mniszewski, Susan; and Fulmer, A et al.: (2003) ``Measures on Ontological Spaces of Biological Function'', Pacific Symposium on Biocomputing (PSB 03), ftp://ftp.c3.lanl.gov/pub/users/joslyn/psb03f.pdf