Structural Classification in the Gene Ontology
Cliff Joslyn1, Susan Mniszewski2, Andy Fulmer, Gary Heaton
1joslyn@lanl.gov, Los Alamos National Laboratory; 2smm@lanl.gov, Los Alamos National Laboratory
Ontological structures such as the Gene
Ontology (GO) [1] are increasingly a standard part of a typical biologist's
work day. We have been pursuing work in structural classification of the
GO: given a list of genes of interest, how are they organized with respect
to the GO? Are they centralized, dispersed, grouped in one or more clusters?
With respect to the biological functions which make up the GO, do the genes
represent a collection of more general or more specific functions, a coherent
collection of functions or distinct functions? Existing approaches to these
questions [2,3] rely on statistics of how ontology nodes are
generally populated, and/or use a distance based on the minimal path length
between two nodes [4]. Our approach [5] is based on the following principles:
-
While we welcome the use of node statistics as supplementary information,
it is also important to be able to provide answers when such information
is lacking, that is, based only on the structural relations among
the nodes in the ontology implicated by the genes of interest.
-
Ontologies such as the GO share with object-oriented hierarchies a common
mathematical structure called a labeled poset: a collection of partially
ordered sets (posets, equivalent to Directed Acyclic Graphs (DAGs)), each
one representing a different semantic category. In the GO there are two
posets, one of is-a links and one of has-part links. Compared to more
familiar mathematical structures such as trees, lattices, networks, or
Euclidean spaces, we have fewer good intuitions and techniques for posets.
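As a minimal illustration of the poset view, the sketch below stores a toy is-a hierarchy as a DAG and tests whether two nodes are comparable, that is, whether one lies above the other in the partial order. The node names, the `PARENTS` table, and both function names are hypothetical and chosen only for illustration; this is not real GO data and not the GOC implementation.

```python
from collections import deque

# Hypothetical toy is-a hierarchy: child -> set of parents (not real GO terms).
PARENTS = {
    "root": set(),
    "a": {"root"},
    "b": {"root"},
    "c": {"a", "b"},  # two parents, which cannot happen in a tree
    "d": {"c"},
    "e": {"b"},
}

def ancestors(node, parents=PARENTS):
    """Return all nodes strictly above `node` in the partial order."""
    seen = set()
    queue = deque(parents[node])
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(parents[current])
    return seen

def comparable(x, y, parents=PARENTS):
    """True when x lies above y, y lies above x, or they are the same node."""
    return x == y or x in ancestors(y, parents) or y in ancestors(x, parents)
```

In this toy example `d` is comparable to `a` because `a` is one of its ancestors, while `d` and `e` are non-comparable even though they share the ancestor `b`. Non-comparable pairs of this kind are what distinguish a general poset from a simple chain.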
We present our approach to structural classification in the GO based on
pseudo-distances in posets. Our system, the Gene Ontology Clusterer (GOC),
uses pseudo-distances
between comparable nodes only, in conjunction with scoring algorithms,
to rank-order the GO nodes with respect to the requested genes. We will
also present the lessons we've learned about working with the GO, in particular
the following kinds of issues:
-
Classification in the GO involves two interrelated concepts: coverage
is the idea that a given node should cover as many of the genes of interest
as possible, while specificity is the idea that it should do so
as precisely or specifically as possible. These ideas conflict:
the top of the ontology always provides complete coverage but minimal specificity,
while identifying any individual node containing a gene of interest provides
complete specificity with minimal coverage. Identification of clusters
is thus not unequivocal, but rather a user-dependent judgment about the
tradeoff of specificity and coverage.
-
The GO is widely and legitimately championed as being superior to other
systems in that it is DAG-based, and not a tree. But the consequences
of this are not nearly as widely recognized. In particular, our intuitions
about how ``levels'' work, and about the relations among nodes at
different levels, can be quite deceptive; and statistical approaches which
are quite clean in trees can become non-additive with DAGs.
-
Moreover, tree-based software dominates GO interfaces. Visualization of
DAGs is especially important, for example to interpret the outputs of our
GOC system.
-
Finally, there is a need for a broader mathematical and statistical
base for understanding the GO, including such questions as the distributions
of leaves and roots; the distribution of up-branching and down-branching;
overall path length statistics; the distribution of genes, both through the
GO and with respect to multiple genes per node; and the extent of tree-like and
lattice-like regions within the full GO.
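Several of the issues above can be made concrete on a toy example. The sketch below assumes a small hypothetical DAG and a hypothetical set of gene annotations (neither is real GO data), and assumes one simple reading of coverage: the fraction of the genes of interest annotated at or below a node. It also lists every upward path length from a node to the root, and counts roots, leaves, and multi-parent nodes. All names here are illustrative, and none of this is claimed to be the scoring used by GOC.

```python
# Hypothetical toy DAG (child -> parents) and gene annotations; not real GO data.
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["a"],
    "c": ["root", "b"],  # reachable from the root by paths of length 1 and 3
    "d": ["c"],
}
ANNOTATIONS = {"g1": {"b"}, "g2": {"c"}, "g3": {"d"}, "g4": {"a"}}

def up_set(node):
    """Return `node` together with all of its ancestors."""
    result = {node}
    for parent in PARENTS[node]:
        result |= up_set(parent)
    return result

def coverage(node, genes):
    """Fraction of `genes` annotated to `node` or to one of its descendants."""
    covered = [g for g in genes
               if any(node in up_set(term) for term in ANNOTATIONS[g])]
    return len(covered) / len(genes)

def path_lengths_to_root(node):
    """Sorted lengths of all upward paths from `node` to a root."""
    if not PARENTS[node]:
        return [0]
    return sorted(1 + n for p in PARENTS[node] for n in path_lengths_to_root(p))

def structure_summary():
    """List the roots, the leaves, and the nodes with more than one parent."""
    has_child = {p for ps in PARENTS.values() for p in ps}
    return {
        "roots": sorted(n for n, ps in PARENTS.items() if not ps),
        "leaves": sorted(n for n in PARENTS if n not in has_child),
        "multi_parent": sorted(n for n, ps in PARENTS.items() if len(ps) > 1),
    }
```

With all four toy genes as the list of interest, the root covers every gene, `c` covers half of them, and `d` covers a quarter, which shows coverage falling as nodes become more specific. Node `c` sits at path lengths 1 and 3 from the root, so it has no single well-defined ``level'', unlike a node in a tree.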
[1] Ashburner, M; Ball, CA; and Blake, JA et al.: (2000) ``Gene
Ontology: Tool For the Unification of Biology'', Nature Genetics,
25:1, pp. 25-29
[2] Lord, Phillip; Stevens, Robert; and Brass, A et al.: (2002)
``Semantic Similarity Measures Across the Gene Ontology: Relating Sequence
to Annotation'', in: Proc. Intelligent Systems for Molecular Biology
(ISMB 02)
[3] Resnik, Philip: (1999) ``Semantic Similarity in a Taxonomy: An
Information-Based Measure and Its Application to Problems in Ambiguity
in Natural Language'', J. Artificial Intelligence Research, v.
11, pp. 95-130
[4] Rada, Roy; Mili, Hafedh; and Bicknell, E et al.: (1989)
``Development and Application of a Metric on Semantic Nets'', IEEE
Trans. on Systems, Man and Cybernetics, 19:1, pp. 17-30
[5] Joslyn, Cliff; Mniszewski, Susan; and Fulmer, A et al.:
(2003) ``Measures on Ontological Spaces of Biological Function'', Pacific
Symposium on Biocomputing (PSB 03), ftp://ftp.c3.lanl.gov/pub/users/joslyn/psb03f.pdf