cSAGE and the Serial Analysis of Gene Expression in Arabidopsis thaliana
Christopher T Lewis1, Stephen Robinson2, Tony Kusalik, Isobel AP Parkin
1LewisCT@agr.gc.ca, Agriculture and Agri-food Canada; 2RobinsonS@agr.gc.ca, Agriculture and Agri-food Canada
The Serial Analysis of Gene Expression (SAGE) is based on the premise that a short signature sequence
derived from the 3’ UTR of a transcript is sufficient to uniquely identify a gene within a sequenced
organism. The protocol employs a series of restrictions and ligations to acquire the specific fragments
(SAGE tags). The efficiency of the protocol is enhanced as pairs of SAGE tags are ligated to form ditags,
which are amplified and concatenated to form ditag chains before they are cloned and sequenced.
The resulting sequence reads contain 400-750 bases consisting of 16-28 ditags separated by the anchoring
enzyme's recognition sequence (e.g. 'CATG' for NlaIII). Valid ditags are extracted from the sequence read
and the frequency of individual SAGE tags within the library is determined, which provides an accurate
quantitative estimate of the transcriptome. Valid ditags have a defined length (24-26 bases) and must not
contain two identical tags. Duplicate ditags can arise legitimately from highly expressed genes, but they
are excluded from further analysis because they may also result from biased PCR amplification. Errors within the
analysis might occur due to infidelity of DNA replication during the PCR or sequencing reactions.
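The two validity rules stated above (a length of 24-26 bases and no identical tags) can be illustrated with a minimal C sketch. It is not taken from the cSAGE source. The function name `is_valid_ditag`, the `tag_len` parameter, and the interpretation that the left tag is the first `tag_len` bases while the right tag is the reverse complement of the last `tag_len` bases are all assumptions made for illustration. Whether the 24-26 base length counts the anchoring site is not stated in the text, so the caller decides what string to pass in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MIN_DITAG_LEN 24   /* length bounds quoted in the text */
#define MAX_DITAG_LEN 26

/* Complement of one base; any unrecognised symbol maps to 'N'. */
static char complement(char base)
{
    switch (base) {
    case 'A': return 'T';
    case 'C': return 'G';
    case 'G': return 'C';
    case 'T': return 'A';
    default:  return 'N';
    }
}

/*
 * Hypothetical helper: returns true when `ditag` has an accepted length
 * and its two tags differ.
 * Assumption: the left tag is the first tag_len bases, and the right tag
 * is the reverse complement of the last tag_len bases.
 */
static bool is_valid_ditag(const char *ditag, size_t tag_len)
{
    assert(ditag != NULL);

    size_t len = strlen(ditag);
    if (len < MIN_DITAG_LEN || len > MAX_DITAG_LEN)
        return false;
    if (tag_len == 0 || tag_len > len / 2)
        return false;

    /* Compare the left tag with the reverse complement of the right end. */
    for (size_t i = 0; i < tag_len; i++) {
        if (ditag[i] != complement(ditag[len - 1 - i]))
            return true;    /* the tags differ at position i */
    }
    return false;           /* both tags are identical */
}
```

Under these assumptions, a call such as `is_valid_ditag(ditag, 10)` rejects a ditag whose two 10-base tags are the same sequence read from opposite strands, as well as any ditag outside the 24-26 base range.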
cSAGE provides an efficient mechanism for extracting SAGE tags and matching them with virtual SAGE
tags derived from DNA sequence databases. SAGE tags are extracted from the sequence reads in linear
time using a state machine, and stored in a 5-ary tree with nodes representing the bases {A,C,T,G,N}. This
tree enables rapid detection of duplicate ditags and efficient tag-to-gene matching. Virtual SAGE tags
extracted from the DNA databases are used to search the tree for matches. Known vector and linker tags
can be excluded from analysis by placing them in an "exclude" file. Sequence reads may be in either
FASTA or PHD format (output from Phred), and DNA database sequences must be in FASTA format.
PHD format sequence reads allow the experimental SAGE tags to be screened for sequence quality prior to
analysis.
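The 5-ary tree described above can be pictured as a trie in which each node has one child per symbol in {A,C,T,G,N}. The following C sketch is an illustration under stated assumptions, not the cSAGE implementation. The type `TagNode` and the functions `tag_node_new`, `tag_tree_insert`, `tag_tree_count` and `tag_tree_free` are hypothetical names, and storing a per-node counter is an assumed way of supporting both frequency counting and duplicate detection.

```c
#include <assert.h>
#include <stdlib.h>

#define ALPHABET 5   /* A, C, T, G, N, in the order given in the text */

/* Hypothetical node type: one child per base plus an occurrence count. */
typedef struct TagNode {
    struct TagNode *child[ALPHABET];
    unsigned long count;   /* times a sequence ending here was inserted */
} TagNode;

/* Map a base to a child slot; unknown symbols share the 'N' slot. */
static int base_index(char base)
{
    switch (base) {
    case 'A': return 0;
    case 'C': return 1;
    case 'T': return 2;
    case 'G': return 3;
    default:  return 4;
    }
}

/* Allocate an empty node; returns NULL if allocation fails. */
static TagNode *tag_node_new(void)
{
    TagNode *node = malloc(sizeof *node);
    if (node != NULL) {
        for (int i = 0; i < ALPHABET; i++)
            node->child[i] = NULL;
        node->count = 0;
    }
    return node;
}

/* Insert a sequence; returns its updated count, or 0 if allocation fails. */
static unsigned long tag_tree_insert(TagNode *root, const char *tag)
{
    assert(root != NULL && tag != NULL);

    TagNode *node = root;
    for (; *tag != '\0'; tag++) {
        int i = base_index(*tag);
        if (node->child[i] == NULL) {
            node->child[i] = tag_node_new();
            if (node->child[i] == NULL)
                return 0;
        }
        node = node->child[i];
    }
    return ++node->count;
}

/* Look up a sequence; returns 0 if it was never inserted. */
static unsigned long tag_tree_count(const TagNode *root, const char *tag)
{
    const TagNode *node = root;
    for (; node != NULL && *tag != '\0'; tag++)
        node = node->child[base_index(*tag)];
    return node != NULL ? node->count : 0;
}

/* Release a tree and all of its descendants. */
static void tag_tree_free(TagNode *node)
{
    if (node == NULL)
        return;
    for (int i = 0; i < ALPHABET; i++)
        tag_tree_free(node->child[i]);
    free(node);
}
```

Insertion and lookup each take time proportional to the sequence length, independent of how many sequences are stored. In this sketch, a return value greater than 1 from `tag_tree_insert` indicates that the same sequence has been seen before, which is one way a duplicate ditag could be flagged. Looking up a virtual tag with `tag_tree_count` corresponds to the tag-to-gene search described in the text.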
Highlights from a cold acclimation experiment in A. thaliana using the SAGE protocol and cSAGE
include 92,290 valid ditags containing 184,580 SAGE tags, of which 146,178 had an average Phred
quality greater than 20. Removal of polyA and linker tags left a final set of 145,170 tags. This set
contained 29,663 (20.4%) unique tags, of which 16,664 (11.5%) were present as singletons. Of the unique
tags, 89% matched a gene: 46% matched the canonical (3'-most) recognition site and 43% matched a
non-canonical site. Non-canonical matches can be explained in four ways: incomplete digestion of the mRNA,
alternative splicing of the gene, misannotation of the gene, or antisense transcription of the gene. Alternative
splicing has been confirmed for a small number of the non-canonical matches.
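The quality screen reported above (an average Phred quality greater than 20) could be expressed as the short C sketch below. It is an illustration only. The function name `tag_passes_quality` is hypothetical, and the use of a plain arithmetic mean over the per-base quality values of a single tag is an assumption, since the text does not say how the average is computed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MIN_MEAN_QUALITY 20.0   /* threshold quoted in the text */

/*
 * Hypothetical helper: returns true when the arithmetic mean of the
 * per-base Phred quality values of a tag is strictly greater than 20.
 * Assumption: `quality` holds one integer score per base of the tag.
 */
static bool tag_passes_quality(const int *quality, size_t n)
{
    assert(quality != NULL);

    if (n == 0)
        return false;

    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += quality[i];

    return (double)sum / (double)n > MIN_MEAN_QUALITY;
}
```

Because the comparison is strict, a tag whose mean quality is exactly 20 is rejected in this sketch, matching the "greater than 20" wording.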
cSAGE is an open-source, freely available application written in C. It is intended to fit into a larger
analysis pipeline; for instance, a Perl script is used to compare two cSAGE reports and display tags with a
significant change in expression. A modular design facilitates the extension of cSAGE for new
applications. For more information on cSAGE, see http://homepage.usask.ca/~ctl271/csage.