BOSS: Boxes of Sequence Similarity
Robert Flegg (1), Malcolm Simons (2)
(1) robert.flegg@med.monash.edu.au, GeneType Pty. Ltd., Fitzroy Vic 3065, Australia and Victorian Bioinformatics Consortium, PO Box 53, Monash University, Clayton Vic 3800, Australia
(2) mjsimons@optusnet.com.au,
The phylogenetic analysis of an alignment assumes that the sequences are derived from
a common ancestor. Recombination confounds this analysis and leads to an alignment
showing a mosaic of blocks with different evolutionary histories. Regions bounded by
recombination events do not conform to gene or intron/exon boundaries and may be
extensive or relatively short. One of us (MJS) has coined the term "phylon" to name
these blocks of evolutionary history.
From an evolutionary point of view, to designate a region as a phylon implies that the
sequences in this region share a common ancestral sequence and that no recombination
event has occurred within it. From an operational point of view, a phylon constitutes
a region within an aligned set, or sub-set, of sequences in which the sequences share
a high level of sequence similarity.
It is desirable to identify within a multiple sequence alignment the locations of
recombination events and the set of sequences to which they apply. This poster
describes a program that performs this analysis for very similar sequences. The
alignment used as an illustration is drawn from the MHC region of the human genome.
The MHC region is rich in duplication, recombination and mutation. The level of
sequence similarity is very high and the number of alleles known at different loci
ranges from 511 at the B locus to only 1 at the F locus. The alignment here consists
of 28 sequences spread across 5 major alleles, extends for 320 bases and includes 67
variable positions, that is, positions at which at least one sequence differs from the
others. A phylogenetic analysis
of the block as a whole fails to describe the intricate detail of this alignment.
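To make the notion of a variable position concrete, the following is a minimal Python
sketch, not part of BOSS, that lists the alignment columns in which the sequences do
not all carry the same character. The function name variable_positions is
hypothetical, columns are numbered from 0, and a gap character is assumed to count as
just another symbol.

```python
def variable_positions(aligned):
    """Return the 0-based columns where the aligned sequences are not all identical.

    `aligned` is a list of equal-length strings. Gap characters are treated
    like any other symbol (an assumption made for this sketch).
    """
    return [
        index
        for index, column in enumerate(zip(*aligned))
        if len(set(column)) > 1
    ]
```

For example, variable_positions(["ACGT", "ACGA", "TCGA"]) reports columns 0 and 3.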
The program is named BOSS, Boxes of Sequence Similarity, and is written in C using the
EMBOSS libraries. The program analyses a sequence alignment by measuring the level of
sequence similarity between every pair of sequences within a sliding window. Regions
with similarity above a user-selected threshold are stored. In the next phase this
pairwise similarity information is parsed to identify the blocks or boxes within the
alignment. Each box consists of a set of sequences that share similarity above the
threshold and the region in the alignment for which this applies.
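The two phases described above can be sketched in Python as follows. This is an
illustration under stated assumptions, not the BOSS source, which is written in C.
Similarity is assumed here to be the fraction of identical columns in the window, a
box is assumed to require every pair of its sequences to meet the threshold, and the
candidate set of sequences is supplied by the caller. The names window_identity,
similar_windows and box_regions are hypothetical.

```python
from itertools import combinations


def window_identity(a, b, start, width):
    """Fraction of identical columns between a and b in [start, start + width)."""
    matches = sum(x == y for x, y in zip(a[start:start + width], b[start:start + width]))
    return matches / width


def similar_windows(seqs, width, threshold):
    """Phase 1: map each pair of names to the window starts that meet the threshold.

    `seqs` maps a sequence name to its aligned string; all strings have equal length.
    """
    length = len(next(iter(seqs.values())))
    result = {}
    for p, q in combinations(sorted(seqs), 2):
        result[(p, q)] = {
            start
            for start in range(length - width + 1)
            if window_identity(seqs[p], seqs[q], start, width) >= threshold
        }
    return result


def box_regions(members, pair_windows, width):
    """Phase 2: column ranges over which every pair in `members` stays similar.

    Returns a list of (first_column, last_column) tuples, 0-based and inclusive.
    Consecutive qualifying windows are merged into one region.
    """
    pairs = list(combinations(sorted(members), 2))
    if not pairs:
        raise ValueError("a box needs at least two sequences")
    shared = set.intersection(*(pair_windows[pair] for pair in pairs))
    regions = []
    previous = None
    for start in sorted(shared):
        if previous is not None and start == previous + 1:
            regions[-1][1] = start + width - 1  # extend the current region
        else:
            regions.append([start, start + width - 1])  # open a new region
        previous = start
    return [tuple(region) for region in regions]
```

With three toy sequences A = "AAAAAAAA", B = "AAAAAAAA" and C = "AAAACCCC", a window
of 4 and a threshold of 1.0, the set {A, B} yields a box over columns 0 to 7, while
{A, B, C} yields a shorter box over columns 0 to 3.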
The program itself has no inherent limit on the number of sequences or the length of
sequence that can be analysed. The program lists all of the boxes present in the
alignment that satisfy the threshold criterion. A number of properties are derived for
each box. The properties include the sequences that define the box, its extent within
the alignment and a list of those base positions that can be used to characterise the
box.
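One plausible reading of "base positions that can be used to characterise the box" is
sketched below in Python. The criterion is an assumption, not taken from BOSS: a
column is treated as characteristic when every sequence in the box carries the same
base and no sequence outside the box carries that base. The function name
characteristic_positions is hypothetical and columns are numbered from 0.

```python
def characteristic_positions(seqs, members, first, last):
    """Columns in [first, last] that distinguish the box members from all others.

    `seqs` maps a sequence name to its aligned string; `members` is the set of
    names in the box. A column qualifies when the members agree on one base and
    no non-member shares it (the assumed criterion for this sketch).
    """
    outside = [name for name in seqs if name not in members]
    positions = []
    for column in range(first, last + 1):
        inside_bases = {seqs[name][column] for name in members}
        if len(inside_bases) != 1:
            continue  # the members disagree here
        base = next(iter(inside_bases))
        if all(seqs[name][column] != base for name in outside):
            positions.append(column)
    return positions
```

Note that if the box contains every sequence in the alignment, each conserved column
qualifies under this criterion, so a stricter rule may be preferable in that case.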
The length and depth of each box tell us about the relationships between the included
sequences. BOSS finds boxes that span the whole of the alignment, one for each of the
major alleles. Within these boxes the sequences have a very high level of similarity;
indeed, that similarity may extend for thousands of bases beyond the region considered
here. BOSS also finds boxes that span shorter regions but that include sequences from
several major allele classes. These boxes identify regions within the alignment where
the signature of a common ancestor is still present, and the boundaries of these boxes
represent the sites of recombination events.