On the Sequence Pattern Distribution in Splice Junctions: An Analysis Using Information Theoretic and Machine Learning Approaches
Christina Zheng1, Virginia R de Sa2, Michael Gribskov, T. Murlidharan Nair
1nair@sdsc.edu, UCSD SDSC; 2desa@cogsci.ucsd.edu, UCSD
Recognition of precise splice junctions is a challenge faced in the
analysis of newly sequenced genomes. This challenge is compounded by the
fact that the distribution of sequence patterns in these regions is not
always distinct. To understand the sequence signatures at the splice
junctions, a neural network based calliper randomization approach and an
information theoretic feature selection approach have been used to
analyze the sequences in these regions. The aim is to identify the
regions that carry information content and to extract elements that are
relevant for splice site prediction.
Results: The analysis of the sequences at the splice junction using a
neural network based calliper randomization approach reveals the regions
that are important in the internal representation of the network model.
Further, analysis of the region using the feature selection approach
revealed a subset of features in which the information is concentrated.
Comparing the results of the two methods helps to infer the kind of
information present in the region.
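
As a rough illustration of what an information theoretic ranking of
positions can look like, the sketch below scores each aligned position by
its mutual information with the class label (true junction versus decoy).
This is an assumption made for illustration and is not necessarily the
feature selection criterion used in this work; the function name
position_mutual_information and the toy sequences are hypothetical and
are not data from the study.

```python
import math
from collections import Counter


def position_mutual_information(sequences, labels):
    """Return I(X_i; Y) in bits for each aligned position i.

    sequences: equal-length strings over the nucleotide alphabet.
    labels:    one class label per sequence (e.g. 1 = junction, 0 = decoy).
    """
    n = len(sequences)
    length = len(sequences[0])
    label_counts = Counter(labels)
    scores = []
    for i in range(length):
        # Marginal counts of the base at position i, and joint counts with the label.
        base_counts = Counter(seq[i] for seq in sequences)
        joint_counts = Counter((seq[i], y) for seq, y in zip(sequences, labels))
        mi = 0.0
        for (base, y), count in joint_counts.items():
            p_xy = count / n
            p_x = base_counts[base] / n
            p_y = label_counts[y] / n
            mi += p_xy * math.log2(p_xy / (p_x * p_y))
        scores.append(mi)
    return scores


# Toy, made-up alignment: positions 1 and 2 separate the two classes,
# while positions 0, 3 and 4 have the same base distribution in both.
toy_sequences = ["AGGTA", "CGGTA", "TGGTG", "GGGTA",   # labelled 1
                 "ACATA", "CTCTA", "TAATG", "GCTTA"]   # labelled 0
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

scores = position_mutual_information(toy_sequences, toy_labels)
# Rank positions from most to least informative about the label.
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

In this toy alignment, positions whose base composition differs between
the classes receive high scores, and positions with identical composition
in both classes score zero. A per-position score of this kind ignores
dependencies between positions, which is one reason to compare it with
what a network model finds important.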