New approach to build models for predicting prokaryotic genes

Chungoo Park1, Mihwa Park2, Jongwon Chang, Jeongho Huh, Dong Soo Jung, Hong Gil Nam, Young Bock Lee, Jiin Choi, Seungsik Yoo, Jaewoo Kim
1 madreach@bric.postech.ac.kr, Biological Research Information Center, Pohang University of Science and Technology; 2 bfpark@posdata.co.kr, Solution Development Research Institute, POSDATA

As the pace of prokaryotic genome sequencing projects increases, accurate prokaryotic gene prediction is becoming more important. Current computational gene prediction tools, such as GeneMark [1] and GLIMMER [2], frequently employ statistical methods based on Markov models. The performance of these statistical methods, including GLIMMER, the most popular prokaryotic gene prediction tool, depends on the training data. The existing GLIMMER uses long ORFs as training data for its model. To predict protein-coding regions more accurately, users of the existing GLIMMER have been advised to incorporate prior biological knowledge about known genes. However, the main goal of genome annotation is an exhaustive gene search, which eventually leads to the discovery of surprising and atypical features. From this point of view, it is inappropriate to rely on known genes to increase gene prediction accuracy. Here we propose a new method for increasing gene prediction accuracy without using information from known genes. Our goal is to identify additional protein-coding regions that cannot be predicted by the existing GLIMMER trained on long ORFs. To increase gene prediction accuracy, we supplemented the training data using a phylogenetic concept. Tests on three complete prokaryotic genomes performed with the GLIMMER program demonstrate the ability of the new approach to detect additional genes. This approach will help predict candidate genes in future prokaryotic whole-genome sequencing projects.

1. Borodovsky M. and McIninch J., GeneMark: parallel gene recognition for both DNA strands, Computers and Chemistry, 1993, Vol. 17, No. 19, pp. 123-133.
2. Delcher A.L., Harmon D., Kasif S., White O. and Salzberg S.L., Improved microbial gene identification with GLIMMER, Nucleic Acids Research, 1999, Vol. 27, pp. 4636-4641.