Intimately Incorporated NLP System Adapted for Bio-Text Mining

Young-Sook Hwang1, Hae-Chang Rim2, Kyoung-Me Park, Ki-Joong Lee, Hong-Woo Chun
1yshwang@nlp.korea.ac.kr, Korea Univ.; 2rim@nlp.korea.ac.kr, Korea Univ.

BioNLPro provides a foundation for robust bio-text mining. It is a tightly integrated NLP system whose core modules are adapted to the peculiarities of biological text: a POS tagger, a biological term recognizer, a chunking-based grammatical relation tagger, and a biological event extractor. The HMM-based POS tagger is adapted to tag biological terms as well as general words, and it works together with a preprocessor, the biological term identifier, which recognizes the boundary of each term and assigns a tag to it. The POS-tagged sentence is then passed to an SVM-based biological term classifier, which assigns each term to a semantic class such as gene, protein, or enzyme. Using the result of biological term recognition, the system performs base phrase chunking, including PP attachment. The grammatical relation tagger then detects dependencies between base phrases and assigns a grammatical relation to each. Finally, the event extractor, designed around the patterns of verbs, extracts protein-protein or gene-gene interactions from the designated sentences. To train the biological term recognizer and the event extractor, we built biological term dictionaries from public databases such as Swiss-Prot, PDB, GenBank, and SGD, and annotated yeast-domain MEDLINE abstracts with terms and events using an integrated resource construction tool.
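
The stage ordering described above can be illustrated with a minimal sketch. This is not the BioNLPro implementation: every function name, the toy dictionary, and the verb list are hypothetical, and simple lookups stand in for the HMM tagger and the SVM classifier. Chunking and grammatical relation tagging are omitted, so the toy event extractor matches adjacent tokens rather than base phrases.

```python
from typing import Callable, Dict, List

Doc = Dict[str, list]

# Hypothetical dictionary entries and verb patterns, for illustration only.
TOY_TERMS = {"ProtA": "protein", "ProtB": "protein", "GeneX": "gene"}
TOY_VERBS = {"binds", "activates"}


def identify_terms(doc: Doc) -> Doc:
    """Preprocessor: mark which tokens are biological terms (dictionary lookup)."""
    doc["is_term"] = [tok in TOY_TERMS for tok in doc["tokens"]]
    return doc


def tag_pos(doc: Doc) -> Doc:
    """Stand-in for the HMM POS tagger: uses the term marks from the preprocessor."""
    tags = []
    for tok, is_term in zip(doc["tokens"], doc["is_term"]):
        if is_term:
            tags.append("TERM")
        elif tok in TOY_VERBS:
            tags.append("VERB")
        else:
            tags.append("OTHER")
    doc["pos"] = tags
    return doc


def classify_terms(doc: Doc) -> Doc:
    """Stand-in for the SVM classifier: assign a semantic class to each term."""
    doc["term_class"] = [
        TOY_TERMS[tok] if is_term else None
        for tok, is_term in zip(doc["tokens"], doc["is_term"])
    ]
    return doc


def extract_events(doc: Doc) -> Doc:
    """Toy verb-pattern extractor: TERM VERB TERM yields one interaction."""
    tokens, pos = doc["tokens"], doc["pos"]
    events = []
    for i in range(1, len(tokens) - 1):
        if pos[i] == "VERB" and pos[i - 1] == "TERM" and pos[i + 1] == "TERM":
            events.append((tokens[i - 1], tokens[i], tokens[i + 1]))
    doc["events"] = events
    return doc


# Stages run in the order the abstract describes (chunking and GR tagging omitted).
PIPELINE: List[Callable[[Doc], Doc]] = [
    identify_terms,
    tag_pos,
    classify_terms,
    extract_events,
]


def run_pipeline(sentence: str) -> Doc:
    doc: Doc = {"tokens": sentence.split()}
    for stage in PIPELINE:
        doc = stage(doc)
    return doc


if __name__ == "__main__":
    result = run_pipeline("ProtA binds ProtB")
    print(result["term_class"])
    print(result["events"])
```

Each stage reads annotations written by the stages before it, which is one simple way to model the tight coupling between modules: the POS tagger depends on the term identifier's output, and the event extractor depends on both.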