MutationMiner: A Graph Theoretic Approach to Extract Point Mutation Data from Biomedical Literature
Lawrence C. Lee1, Florence Horn2, Fred E. Cohen
1lle8@itsa.ucsf.edu, University of California San Francisco; 2horn@cmpharm.ucsf.edu, University of California San Francisco
MutationMiner is a program that automates the extraction of point mutation data from biomedical literature. Its goal is to efficiently extract point mutation information from a large number of journal articles about any protein family and to store that information in a database. Researchers can then query the database for a particular protein to find point mutations that have been studied experimentally, and read the corresponding source article to learn the effects of each mutation. This spares the researcher from reading up to thousands of journal articles to find relevant mutation information about a target protein. The aggregate data stored in the database can also be used to infer structure-function relationships by examining frequently mutated residues. MutationMiner is implemented in two parts: an Information Retrieval (IR) component and an Information Extraction (IE) component. Given a target protein family, the IR component queries PubMed for a set of relevant articles and downloads the full-text PDF files from the article source. It also retrieves SwissProt entries for all proteins within the protein family. The IE component searches the full text of each article using regular expressions to find strings denoting point mutations, protein names, and organism names. It then uses a graph theoretic approach to associate each point mutation with an organism name and a protein name, and uses this information to corroborate the mutation against the protein's SwissProt entry. MutationMiner can search over one thousand journal articles in 24 hours.
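The IE steps described above can be illustrated with a minimal Python sketch. The abstract does not give the actual regular expressions or the graph construction, so everything here is an assumption made for illustration: the mutation pattern (one-letter wild-type residue, position, one-letter mutant residue, as in "A2T"), the function names, and the stand-in graph, which is a weighted bipartite graph linking each mutation mention to each candidate name, with the character distance between them as the edge weight. The lightest edge selects the association, and the mutation is then checked against a sequence such as one taken from a SwissProt entry.

```python
import re

# Assumed notation: wild-type residue, 1-based position, mutant residue (e.g. "A2T").
AMINO = "ACDEFGHIKLMNPQRSTVWY"
MUTATION_RE = re.compile(rf"\b([{AMINO}])([1-9]\d*)([{AMINO}])\b")


def find_mutations(text):
    """Return (mutation_string, start_offset) for every match in the text."""
    return [(m.group(0), m.start()) for m in MUTATION_RE.finditer(text)]


def find_names(text, names):
    """Return (name, start_offset) for every occurrence of each known name."""
    hits = []
    for name in names:
        for m in re.finditer(re.escape(name), text, flags=re.IGNORECASE):
            hits.append((name, m.start()))
    return hits


def build_graph(mutations, names):
    """Build a weighted bipartite graph as a dict of dicts.

    Each mutation mention (string, offset) maps to {name: weight}, where the
    weight is the smallest character distance to any mention of that name.
    This proximity weighting is a hypothetical stand-in, not the paper's method.
    """
    graph = {}
    for mutation, m_pos in mutations:
        edges = graph.setdefault((mutation, m_pos), {})
        for name, n_pos in names:
            distance = abs(m_pos - n_pos)
            if name not in edges or distance < edges[name]:
                edges[name] = distance
    return graph


def associate(graph):
    """Pick the lightest edge for each mutation mention."""
    return [
        (mutation, min(edges, key=edges.get))
        for (mutation, _), edges in graph.items()
        if edges
    ]


def corroborates(mutation, sequence):
    """Check that the sequence has the wild-type residue at the stated position."""
    wild_type, position = mutation[0], int(mutation[1:-1])
    return 1 <= position <= len(sequence) and sequence[position - 1] == wild_type


def extract(text, proteins, organisms, sequences):
    """Return (mutation, protein, organism, corroborated) tuples for one article."""
    mutations = find_mutations(text)
    protein_of = dict(associate(build_graph(mutations, find_names(text, proteins))))
    organism_of = dict(associate(build_graph(mutations, find_names(text, organisms))))
    results = []
    for mutation, _ in mutations:
        protein = protein_of.get(mutation)
        organism = organism_of.get(mutation)
        confirmed = protein in sequences and corroborates(mutation, sequences[protein])
        results.append((mutation, protein, organism, confirmed))
    return results


# Made-up example text, names, and sequences, used only to exercise the sketch.
EXAMPLE_TEXT = (
    "In mouse ProtA, mutant A2T was inactive in every assay. "
    "In yeast ProtB, mutant G3D was active."
)
EXAMPLE_SEQUENCES = {"ProtA": "MAKL", "ProtB": "MKGD"}

if __name__ == "__main__":
    for row in extract(EXAMPLE_TEXT, ["ProtA", "ProtB"], ["mouse", "yeast"], EXAMPLE_SEQUENCES):
        print(row)
```

Checking the wild-type residue against the sequence is what lets a wrong association be detected: a mutation string paired with the wrong protein will usually name a residue that the sequence does not have at that position.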