Detecting protein-protein interactions from the abundant literature is increasingly important in the post-genomic era. To date, Natural Language Processing (NLP) techniques have been used to extract explicit descriptions of protein-protein interactions from text. However, few studies have addressed the detection of implicit or potential interactions.
In this study, we propose a method to detect explicit or implicit protein-protein interactions from text data. In this method, each protein is represented by a vector of weighted term frequencies computed over the combined set of documents that refer to the protein or its coding gene. The distance between every pair of vectors is then calculated as a measure of interaction, with a smaller distance indicating a more likely interaction.
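The vector representation and pairwise distance described above can be sketched as follows. This is a minimal illustration, not the method's actual implementation: the abstract does not specify the weighting scheme or the distance measure, so TF-IDF weighting and cosine distance are assumptions here, as are the function names `build_vectors` and `cosine_distance` and the whitespace tokenization.

```python
import math
from collections import Counter


def build_vectors(docs_by_protein):
    """Map each protein to a sparse weighted term vector (term -> weight).

    Assumption: TF-IDF weighting; the source only says "weighted term frequency".
    """
    # Combine all documents that mention a protein into one bag of words.
    term_counts = {
        protein: Counter(" ".join(docs).lower().split())
        for protein, docs in docs_by_protein.items()
    }
    n = len(term_counts)
    # Document frequency: number of proteins whose combined text contains the term.
    df = Counter(term for counts in term_counts.values() for term in counts)
    return {
        protein: {t: tf * math.log(n / df[t]) for t, tf in counts.items()}
        for protein, counts in term_counts.items()
    }


def cosine_distance(u, v):
    """Cosine distance between two sparse vectors (assumed distance measure)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)
```

In this sketch, two proteins whose associated abstracts share many distinctive terms receive a small distance even if no single abstract mentions both, which is the sense in which implicit interactions could be detected.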
We applied this method to detect protein-protein interactions in yeast (Saccharomyces cerevisiae). To build the protein vectors, we collected PubMed (http://www.ncbi.nlm.nih.gov/) abstracts that include gene names or aliases listed in the Saccharomyces Genome Database (SGD; http://genome-www.stanford.edu/Saccharomyces/), yielding 3,841 protein vectors. Using these vectors, we calculated the distances between all protein pairs and ranked the pairs in ascending order of distance. Putative interactions were then obtained by applying a threshold. We validated the putative interactions against 3,292 nonredundant pairs of yeast proteins extracted from the Biomolecular Interaction Network Database (BIND; http://www.bind.ca/index.phtml). As a result, 107 BIND interactions were detected among 3,000 putative interactions, whereas the existing method, which uses co-occurrences of gene names within PubMed abstracts, detected only 79 BIND interactions among 3,442 putative interactions. Furthermore, we compared the putative interactions with a subset of results from two-hybrid systems, some of which are not mentioned in any PubMed abstract, and found good agreement. This result indicates that the putative interactions may include true but experimentally unidentified interactions.
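The ranking, thresholding, and validation steps can be illustrated with a short sketch. The function names, the toy distances, and the toy reference set below are hypothetical and are not data from the study; the cutoff is applied to the distance value here, although a cutoff on rank (for example, the top 3,000 pairs) would fit the description equally well.

```python
from itertools import combinations


def rank_pairs(proteins, distance):
    """Return all unordered protein pairs as (distance, a, b), closest first."""
    scored = [(distance(a, b), a, b) for a, b in combinations(sorted(proteins), 2)]
    scored.sort()
    return scored


def putative_interactions(ranked, threshold):
    """Keep the pairs whose distance does not exceed the threshold."""
    return {frozenset((a, b)) for d, a, b in ranked if d <= threshold}


def count_confirmed(putative, reference_pairs):
    """Count putative pairs that also appear in a reference set such as BIND."""
    reference = {frozenset(pair) for pair in reference_pairs}
    return len(putative & reference)


# Toy data for illustration only (hypothetical proteins and distances).
toy_distances = {
    frozenset(("P1", "P2")): 0.2,
    frozenset(("P1", "P3")): 0.9,
    frozenset(("P2", "P3")): 0.4,
}
ranked = rank_pairs(
    ["P1", "P2", "P3"],
    lambda a, b: toy_distances[frozenset((a, b))],
)
putative = putative_interactions(ranked, threshold=0.5)
confirmed = count_confirmed(putative, [("P2", "P1"), ("P1", "P3")])
```

Storing each pair as a `frozenset` makes the comparison independent of the order in which the two proteins are listed, which matters when the reference database and the ranked list use different orderings.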