Approaches for Predicting Protein-Protein Interaction Residues from Amino Acid Sequences
Changhui Yan1, Vasant Honavar2, Drena Dobbs
1chhyan@iastate.edu, Iowa State University, IA, USA; 2honavar@cs.iastate.edu, Iowa State University, IA, USA
The ability to identify protein-protein interaction sites and to detect specific amino acid residues that contribute to the specificity and affinity of protein interactions has important implications for problems ranging from rational drug design to analysis of metabolic and signal transduction networks. Because the number of experimentally determined structures for protein-protein complexes is small, computational methods for identifying amino acids that participate in protein-protein interactions are becoming increasingly important.
We have used support vector machines (SVM) and Naive Bayes classifiers to classify protein surface residues as interface or non-interface residues. Our approach relies on information about the sequence neighbors of target residues (i.e., neighboring residues in the primary amino acid sequence, not spatial neighbors). In the experiments reported here, structural information was used only to identify surface residues, which represent, on average, 55% of the total residues in the proteins examined. We used a nonredundant data set of 77 proteins from heterocomplexes in the PDB. In this data set, each pair of sequences has sequence identity of less than 30%. The data set includes 15 proteins from protease-inhibitor complexes, 12 proteins from antibody-antigen complexes, 13 proteins from enzyme complexes and 37 proteins from other complexes. The performance of the resulting classifiers was estimated using 77 leave-one-out cross-validation experiments. Several performance measures, including accuracy, sensitivity, specificity, and correlation coefficient, were computed and averaged across the experiments. The SVM classifiers achieve a correlation coefficient of 0.20, which is much better than random guessing (a correlation coefficient of 0), with accuracy of 64%, sensitivity of 64% and specificity of 65%. The corresponding measures for the Naive Bayes classifier were a correlation coefficient of 0.20, accuracy of 65%, sensitivity of 65% and specificity of 65%. In comparison, the method of Gallet et al. (Gallet et al. (2000) J. Mol. Biol. 302, 917-926), which predicts interface residues by assessing hydrophobic moment, yields a correlation coefficient of -0.02, accuracy of 51%, sensitivity of 51% and specificity of 56%.
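As an illustration of the four performance measures named above, the following minimal sketch computes them from the counts of a binary confusion matrix. The abstract does not state the exact definitions used, so the common ones are assumed here: sensitivity as TP/(TP+FN), specificity as TN/(TN+FP), and the correlation coefficient as the Matthews correlation coefficient. The function name `binary_metrics` is hypothetical.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return accuracy, sensitivity, specificity and correlation coefficient.

    Assumed definitions (not stated in the abstract):
      sensitivity = TP / (TP + FN)
      specificity = TN / (TN + FP)
      correlation = Matthews correlation coefficient
    Interface residues are treated as the positive class.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # The correlation coefficient is undefined when any marginal is zero;
    # report 0.0 in that case, which matches the random-guess baseline.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    correlation = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "correlation": correlation,
    }


if __name__ == "__main__":
    # Made-up counts for illustration only; not data from the study.
    print(binary_metrics(tp=30, fp=20, tn=40, fn=10))
```

In a leave-one-out setting such as the one described, these values would be computed once per held-out protein and then averaged.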
Each input for the SVM, Naive Bayes and Gallet’s method is a window of 11 amino acid residues centered on the target residue in the protein sequence. The only difference is that in the input for the SVM each residue is encoded by a vector of 20 elements obtained from HSSP, whereas in the input for Naive Bayes and Gallet’s method each residue is represented by its amino acid identity.
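The windowing scheme described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how windows near the sequence termini are handled, so padding with a placeholder character is assumed, and a simple one-hot encoding of amino acid identity stands in for the 20-element HSSP vectors, which are not reproduced here. The names `residue_windows` and `one_hot` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def residue_windows(sequence, size=11, pad="-"):
    """Return one window of `size` residues centered on each position.

    Assumption: positions beyond the termini are filled with `pad`.
    """
    if size % 2 == 0:
        raise ValueError("window size must be odd so it has a center")
    half = size // 2
    padded = pad * half + sequence + pad * half
    return [padded[i:i + size] for i in range(len(sequence))]


def one_hot(window):
    """Encode a window as a flat vector with 20 elements per residue.

    This is a stand-in for a profile-based encoding: each residue sets a
    single 1.0 at its own index; padding characters map to all zeros.
    """
    vector = []
    for residue in window:
        vector.extend(1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS)
    return vector


if __name__ == "__main__":
    seq = "ACDEFGHIKLMNP"  # toy sequence, not taken from the data set
    for center, window in zip(seq, residue_windows(seq)):
        print(center, window, len(one_hot(window)))
```

With a window of 11 residues and 20 elements per residue, each SVM-style input vector in this sketch has 220 elements, while the identity-based representation is simply the window string itself.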
When the experiments were repeated on a restricted data set of 15 proteins from protease-inhibitor complexes, the resulting SVM achieved a correlation coefficient of 0.35, accuracy of 74%, sensitivity of 74% and specificity of 72%. The corresponding figures for Naive Bayes were a correlation coefficient of 0.26, accuracy of 70%, sensitivity of 70% and specificity of 68%; for Gallet’s method they were a correlation coefficient of -0.08, accuracy of 51%, sensitivity of 57% and specificity of 59%.
The success of the SVM and Naive Bayes classifiers indicates that the characteristics of the sequence neighbors of a target residue are predictive of functional properties of that residue. The SVM and Naive Bayes methods are able to discover and use such features to identify interface residues. The effectiveness of these methods is confirmed by examination of the predictions in the context of three-dimensional structures in several cases. The fact that SVM and Naive Bayes classifiers constructed from a restricted data set of proteins drawn from protease-inhibitor complexes perform better than classifiers generated from a data set drawn from a more diverse set of protein complexes suggests that different types of protein complexes (protease-inhibitor, antibody-antigen, etc.) may differ in the sequence features that are predictive of protein-protein interaction. In this study, structural information was used to identify surface residues. Our ongoing work is directed at predicting interface residues without relying on structural information to identify surface residues.