Classification of Virus Risk Types Using Kernel-Based Classifiers

Je-Gun Joung (1), Sirk June Augh (2), Byoung-Tak Zhang
(1) jgjoung@bi.snu.ac.kr, Graduate Program in Bioinformatics, Seoul National University, Korea
(2) sjaugh@cbit.snu.ac.kr, Center for Bioinformation Technology, Seoul National University, Korea

Classifying virus risk types is important for understanding the mechanisms of infection and for developing novel instruments for medical examination, such as DNA chips. Human papillomaviruses (HPVs) are small DNA tumor viruses that infect epithelial tissues and induce hyperproliferative lesions. Infection by high-risk genital HPVs is associated with the development of anogenital cancers. Recently, more than 120 HPV types have been at least partly reported. String kernel-based classifiers offer a computationally efficient way to classify virus risk types. A string kernel acts as a mapping from sequences into a feature space, which makes it a powerful tool for analyzing biological sequence data because it can extract important features from the input sequences. Accuracy improves when the classifier learns from informative subsequences, which may correspond to highly conserved regions. We therefore classified HPV protein subsequences obtained by preprocessing with hidden Markov models (HMMs). We predicted the risk type of every HPV type by leave-one-out cross-validation. In the experiments, the classifier correctly predicted four HPV types of unknown risk type. An additional result shows that string kernel-based classifiers trained on the more informative subsequences outperform classifiers trained on whole sequences or on random subsequences.
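
To make the idea of a string kernel evaluated by leave-one-out cross-validation concrete, the following is a minimal sketch under several assumptions. The abstract does not name the kernel or the classifier, so the sketch assumes a k-spectrum (k-mer counting) kernel and swaps in a simple "highest mean similarity" decision rule in place of whatever kernel classifier the authors used. The HMM preprocessing step is not shown. All function names and the toy sequences are hypothetical and are not HPV data.

```python
from collections import Counter
from math import sqrt


def spectrum_features(seq, k=3):
    """Count every contiguous k-mer in seq (the k-spectrum feature map)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k=3):
    """Normalized inner product of the k-mer count vectors of a and b."""
    fa, fb = spectrum_features(a, k), spectrum_features(b, k)
    dot = sum(count * fb[kmer] for kmer, count in fa.items())
    norm = sqrt(sum(c * c for c in fa.values()) * sum(c * c for c in fb.values()))
    return dot / norm if norm else 0.0


def predict(train, query, k=3):
    """Assign the label whose training sequences are most similar on average.

    Assumption: this stands in for a kernel classifier; it is not an SVM.
    """
    scores = {}
    for seq, label in train:
        scores.setdefault(label, []).append(spectrum_kernel(seq, query, k))
    return max(scores, key=lambda lab: sum(scores[lab]) / len(scores[lab]))


def leave_one_out_accuracy(data, k=3):
    """Hold out each sequence in turn, train on the rest, and score the prediction."""
    hits = 0
    for i, (seq, label) in enumerate(data):
        train = data[:i] + data[i + 1:]
        hits += predict(train, seq, k) == label
    return hits / len(data)


# Made-up toy sequences with made-up labels, for illustration only.
toy_data = [
    ("MKLVMKLVAA", "high"),
    ("MKLVAMKLVG", "high"),
    ("GGSTGGSTPP", "low"),
    ("PGGSTGGSTQ", "low"),
]

if __name__ == "__main__":
    print(leave_one_out_accuracy(toy_data))
```

In this sketch the kernel sees only shared k-mers, so restricting the input to a conserved subsequence removes k-mers that carry no class signal. That is one plausible reading of why informative subsequences would help, not a result taken from the paper.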