Statistical Characterization of Supervised Learning and Gene Selection Algorithms for Gene Expression Analysis

Eisaku Maeda (1), Ichiro Takemasa (2), Tomonori Izumitani, Hirotoshi Taira, Kenichi Matsubara, Morito Monden
(1) maeda@cslab.kecl.ntt.co.jp, NTT Communication Science Laboratories; (2) alfa-t@sf6.so-net.ne.jp, Graduate School of Medicine, Osaka University

Advanced informatics for gene expression profiling is critically important for cancer diagnosis and treatment. Supervised learning, in particular, is a powerful tool for cancer-class prediction, and feature selection is needed to identify the marker genes used for this prediction. Although many supervised learning and gene selection methods have been proposed, no guideline has yet been established for choosing among them. We focused on predicting the histopathological phenotype of colorectal carcinoma from the expression data of selected gene sets, and investigated various combinations of classification and gene selection algorithms: the k-nearest neighbor method (KNN), the linear discriminant function, and support vector machines (SVM) as classifiers, and the signal-to-noise ratio (S2N) based method, recursive feature elimination (RFE), and random selection (RND) for gene selection. In addition, we introduced a random sub-sampling technique for the reliable evaluation of prediction accuracy. The experimental results demonstrated the following: (1) the gene set selected by RFE algorithms is completely different from that selected by S2N, (2) the appropriate gene selection method depends on the type of classifier employed, (3) the combination of SVM and RFE provides better prediction performance over a wide range of numbers of selected genes, and (4) the prediction accuracy estimated with the sub-sampling technique is more reliable and informative than that estimated with leave-one-out (LOO) validation.
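As a rough illustration of how S2N gene selection, a KNN classifier, and random sub-sampling fit together, the following is a minimal sketch and not the authors' implementation. It assumes the commonly used S2N score (mu1 - mu0) / (sd1 + sd0) per gene with population standard deviations, a two-class problem with labels 0 and 1, and an odd k for the majority vote. The function names, the 25% test fraction, and the number of rounds are hypothetical choices made for this example; the abstract does not specify them.

```python
import numpy as np

def s2n_scores(X, y):
    """Per-gene signal-to-noise score (assumed form: (mu1 - mu0) / (sd1 + sd0)).

    X has shape (samples, genes); y holds class labels 0 or 1.
    """
    X0, X1 = X[y == 0], X[y == 1]
    eps = 1e-12  # guards against division by zero for constant genes
    diff = X1.mean(axis=0) - X0.mean(axis=0)
    return diff / (X1.std(axis=0) + X0.std(axis=0) + eps)

def select_top_genes(X, y, n_genes):
    """Indices of the n_genes with the largest absolute S2N score."""
    return np.argsort(-np.abs(s2n_scores(X, y)))[:n_genes]

def knn_predict(X_train, y_train, X_test, k=3):
    """k-nearest-neighbor majority vote with Euclidean distance (odd k)."""
    dist = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]
    return (y_train[nearest].mean(axis=1) > 0.5).astype(int)

def subsampling_accuracy(X, y, n_genes, n_rounds=100, test_frac=0.25, seed=0):
    """Accuracy over repeated random train/test splits.

    Returns one accuracy per round, so the spread of the estimate is
    visible and not only its mean. Assumes every training split still
    contains samples of both classes.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = max(1, int(round(test_frac * n)))
    accuracies = []
    for _ in range(n_rounds):
        perm = rng.permutation(n)
        test, train = perm[:n_test], perm[n_test:]
        # Genes are selected on the training part only, so the held-out
        # samples never influence which genes are chosen.
        genes = select_top_genes(X[train], y[train], n_genes)
        pred = knn_predict(X[train][:, genes], y[train], X[test][:, genes])
        accuracies.append(np.mean(pred == y[test]))
    return np.array(accuracies)
```

Calling subsampling_accuracy(X, y, n_genes=10) on an expression matrix X and label vector y returns an array of per-round accuracies whose mean and standard deviation can be reported together. A LOO estimate yields a single number from n nearly identical training sets, whereas the distribution obtained here shows how much the accuracy varies with the choice of training samples.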