Statistical Characterization of Supervised Learning and Gene Selection Algorithms for Gene Expression Analysis
Eisaku Maeda1, Ichiro Takemasa2, Tomonori Izumitani, Hirotoshi Taira, Kenichi Matsubara, Morito Monden
1maeda@cslab.kecl.ntt.co.jp, NTT Communication Science Laboratories; 2alfa-t@sf6.so-net.ne.jp, Graduate School of Medicine, Osaka University
Advanced informatics for gene expression profiling is critically
important for cancer diagnosis and treatment. The supervised learning
technique, in particular, is a powerful tool for cancer-class
prediction, and a feature selection technique is needed to identify
marker genes for this prediction. Although an appropriate algorithm
for dealing with the problem should be chosen from the various
proposed supervised learning and gene selection methods, a guideline
for the choice of method has yet to be established.
We focused on histopathological phenotype prediction in colorectal
carcinoma based on the expression data of selected gene sets, and
investigated various combinations of classification and gene selection
algorithms: the k-nearest neighbor method (KNN), the linear
discriminant function, and support vector machines (SVM) as
classifiers, and the signal-to-noise ratio (S2N) based method, recursive
feature elimination (RFE), and random selection (RND) for gene
selection. In addition, we introduced a random sub-sampling technique
for the reliable evaluation of prediction accuracy.
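
As an illustration of the procedure described above, the following is a minimal sketch, not the implementation used in this study. It assumes the commonly used S2N definition, (mean of class 1 - mean of class 0) / (SD of class 1 + SD of class 0), ranks genes by the absolute score, and uses a 1-nearest-neighbor classifier. The function names, the 30% test fraction, and the number of rounds are hypothetical choices made only for this example. Gene selection is repeated inside every sub-sample so that the held-out samples never influence which genes are chosen.

```python
import random
from statistics import mean, stdev


def s2n_scores(X, y):
    """Per-gene signal-to-noise score (assumed definition):
    (mean_1 - mean_0) / (sd_1 + sd_0), with labels coded as 0 and 1."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    scores = []
    for g in range(len(X[0])):
        p = [row[g] for row in pos]
        n = [row[g] for row in neg]
        denom = stdev(p) + stdev(n)
        scores.append((mean(p) - mean(n)) / denom if denom > 0 else 0.0)
    return scores


def top_genes(X, y, k):
    """Indices of the k genes with the largest absolute S2N score."""
    scores = s2n_scores(X, y)
    order = sorted(range(len(scores)), key=lambda g: abs(scores[g]), reverse=True)
    return order[:k]


def predict_1nn(train_X, train_y, x, genes):
    """Label of the nearest training sample (squared Euclidean distance
    restricted to the selected genes)."""
    def dist(a):
        return sum((a[g] - x[g]) ** 2 for g in genes)
    nearest = min(range(len(train_X)), key=lambda i: dist(train_X[i]))
    return train_y[nearest]


def subsampling_accuracies(X, y, k, n_rounds=50, test_fraction=0.3, seed=0):
    """Repeat random train/test splits and return one accuracy per round."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    n_test = max(1, int(len(X) * test_fraction))
    accuracies = []
    for _ in range(n_rounds):
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        # The standard deviation needs at least two samples per class.
        if min(train_y.count(0), train_y.count(1)) < 2:
            continue
        genes = top_genes(train_X, train_y, k)  # selection on training data only
        hits = sum(predict_1nn(train_X, train_y, X[i], genes) == y[i] for i in test)
        accuracies.append(hits / len(test))
    return accuracies
```

Because every round yields its own accuracy, the result is a distribution rather than a single number, so both the mean and the spread of the estimated accuracy can be reported.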
The experimental results demonstrated the following: (1) the gene set
selected by RFE is completely different from that selected
by S2N, (2) the gene selection method should be chosen
according to the type of classifier employed, (3) the combination of SVM
and RFE provides better prediction performance over a wide
range of numbers of selected genes, and (4) the prediction accuracy
estimated with the sub-sampling technique is more reliable and informative
than that estimated with leave-one-out (LOO) validation.