Regression analysis in optimal gene selection for DNA microarray analysis

Si-Ho Yoo (1), Sung-Bae Cho (2)
(1) bonanza@candy.yonsei.ac.kr, Yonsei University; (2) sbcho@cs.yonsei.ac.kr, Yonsei University

Not all genes are needed to classify cancer, so it is important to select the informative genes that are related to it. Many gene selection methods have been studied, but it is hard to identify a perfect one. The purpose of this study is to investigate how to find the optimal set of informative genes for classifying gene expression profiles. We apply regression analysis to gene selection in order to find informative genes for predicting cancer. Unlike previous work on gene selection, the forward selection method in regression analysis accounts for the relations among the selected genes, which minimizes redundant information about the cancer; reducing this redundancy helps classification. We propose a new gene selection algorithm based on the forward selection method. The algorithm selects genes by their ability to explain the target (cancer). Although each individually selected gene has the power to explain the cancer, a set of such genes may carry redundant information. The proposed method therefore selects, at each step, the gene that best explains the part of the target left unexplained by the previously selected genes. The fitness of each candidate gene is tested by its F-value in the regression model, and the resulting confidence level determines whether the gene is selected. In the experiments, we use the colon dataset to demonstrate the usefulness of the proposed method and compare the results with gene selection based on Pearson’s and Spearman’s correlation coefficients. With the proposed method, we select the most informative genes and measure their sensitivity, specificity, and recognition rate. In these experiments, gene selection with forward selection predicts cancer more accurately than the other gene selection methods.
The proposed method achieves a sensitivity of 95.0%, a specificity of 82.0%, and a recognition rate of 90.3% for colon cancer classification.
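
The forward selection idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes an ordinary least-squares model with an intercept, a partial F-test for each candidate gene, and a fixed entry threshold in place of a confidence level looked up from the F distribution. The names forward_select, f_enter, and max_genes are hypothetical, and the default threshold of 4.0 is an arbitrary illustrative value.

```python
"""Forward gene selection with a partial F-test (illustrative sketch only)."""


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def forward_select(samples, target, f_enter=4.0, max_genes=None):
    """Return indices of selected genes, in the order they were added.

    samples: list of rows, one row of expression values per tissue sample.
    target:  one numeric label per sample (e.g. 0 = normal, 1 = tumor).
    f_enter: hypothetical fixed F threshold a gene must reach to enter.
    """
    n = len(samples)
    n_genes = len(samples[0])
    # Centering the data accounts for the intercept of the regression model.
    columns = [_center([row[j] for row in samples]) for j in range(n_genes)]
    residual = _center(target)
    rss = _dot(residual, residual)  # residual sum of squares so far
    selected = []
    limit = n_genes if max_genes is None else min(max_genes, n_genes)

    while len(selected) < limit:
        # Residual degrees of freedom after adding one more gene.
        df = n - len(selected) - 2
        if df <= 0 or rss <= 1e-12:
            break
        best_f, best_gene = None, None
        for j in range(n_genes):
            if j in selected:
                continue
            norm2 = _dot(columns[j], columns[j])
            if norm2 < 1e-12:  # gene is fully redundant with selected genes
                continue
            proj = _dot(columns[j], residual)
            gain = proj * proj / norm2  # drop in RSS if gene j is added
            rss_new = max(rss - gain, 0.0)
            f_value = float("inf") if rss_new == 0.0 else gain / (rss_new / df)
            if best_f is None or f_value > best_f:
                best_f, best_gene = f_value, j
        if best_gene is None or best_f < f_enter:
            break
        # Remove the chosen gene's contribution from the target residual and
        # from the remaining genes, so that later steps only measure what
        # the selected genes have not already explained.
        pivot = columns[best_gene]
        norm2 = _dot(pivot, pivot)
        coef = _dot(pivot, residual) / norm2
        residual = [r - coef * p for r, p in zip(residual, pivot)]
        for k in range(n_genes):
            if k != best_gene and k not in selected:
                c = _dot(pivot, columns[k]) / norm2
                columns[k] = [x - c * p for x, p in zip(columns[k], pivot)]
        rss = _dot(residual, residual)
        selected.append(best_gene)
    return selected
```

Because each remaining gene is orthogonalized against the genes already chosen, a gene that correlates strongly with the target on its own but overlaps with an earlier selection receives a low F-value at later steps. A ranking based only on each gene's individual correlation coefficient cannot detect this kind of redundancy.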