We apologize for an experimental methodology flaw in the paper "Gene Selection for Cancer Classification". The problem mainly concerns Figure 4 (colon cancer), where several methods are compared using leave-one-out cross-validation (LOO). There, LOO was performed to assess the performance of the classifiers using a fixed set of features previously selected with the whole training data set. The result tables for both the leukemia and the colon cancer data quote both these biased LOO results and the test set results.
The proper way to conduct LOO with feature selection is to avoid using a fixed set of features selected with the whole training data set, because this induces a bias in the results. Instead, one should withhold one example, select features, and assess the performance of the classifier trained with the selected features on the left-out example. One then rotates over all the examples, recomputing the feature set and the classifier parameters each time. Note that in this way one assesses the performance of a classifier using a given number of features, not the predictive power of a given feature subset. The latter problem is better addressed using an independent test set.
Preprocessing the entire data set before conducting the experiments may also introduce bias; therefore, it is advisable to include the preprocessing step inside the LOO loop as well.
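The corrected protocol described above can be sketched as follows. This is a minimal illustration using scikit-learn with synthetic stand-in data; the linear-SVM RFE selector, the standardization step, and the choice of 8 genes are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.model_selection import LeaveOneOut
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (62 samples x 200 genes); the real colon cancer
# set has 2000 genes, but the protocol is identical.
rng = np.random.default_rng(0)
X = rng.standard_normal((62, 200))
y = rng.integers(0, 2, size=62)

n_genes = 8  # we assess a classifier using this *number* of features
errors = 0

for train_idx, test_idx in LeaveOneOut().split(X):
    # 1. Preprocess with statistics computed on the training fold only,
    #    so the held-out example never influences the preprocessing.
    scaler = StandardScaler().fit(X[train_idx])
    X_tr, X_te = scaler.transform(X[train_idx]), scaler.transform(X[test_idx])

    # 2. Recompute the feature subset on the training fold (here via
    #    linear-SVM RFE, halving the remaining genes at each step).
    rfe = RFE(SVC(kernel="linear"), n_features_to_select=n_genes, step=0.5)
    rfe.fit(X_tr, y[train_idx])

    # 3. Retrain with the selected genes and test on the single
    #    withheld example.
    clf = SVC(kernel="linear").fit(rfe.transform(X_tr), y[train_idx])
    errors += int(clf.predict(rfe.transform(X_te))[0] != y[test_idx][0])

loo_error = errors / len(X)
print(f"unbiased LOO error estimate: {loo_error:.3f}")
```

Because the data here are pure noise, the estimate should hover near chance level; the point of the sketch is only that selection and preprocessing are redone inside every fold.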
Several papers have made similar mistakes, as pointed out recently in:
C. Ambroise and G. J. McLachlan. Selection bias in gene extraction on the basis of microarray gene-expression data. PNAS, 99(10):6562-6566, 2002. http://www.pnas.org/cgi/reprint/99/10/6562.pdf
This paper provides valuable suggestions on how to conduct such experiments properly. The original paper of Golub et al. that inspired our work used a proper experimental design, made explicit in its footnote 22, which we overlooked. See T. R. Golub et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286:531-537, 1999. http://www.ib3.gmu.edu/gref/S02/csi739/golub_science1999.pdf
Other papers have benchmarked SVM RFE (and variants of the algorithm) against other methods, including:
S. Ramaswamy et al. Multiclass cancer diagnosis using tumor gene expression signatures. PNAS, 98(26):15149-15154, December 2001.
See also: http://www-genome.wi.mit.edu/mpr/publications/projects/Global_Cancer_Map/PNAS_Supplementary_Information.pdf
B. Schölkopf, I. Guyon, and J. Weston. Statistical learning and kernel methods in bioinformatics. In Proceedings of the NATO Advanced Studies Institute on Artificial Intelligence and Heuristic Methods for Bioinformatics, San Miniato, Italy, pp. 1-11, 2001. http://www.clopinet.com/isabelle/Papers/kerbioinfo.pdf
J. Weston, A. Elisseeff, B. Schölkopf, and M. Tipping. Use of the zero-norm with linear models and kernel methods. JMLR, 3(Mar):1439-1461, 2003.
A. Rakotomamonjy. Variable selection using SVM-based criteria. JMLR special issue on variable and feature selection, 3(Mar):1357-1370, 2003.
C. Furlanello, M. Serafini, S. Merler, and G. Jurman. Entropy-based gene ranking without selection bias for the predictive classification of microarray data. BMC Bioinformatics, 4:54, 2003. http://www.biomedcentral.com/1471-2105/4/54
C. Furlanello, M. Serafini, S. Merler, and G. Jurman. An accelerated procedure for recursive feature ranking on microarray data. Neural Networks, 16(5-6):641-648, 2003. http://mpa.itc.it/papers/biblio03-bib.html#merler2003anaccelerated
Y. Li, C. Campbell, and M. Tipping. Bayesian automatic relevance determination algorithms for classifying gene expression data. Bioinformatics, 18(10):1332-1339, 2002. http://bioinformatics.cs.vt.edu/~easychair/LiCampbellTipping_Bioinformatics_2002.pdf
J. Zhu and T. Hastie. Classification of gene microarrays by penalized logistic regression. Biostatistics, to appear. http://www-stat.stanford.edu/~hastie/Papers/plr.ps
P. M. Long and V. B. Vega. Boosting and microarray data. Machine Learning, 52(1):31-44, 2003. http://www1.cs.columbia.edu/~plong/publications/boosting_microarray.pdf
J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani. 1-norm support vector machines. In Proceedings NIPS 2003. http://books.nips.cc/papers/files/nips16/NIPS2003_AA07.pdf
C. Gentile. Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms. In Proceedings NIPS 2003. http://books.nips.cc/papers/files/nips16/NIPS2003_AA16.pdf
H. Fröhlich, O. Chapelle, and B. Schölkopf. Feature selection for support vector machines by means of genetic algorithms. In 15th IEEE International Conference on Tools with Artificial Intelligence, 2003. http://www-ra.informatik.uni-tuebingen.de/mitarb/froehlich/Publikationen/342_froehlich_hf.ps
K. Fujarewicz and M. Wiench. Selecting differentially expressed genes for colon tumor classification. Int. J. Appl. Math. Comput. Sci., 13(3):327-335, 2003. http://matwbn.icm.edu.pl/ksiazki/amc/amc13/amc1330.pdf
W. S. Noble. Support vector machine applications in computational biology. In Kernel Methods in Computational Biology. B. Schoelkopf, K. Tsuda and J.-P. Vert, eds. MIT Press, 2004. http://noble.gs.washington.edu/papers/noble_support.html
B. Krishnapuram, L. Carin, and A. Hartemink. Gene expression analysis: joint feature selection and classifier design. In Kernel Methods in Computational Biology, B. Schölkopf, K. Tsuda, and J.-P. Vert, eds. MIT Press, 2004.
G. Steiner, L. Suter, F. Boess, R. Gasser, M. C. de Vera, S. Albertini, and S. Ruepp. Discriminating different classes of toxicants by transcript profiling. 2004. http://www.medscape.com/viewarticle/489069_1
Rong Xiao. Boosting chain learning for object detection.
T. N. Lal, T. Hinterberger, G. Widman, M. Schröder, J. Hill, W. Rosenstiel, C. E. Elger, B. Schölkopf, and N. Birbaumer. Methods towards invasive human brain computer interfaces. In Proceedings NIPS 2004. http://www-ti.informatik.uni-tuebingen.de/%7Eschroedm/papers/NIPS2004.pdf
A. Ben-Hur and D. Brutlag. Sequence motifs: highly predictive features of protein function. In Feature Extraction, Foundations and Applications, I. Guyon et al., eds. Springer, to appear.
Textbooks teaching RFE:
Support Vector Machine in Chemistry, by Nianyi Chen.
Analyzing Microarray Gene Expression Data, by Geoffrey J. McLachlan, Kim-Anh Do, and Christophe Ambroise. Wiley Series in Probability and Statistics, Wiley-Interscience.
Software packages implementing the algorithm are also available:
Gist - a C package by William Stafford Noble and Paul Pavlidis.
PyML - a Python machine learning package by Asa Ben-Hur.
Spider - a Matlab package by Jason Weston, Andre Elisseeff, Goekhan Bakir, and Fabian Sinz.