Hands-on pattern recognition

Challenges in data representation, model selection, and performance prediction

Isabelle Guyon, Gavin Cawley, Gideon Dror, and Amir Saffari, Editors

Challenges in Machine Learning, Volume 1
Microtome, 2011

** Free PDF ** Amazon ** Barnes & Noble **

This book gathers the results of three years of effort by hundreds of researchers who participated in three competitions we organized around five datasets from various application domains. Three aspects were explored:

- Data representation.
- Model selection.
- Performance prediction.

With the proper data representation, learning becomes almost trivial. For the defenders of fully automated data processing, the search for better data representations is just part of learning. At the other end of the spectrum, domain specialists engineer data representations that are tailored to particular applications. The book discusses the results of the "Agnostic Learning vs. Prior Knowledge" challenge and includes the best papers from the IJCNN 2007 workshop on "Data Representation Discovery", where the best competitors presented their results.

Given a family of models with adjustable parameters, Machine Learning provides us with means of "learning from examples" and obtaining a good predictive model. The problem becomes more arduous when the family of models possesses so-called hyper-parameters or when it consists of heterogeneous entities (e.g. linear models, neural networks, classification and regression trees, kernel methods, etc.). Both practical and theoretical considerations may lead one to split the problem into multiple levels of inference. Typically, at the lower level, the parameters of individual models are optimized, and at the higher level the best model is selected, e.g. via cross-validation. This problem is often referred to as model selection. The results of the "Model Selection Game" are included in the book, as well as the best papers of the NIPS 2006 "Multi-level Inference" workshop.
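The two levels of inference described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used by any challenge participant: the toy data, the two model families (a constant predictor and a one-dimensional ridge regression), the candidate names and the helper functions `fit_mean`, `fit_ridge_1d`, `cv_mse` and `select_model` are all hypothetical. The lower level fits the parameters of one candidate; the higher level compares candidates by k-fold cross-validation.

```python
import random

def fit_mean(xs, ys):
    """Lower level: constant predictor (one model family)."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_ridge_1d(xs, ys, lam):
    """Lower level: closed-form fit of y ~ w * x with L2 penalty lam.
    lam is a hyper-parameter; w is the ordinary parameter."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    w = sxy / (sxx + lam)
    return lambda x: w * x

def cv_mse(fit, xs, ys, n_folds=5):
    """Mean squared error estimated by k-fold cross-validation."""
    n = len(xs)
    total = 0.0
    for f in range(n_folds):
        test_idx = set(range(f, n, n_folds))  # every n_folds-th example
        tr_x = [xs[i] for i in range(n) if i not in test_idx]
        tr_y = [ys[i] for i in range(n) if i not in test_idx]
        model = fit(tr_x, tr_y)               # lower-level inference
        total += sum((model(xs[i]) - ys[i]) ** 2 for i in test_idx)
    return total / n

def select_model(candidates, xs, ys, n_folds=5):
    """Higher level: keep the candidate with the lowest CV error."""
    scores = {name: cv_mse(fit, xs, ys, n_folds)
              for name, fit in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores

# Toy data (an assumption for illustration): y = 2x + small noise.
rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(60)]
ys = [2.0 * x + rng.gauss(0.0, 0.1) for x in xs]

# Heterogeneous candidates: different families and hyper-parameter values.
candidates = {
    "mean": fit_mean,
    "ridge(lam=0.01)": lambda x, y: fit_ridge_1d(x, y, 0.01),
    "ridge(lam=100)": lambda x, y: fit_ridge_1d(x, y, 100.0),
}

best_name, cv_scores = select_model(candidates, xs, ys)
print("selected:", best_name)
```

Treating each hyper-parameter setting as a separate candidate keeps the higher level uniform: it only needs a function that maps training data to a fitted predictor, whatever the model family.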

In most real-world situations, it is not sufficient to provide a good predictor; it is also important to assess accurately how well this predictor will perform on new unseen data. Before deploying a model in the field, one must know whether it will meet the specifications or whether one should invest more time and resources to collect additional data and/or develop more sophisticated models. The performance prediction challenge asked participants to provide prediction results on new unseen test data AND to predict how good these predictions were going to be on a test set for which they did not know the labels ahead of time. Therefore, participants had to design both a good predictive model and a good performance estimator. The results of the "Performance Prediction Challenge" and the best papers of the WCCI 2006 workshop on model selection are included in the book.
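The double task described above (predict the test labels, and guess how good those predictions will be) can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the toy one-dimensional data, the centroid-based classifier, the plain error rate as the metric and the hold-out set as the performance estimator. It does not reproduce the challenge's actual metric or scoring rule.

```python
import random

def make_data(rng, n):
    """Two overlapping 1-D Gaussian classes (toy data, not a challenge dataset)."""
    data = []
    for _ in range(n):
        label = rng.choice([-1, 1])
        data.append((rng.gauss(float(label), 1.0), label))
    return data

def fit_centroid_classifier(train):
    """Place the decision threshold halfway between the two class means."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == -1]
    threshold = 0.5 * (sum(pos) / len(pos) + sum(neg) / len(neg))
    return lambda x: 1 if x > threshold else -1

def error_rate(model, data):
    """Fraction of misclassified examples."""
    return sum(model(x) != y for x, y in data) / len(data)

rng = random.Random(1)
labeled = make_data(rng, 400)   # labels known to the participant
test = make_data(rng, 4000)     # labels hidden from the participant

# Keep part of the labeled data aside to estimate future performance.
train, holdout = labeled[:300], labeled[300:]
model = fit_centroid_classifier(train)

# What a participant would submit: predictions plus a performance guess.
test_predictions = [model(x) for x, _ in test]
guessed_error = error_rate(model, holdout)

# What the organizers can compute once the test labels are used.
actual_error = error_rate(model, test)
guess_gap = abs(guessed_error - actual_error)

print("guessed error:", guessed_error)
print("actual error: ", actual_error)
print("gap:          ", guess_gap)
```

A small gap means the performance estimator was trustworthy; a classifier with a low test error but a large gap would still be risky to deploy, because its owner could not have known in advance that it was good.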

The best papers of a special topic of JMLR on model selection, including longer contributions of the best challenge participants, are also reprinted in the book.

Audience and distribution

The book is a valuable resource for students, teachers, researchers and engineers in machine learning, data mining and statistics.

We are also making available the datasets of the challenges and sample Matlab code. The book is distributed for free in PDF format and is available in print at printing cost (USD 30) from Amazon and Barnes & Noble.

Datasets

The datasets of the challenges are freely available to download. We provide below the distribution of the last challenge (Agnostic Learning vs. Prior Knowledge). The datasets of this challenge will remain available for post-challenge submissions, but the test labels will not be revealed. In this way, we maintain an ongoing benchmark.

Name	Domain	Number of examples (training / validation / test)	Raw data (for the prior knowledge track)	Preprocessed data (for the agnostic learning track)
ADA	Marketing	4147 / 415 / 41471	14 features, comma separated format, 0.6 MB.	48 features, non-sparse format, 0.6 MB.
GINA	Handwriting recognition	3153 / 315 / 31532	784 features, non-sparse format, 7.7 MB.	970 features, non-sparse format, 19.4 MB.
HIVA	Drug discovery	3845 / 384 / 38449	Chemical structure in MDL-SD format, 30.3 MB.	1617 features, non-sparse format, 7.6 MB.
NOVA	Text classification	1754 / 175 / 17537	Text, 14 MB.	16969 features, sparse format, 2.3 MB.
SYLVA	Ecology	13086 / 1309 / 130857	108 features, non-sparse format, 6.2 MB.	216 features, non-sparse format, 15.6 MB.

The validation set labels are now available for both the agnostic learning track and the prior knowledge track (they became available to the participants midway through the challenge).
All datasets are stored in simple text formats. Sample Matlab code is available to read the data and format the results. The results must be uploaded to the challenge web site for result scoring. See the example of result archive.

Acknowledgements: We are very thankful to the institutions that originally gave the data. A report describing the datasets and giving credit to the data donors is available.

Results

The results of the competitions are available on-line:
- Performance Prediction Challenge results.
- Agnostic Learning vs. Prior Knowledge challenge and model selection game:
  - December 1st, 2006: Results of the model selection game. [Slides]
  - March 1st, 2007: Competition deadline ranking. [Slides]
    The March 1st results prompted us to extend the competition deadline because the participants in the "prior knowledge track" were still making progress, as indicated by the learning curves. At that point, the prior knowledge track obtained slightly better results than the agnostic learning track, but the differences were not very significant.
  - August 1st, 2007: Final ALvsPK challenge results.

Links
Here are the three events whose outcomes are gathered in the book. All of these events concentrated on the study of the same datasets, but emphasized different aspects of the machine learning problem:

Performance prediction:

WCCI 2006 workshop on model selection and performance prediction challenge. We organized a competition on model selection and the prediction of generalization performance. How good are you at predicting how good you are?

Model selection:

NIPS 2006 workshop on multi-level inference and model selection game. We organized a game of model selection using the same datasets as the "Performance prediction challenge" but restricting people to using models from a provided toolbox.

Preprocessing:

IJCNN 2007 Data representation discovery workshop and Agnostic learning vs. Prior knowledge challenge. “When everything fails, ask for additional domain knowledge” is the current motto of machine learning. Therefore, assessing the real added value of prior/domain knowledge is both a deep and a practical question. The participants competed in two tracks: the “prior knowledge track”, for which they had access to the raw data and information about the data, and the “agnostic learning track”, for which they had access to preprocessed data with no knowledge of the identity of the features.

Other competitions:

ChaLearn keeps organizing new competitions; check them out!

Contact information

Coordinator:
Isabelle Guyon
ChaLearn
955, Creston Road,
Berkeley, CA 94708, U.S.A.
Tel/Fax: (510) 524 6211

Co-editors:
Amir Reza Saffari Azar (Graz University of Technology), Gideon Dror (Academic College of Tel-Aviv-Yaffo), Gavin Cawley (University of East Anglia, UK).

This material is based upon work supported by the National Science Foundation under Grant No. ECCS-0424142 and Grant No. ECCS-0736687. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.