Book in preparation:

Hands-on pattern recognition
challenges in data representation, model selection, and performance prediction    
Isabelle Guyon, Gavin Cawley, Gideon Dror, and Amir Saffari, Editors

** Book draft **

Background

Recently organized competitions have been instrumental in pushing the state of the art in machine learning, establishing benchmarks to fairly evaluate methods, and identifying techniques that really work.
This book under preparation will harvest three years of effort by hundreds of researchers who participated in three competitions we organized around five datasets from various application domains. Three aspects were explored:

- Data representation.
- Model selection.
- Performance prediction.

With the proper data representation, learning becomes almost trivial. For the defenders of fully automated data processing, the search for better data representations is just part of learning. At the other end of the spectrum, domain specialists engineer data representations that are tailored to particular applications. The book will discuss the results of the "Agnostic Learning vs. Prior Knowledge" challenge and include the best papers from the IJCNN 2007 workshop on "Data Representation Discovery", where the best competitors presented their results.

Given a family of models with adjustable parameters, Machine Learning provides us with means of "learning from examples" and obtaining a good predictive model. The problem becomes more arduous when the family of models possesses so-called hyper-parameters or when it consists of heterogeneous entities (e.g. linear models, neural networks, classification and regression trees, kernel methods, etc.). Both practical and theoretical considerations may lead us to split the problem into multiple levels of inference. Typically, at the lower level, the parameters of individual models are optimized, and at the upper level the best model is selected, e.g. via cross-validation. This problem is often referred to as model selection. The results of the "Model Selection Game" will be included in the book as well as the best papers of the NIPS 2006 "Multi-level Inference" workshop.
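The two levels of inference described above can be sketched as follows. This is only an illustration under stated assumptions, not the method of any challenge entrant: the model is a one-dimensional ridge regression y ~ w * x, the hyper-parameter is the regularization constant lam, the data are synthetic, and all function names (fit_weight, cv_error, select_lam) are hypothetical.

```python
import random

def fit_weight(xs, ys, lam):
    """Lower level: closed-form ridge fit of y ~ w * x for a fixed lam."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def cv_error(xs, ys, lam, n_folds=5):
    """Mean squared validation error, averaged over n_folds folds."""
    m = len(xs)
    fold_errors = []
    for fold in range(n_folds):
        # Example k goes to the validation fold when k % n_folds == fold.
        val = [k for k in range(m) if k % n_folds == fold]
        train = [k for k in range(m) if k % n_folds != fold]
        w = fit_weight([xs[k] for k in train], [ys[k] for k in train], lam)
        fold_errors.append(sum((w * xs[k] - ys[k]) ** 2 for k in val) / len(val))
    return sum(fold_errors) / n_folds

def select_lam(xs, ys, candidates):
    """Upper level: keep the hyper-parameter with the lowest CV error."""
    scores = {lam: cv_error(xs, ys, lam) for lam in candidates}
    return min(scores, key=scores.get), scores

# Synthetic data (an assumption made for the sake of the example).
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(40)]
ys = [2.0 * x + rng.gauss(0, 0.3) for x in xs]

candidates = [0.0, 0.1, 1.0, 10.0]
best_lam, scores = select_lam(xs, ys, candidates)
w_final = fit_weight(xs, ys, best_lam)  # refit on all data with the chosen lam
```

The same structure applies when the candidates are heterogeneous models rather than values of a single hyper-parameter: only the lower-level fitting routine changes.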

In most real-world situations, it is not sufficient to provide a good predictor; it is also important to assess accurately how well this predictor will perform on new unseen data. Before deploying a model in the field, one must know whether it will meet the specifications or whether one should invest more time and resources to collect additional data and/or develop more sophisticated models. The performance prediction challenge asked participants to provide prediction results on new unseen test data AND to predict how good these predictions were going to be on a test set for which they did not know the labels ahead of time. Participants therefore had to design both a good predictive model and a good performance estimator. The results of the "Performance Prediction Challenge" and the best papers of the WCCI 2006 workshop on model selection will be included in the book.
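The sketch below illustrates the idea of predicting one's own performance. It is not the scoring rule of the challenge: it assumes a toy one-dimensional two-class problem, uses a held-out validation set as the performance estimator, and measures the gap between the predicted and the actual test error. The names make_data, fit_threshold, error_rate and guess_gap are hypothetical.

```python
import random

def make_data(m, rng):
    """Two overlapping 1-D Gaussian classes with labels in {-1, +1}."""
    data = []
    for _ in range(m):
        y = rng.choice([-1, 1])
        data.append((rng.gauss(y, 1.0), y))
    return data

def fit_threshold(data):
    """Place the decision threshold midway between the two class means."""
    pos = [x for x, y in data if y == 1]
    neg = [x for x, y in data if y == -1]
    return 0.5 * (sum(pos) / len(pos) + sum(neg) / len(neg))

def error_rate(threshold, data):
    """Fraction of examples misclassified by the rule: +1 if x > threshold."""
    wrong = sum(1 for x, y in data if (1 if x > threshold else -1) != y)
    return wrong / len(data)

rng = random.Random(0)
train = make_data(200, rng)
valid = make_data(100, rng)
test = make_data(2000, rng)   # labels would be hidden from a participant

threshold = fit_threshold(train)
predicted_error = error_rate(threshold, valid)  # the performance guess
actual_error = error_rate(threshold, test)      # known only to the organizers
guess_gap = abs(predicted_error - actual_error)
```

A small guess_gap means the estimator is trustworthy; a model with a low test error but a large gap would still be risky to deploy, which is the point made in the paragraph above.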

The best papers of a special topic of JMLR on model selection, including longer contributions of the best challenge participants, will also be reprinted in the book.

Audience and distribution

The book is going to be a valuable resource for students, teachers, researchers and engineers in machine learning, data mining and statistics.

The book will include a CD with the datasets of the challenge and sample Matlab code. It will be distributed for free in PDF format and printed on demand by Lulu for an estimated price of $30.

Contents

see book draft

Datasets

The datasets of the challenges are freely available to download. We provide below the distribution of the last challenge (Agnostic Learning vs. Prior Knowledge). The dataset of the challenge will remain available for post-challenge submissions, but the test labels will not be revealed. In this way, we keep an on-going benchmark.

| Name | Domain | Num. ex. (tr/val/te) | Raw data (for the prior knowledge track) | Preprocessed data (for the agnostic learning track) |
|------|--------|----------------------|------------------------------------------|-----------------------------------------------------|
| ADA | Marketing | 4147 / 415 / 41471 | 14 features, comma separated format, 0.6 MB. | 48 features, non-sparse format, 0.6 MB. |
| GINA | Handwriting recognition | 3153 / 315 / 31532 | 784 features, non-sparse format, 7.7 MB. | 970 features, non-sparse format, 19.4 MB. |
| HIVA | Drug discovery | 3845 / 384 / 38449 | Chemical structure in MDL-SD format, 30.3 MB. | 1617 features, non-sparse format, 7.6 MB. |
| NOVA | Text classif. | 1754 / 175 / 17537 | Text. 14 MB. | 16969 features, sparse format, 2.3 MB. |
| SYLVA | Ecology | 13086 / 1309 / 130857 | 108 features, non-sparse format, 6.2 MB. | 216 features, non-sparse format, 15.6 MB. |

The validation set labels are now available for both the agnostic learning track and the prior knowledge track (they became available to the participants midway through the challenge).
All datasets are stored in simple text formats. Sample Matlab code is available to read the data and format the results. The results must be uploaded to the challenge web site for result scoring. See the example of result archive.

Acknowledgements:
We are very thankful to the institutions that originally provided the data. A report describing the datasets and giving credit to the data donors is available.

Results

The results of the competitions are available on-line:
- Performance Prediction Challenge results.
- Agnostic Learning vs. Prior Knowledge challenge and model selection game:
  - December 1st, 2006: Results of the model selection game. [Slides]
  - March 1st, 2007: Competition deadline ranking. [Slides]
    The March 1st results prompted us to extend the competition deadline because the participants in the "prior knowledge track" were still making progress, as indicated by the learning curves. At that date, the prior knowledge track obtained slightly better results than the agnostic learning track, but the differences were not very significant.
  - August 1st, 2007: Final ALvsPK challenge results.

Guidelines
- For unity of presentation, the JMLR paper formatting should be used, even for papers that partly reuse IJCNN proceedings material; see jmlr.org.
- Chapters should be between 8 and 20 pages long (with exceptions for already accepted JMLR papers).
- As far as possible, standard notations should be used:


X a sample of input patterns (use capital italic style to designate sets)
F feature space
Y a sample of output labels
ln logarithm to base e
log2 logarithm to base 2
x.x' inner product between vectors x and x'
||.|| Euclidean norm
n number of input variables
N number of features (if different from number of input variables)
m number of training examples
x_k input vector, k=1...m
f_k feature vector, k=1...m
x_k,i input vector elements, i=1...n
f_k,i feature vector elements, i=1...N
y_i target values, or (in pattern recognition) classes
w input weight vector or feature weight vector
w_i weight vector elements, i=1...n or i=1...N
b constant offset (or threshold)
h VC dimension
F a concept space
f(.) a concept or target function
G a predictor space
g(.) a predictor function (real valued or with values in {-1,1} for classification)
s(.) a non-linear squashing function (e.g. sigmoid)
ρ_f(x, y) margin function equal to y f(x)
l(x; y; f(x)) loss function
R(g) risk of g, i.e. expected fraction of errors
R_emp(g) empirical risk of g, i.e. fraction of training errors
R(f) risk of f
R_emp(f) empirical risk of f
k(x, x') Mercer kernel function (real valued)
A a matrix (use capital letters for matrices)
K matrix of kernel function values
α_k Lagrange multiplier or pattern weights, k=1...m
α vector of all Lagrange multipliers
ξ_i slack variables
ξ vector of all slack variables
C regularization constant for SV Machines
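
As an illustration of how a few of these notations fit together, the following sketch computes the margin y f(x) and the empirical risk (fraction of training errors) for a linear function f(x) = w.x + b. The toy data, the weight values and the convention of counting a non-positive margin as an error are assumptions made for the example.

```python
def f(x, w, b):
    """Linear function f(x) = w.x + b."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + b

def margin(x, y, w, b):
    """Margin function, equal to y f(x); positive when x is correctly classified."""
    return y * f(x, w, b)

def empirical_risk(X, Y, w, b):
    """Fraction of training examples with a non-positive margin."""
    errors = sum(1 for x, y in zip(X, Y) if margin(x, y, w, b) <= 0)
    return errors / len(X)

# m = 4 training examples with n = 2 input variables, labels in {-1, 1}.
X = [[1.0, 2.0], [2.0, 0.5], [-1.0, -1.5], [0.5, 1.0]]
Y = [1, 1, -1, -1]
w, b = [1.0, 1.0], -0.5

margins = [margin(x, y, w, b) for x, y in zip(X, Y)]  # [2.5, 2.0, 3.0, -1.0]
risk = empirical_risk(X, Y, w, b)                      # 0.25
```

Only the last example has a negative margin, so one of the four training examples is misclassified.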

Schedule

August 7, 2007: Deadline for the receipt of the ALvsPK challenge fact sheets.
August 31, 2007: Deadline for submission of revised versions of JMLR papers on model selection.
September 30, 2007: Deadline for all book chapters, including:
- New contributed chapters, including the results of the last ranking (Aug 1st, 2007).
- Revised versions of the IJCNN 2007 papers, including the results of the last ranking (Aug 1st, 2007). Less than half of the IEEE copyrighted material of each IJCNN proceedings paper should be re-used and the title should be changed to reflect the novel material. This is a requirement of the IEEE to be allowed to distribute the new version for free.
- Selected contributions from the WCCI 2006 workshop proceedings papers, using the same guidelines (re-use less than half of the original IEEE copyrighted material and change the title).
December, 2007: Scheduled publication.

Links
Here are the three events whose outcomes will be gathered in the book. All of these events concentrated on the study of the same datasets, but emphasized different aspects of the machine learning problem:

Performance prediction:

WCCI 2006 workshop on model selection and performance prediction challenge. We organized a competition on model selection and the prediction of generalization performance. How good are you at predicting how good you are?

Model selection:

NIPS 2006 workshop on multi-level inference and model selection game. We organized a game of model selection using the same datasets as the "Performance prediction challenge" but restricting people to using models from a provided toolbox.

Preprocessing:

IJCNN 2007 Data representation discovery workshop and Agnostic learning vs. Prior knowledge challenge. “When everything fails, ask for additional domain knowledge” is the current motto of machine learning. Therefore, assessing the real added value of prior/domain knowledge is both a deep and a practical question. The participants competed in two tracks: the “prior knowledge track”, for which they had access to the raw data and information about the data, and the “agnostic learning track”, for which they had access to preprocessed data with no knowledge of the identity of the features.

Contact information

Coordinator:
Isabelle Guyon
Clopinet Enterprises
955, Creston Road,
Berkeley, CA 94708, U.S.A.
Tel/Fax: (510) 524 6211

Co-editors:
Amir Reza Saffari Azar (Graz University of Technology), Gideon Dror (Academic College of Tel-Aviv-Yaffo), Gavin Cawley (University of East Anglia, UK).

This material is based upon work supported by the National Science Foundation under Grant No. ECCS-0424142 and Grant No. ECCS-0736687. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.