Radford M. Neal and Jianguo Zhang
radford@cs.utoronto.ca
jianguo@utstat.utoronto.ca
University of Toronto
We describe the methods we used for the high-dimensional classification
problems in the NIPS feature selection contest. As a preliminary
step, we either looked at a moderate number of principal components from
the data, or we looked at a moderate number of features selected using
simple significance tests. This reduces the computational requirements
to a manageable level. We then applied two very different approaches
to classification. Neural networks (multi-layer perceptrons) are
supervised methods, which do not model the distribution of training cases.
With Bayesian methods, these networks can be made quite complex without
fear of overfitting the training data. An Automatic Relevance Determination
(ARD) prior can also be used to allow the model to discover that some inputs
(features or principal components) are much more relevant than others.
The results of runs using ARD were sometimes used to select a further reduced
set of features. Dirichlet diffusion trees are a Bayesian method
for modeling a distribution, based on hierarchical clustering. We used
Dirichlet diffusion trees to model the input distribution for all cases
(training, validation, and test), ignoring class labels. Hyperparameters
allowed the model to learn which inputs relate to others, and which are
(possibly) irrelevant "noise". The result is a sample of trees that
cluster the cases. We then used distances in these trees to classify
cases by simple neighbor-based methods. Based on results on the validation
set, we chose the Dirichlet diffusion tree method for the Arcene data and
the Bayesian neural network method for Gisette, Dexter, and Dorothea.
For Madelon, where the two methods did about equally well, we averaged
the probabilities produced by the two methods. The results were very
good, but these methods look at most or all of the features for most of the
data sets, due to the use of principal components. Results were not as
good using Bayesian neural networks with a much smaller fraction of the
features (4.74% on average), but this approach might be useful if reducing
the number of features used is important.
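
As a rough illustration of two steps mentioned above, the following sketch shows feature selection by a simple significance test and the averaging of class probabilities from two methods. It is not the procedure used for the contest entries: the abstract does not say which significance tests were applied, so the Welch two-sample t statistic here is an assumption, and the function names (t_statistic, select_features, average_probabilities) are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(a, b):
    """Welch two-sample t statistic for two lists of feature values.

    Assumption: the source only says "simple significance tests";
    the Welch t statistic is one possible choice.
    """
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return 0.0 if se == 0 else (mean(a) - mean(b)) / se


def select_features(X, y, k):
    """Return the (sorted) indices of the k features with the largest |t|.

    X is a list of rows (one per training case), y a list of 0/1 labels.
    Each class needs at least two cases so the sample variance is defined.
    """
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append((abs(t_statistic(pos, neg)), j))
    top = sorted(scores, reverse=True)[:k]
    return sorted(j for _, j in top)


def average_probabilities(p_a, p_b):
    """Average the class-1 probabilities produced by two methods, case by case."""
    return [(a + b) / 2 for a, b in zip(p_a, p_b)]


# Toy data: feature 0 separates the classes, features 1 and 2 do not.
X = [[1.0, 5.0, 0.10], [1.2, 5.1, 0.90], [0.9, 4.9, 0.20],
     [3.0, 5.0, 0.80], [3.1, 5.2, 0.15], [2.9, 4.8, 0.85]]
y = [0, 0, 0, 1, 1, 1]

selected = select_features(X, y, k=1)
combined = average_probabilities([0.2, 0.9], [0.4, 0.7])
```

In this toy example only the first feature differs clearly between the two classes, so it is the one retained when k=1. Averaging probabilities, rather than hard predictions, keeps each method's uncertainty in the combined output.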