## Using DRCA to see the effects of variable combinations on classifiers

Ofer Melnik

<melnik@dimacs.rutgers.edu>
The original "Curse of Dimensionality" [Bellman] is the observation that the number of data points needed to sample (or specify) a space is exponential in its dimensionality. Conversely, the number of linear elements needed to separate a group of n points decreases with the dimensionality of the space, from n-1 elements when d=1 to a single element when n=d+1. This places the variable selection dilemma on a spectrum: low dimensionality offers denser data, which tends to produce structurally complex models that may overfit, while high dimensionality implies sparser data and structurally simpler models that will underfit. The selection process thus involves weighing these structural tradeoffs when choosing variables or models.
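Both endpoints of this spectrum can be checked directly. The sketch below (an illustration, not part of the original abstract) first shows the exponential growth of a sampling grid with dimension, then verifies the n=d+1 case using one convenient general-position configuration, the origin plus the d standard basis vectors, for which a separating hyperplane has a closed form:

```python
from itertools import product

# Exponential sampling cost: a grid with k samples per axis needs k**d points.
k = 10
for d in (1, 2, 4, 8):
    print(f"d={d}: {k**d} grid points")

# Conversely, n = d+1 points in general position can be split by a SINGLE
# hyperplane w.x + b = 0 under every possible +/-1 labeling. Illustrative
# choice of points: the origin plus the d standard basis vectors.
d = 4
points = [[0.0] * d] + [
    [1.0 if j == i else 0.0 for j in range(d)] for i in range(d)
]

for labels in product((-1.0, 1.0), repeat=d + 1):
    # For these points, w.x_i + b = y_i is solved exactly by
    # b = y_origin and w_i = y_i - b.
    b = labels[0]
    w = [labels[i + 1] - b for i in range(d)]
    for x, y in zip(points, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        assert score * y > 0  # every point lies on its labeled side

print(f"all {2 ** (d + 1)} labelings of {d + 1} points separated in d={d}")
```

The same exhaustive check fails for n > d+1 points: some labelings (the classic XOR configuration in d=2, for instance) admit no single separating hyperplane.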

To confound the process, the structure a model imposes on data is usually hidden within its functional form. Due to interdependence and the sheer number of parameters, many models are inscrutable to direct inspection, concealing the structure they impose on the data. The selection task is thus exacerbated by having only limited tools, beyond raw performance scores, to gauge the quality and structure of models. Decision Region Connectivity Analysis (DRCA) [Melnik] is a new method to help fill that need. In the workshop I intend to present the core DRCA method and demonstrate how it can be used to compare models with different sets of variables.

Melnik, O., "Decision Region Connectivity Analysis: A method for analyzing high-dimensional classifiers", Machine Learning Journal, in press. http://www.wkap.nl/kapis/internet/melnik.pdf