The original "Curse of Dimensionality" [Bellman] is the observation that the number of data points needed to sample (or specify) a space grows exponentially with its dimensionality. Conversely, the number of linear elements needed to separate a group of n points decreases with the dimensionality d of the space, from n-1 elements when d=1 to a single element when n=d+1. This places the variable selection dilemma on a spectrum: low dimensionality offers denser data, which tends to yield structurally complex models that may overfit, while high dimensionality implies sparser data and structurally simpler models that will underfit. The selection process thus involves weighing these structural tradeoffs when choosing variables or models.
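The n=d+1 endpoint of this spectrum can be checked numerically: augmenting d+1 points in general position in R^d with a bias coordinate yields an invertible (d+1)x(d+1) system, so a single hyperplane can realize any labeling of the points. The sketch below (an illustration of this claim only, not part of DRCA; the dimensions and labelings are arbitrary choices) solves for such a hyperplane directly.

```python
import numpy as np

# Claim: n = d + 1 points in general position in R^d can be split
# by ONE hyperplane under any assignment of +/-1 labels.
rng = np.random.default_rng(0)
d = 4
n = d + 1

X = rng.standard_normal((n, d))       # points in general position
A = np.hstack([X, np.ones((n, 1))])   # augment with a bias column

# Two arbitrary dichotomies of the n points (any labeling works).
for labels in ([1, -1, 1, -1, 1], [-1, -1, 1, 1, -1]):
    y = np.array(labels, dtype=float)
    w = np.linalg.solve(A, y)         # hyperplane meeting each target exactly
    assert np.array_equal(np.sign(A @ w), y)  # every point on its assigned side
print("one hyperplane separates each dichotomy of n = d + 1 points")
```

At the other end, d=1, the same labelings can force up to n-1 threshold elements, since each sign change along the line needs its own cut point.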
To confound the process, the structure a model imposes on data is usually hidden within its functional form. Due to parameter interdependence and the sheer number of parameters, many models are inscrutable to direct inspection, concealing the structure they impose on the data. The selection task is thus exacerbated by the limited tools available to gauge the quality and structure of models beyond raw performance scores. Decision Region Connectivity Analysis (DRCA) [Melnik] is a new method that helps fill that need. In the workshop I intend to present the core DRCA method and demonstrate how it can be used to compare models built with different sets of variables.
Melnik, O., "Decision Region Connectivity Analysis: A method for analyzing high-dimensional classifiers", Machine Learning Journal, in press. http://www.wkap.nl/kapis/internet/melnik.pdf