Using DRCA to see the effects of variable combinations on classifiers

Ofer Melnik
<melnik@dimacs.rutgers.edu>

The original "Curse of Dimensionality" [Bellman] is the observation that the number of data points needed to sample (or specify) a space
is exponential in its dimensionality. Conversely, the number of linear elements needed to separate a group of points decreases in the
dimensionality of the space, from n-1 elements when d=1 to one element when n=d+1. This places the variable selection dilemma on a spectrum,
where low dimensionality offers denser data that is likely to give structurally complex models that may overfit, while high
dimensionality implies sparser data with structurally simpler models that will underfit. Thus the selection process involves weighing these
structural tradeoffs when selecting variables or models.
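
To make the separability claim concrete, here is a minimal numerical check (an illustrative sketch, not taken from the cited work):
with n points in general position in d = n-1 dimensions, the augmented affine system [X | 1]w = y is square and generically invertible,
so every labeling y is realized exactly by a single hyperplane. In Python, assuming only NumPy:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10                                 # number of points
    d = n - 1                              # one hyperplane suffices once n = d + 1
    X = rng.normal(size=(n, d))            # points in general position
    y = rng.choice([-1.0, 1.0], size=n)    # an arbitrary labeling

    # Augment with a constant coordinate; [X | 1] is square (n x n) and
    # generically full rank, so the hyperplane equation solves exactly.
    A = np.hstack([X, np.ones((n, 1))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)

    assert np.all(np.sign(A @ w) == y)     # every labeling is linearly separable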

To confound the process, the structure a model imposes on the data is usually hidden within its functional form. Owing to parameter
interdependence and the sheer number of parameters, many models are inscrutable to direct inspection. The selection task is thus
exacerbated by having only limited tools, beyond raw performance scores, to gauge the quality and structure of models. Decision Region
Connectivity Analysis (DRCA) [Melnik] is a new method to help fill that need. In the workshop I intend to present the core DRCA method
and demonstrate how it can be used to compare models built on different sets of variables.
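
To suggest the flavor of such an analysis, consider the intuition behind the method's name (the precise algorithm is given in the
reference below; this sketch is only a loose illustration of that intuition): one can probe whether two points that a model assigns to
the same class are joined by a straight segment along which the prediction never changes, and then group points into connected
decision-region components. Here classify stands in for any trained model's prediction function, and steps is an arbitrary
sampling resolution:

    import numpy as np

    def same_region(a, b, classify, steps=50):
        """True if the straight segment from a to b stays in one class."""
        ts = np.linspace(0.0, 1.0, steps)
        labels = {classify(a + t * (b - a)) for t in ts}
        return len(labels) == 1

    def region_components(points, classify):
        """Greedily group points whose connecting segments stay in-class.

        For brevity each point is tested against a single representative
        per component; the published method is more careful.
        """
        comps = []
        for i, p in enumerate(points):
            for comp in comps:
                q = points[comp[0]]
                if classify(p) == classify(q) and same_region(p, q, classify):
                    comp.append(i)
                    break
            else:
                comps.append([i])
        return comps

Comparing the component structure induced by two different variable subsets then offers a structural view of the models, beyond their
raw performance scores.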

Melnik, O., "Decision Region Connectivity Analysis: A method for
analyzing high-dimensional classifiers", Machine Learning Journal, in
press. http://www.wkap.nl/kapis/internet/melnik.pdf