Isabelle Guyon
Audience

This tutorial will be accessible to a broad audience, including practitioners, researchers, and students who want to catch up with the latest developments in feature selection. Some previous exposure to machine learning/data mining problems is preferable, but not necessary.

Outline and slides

Lecture 1 (45 min.): Introduction to feature selection: filters and wrappers

Feature selection
is an essential step in machine learning, performed either
separately or jointly with the learning process. We will present
some applications that pose challenges to feature selection
because of the high dimensionality of input space (text categorization,
drug screening, gene selection). We will also give a quick review of machine learning vocabulary (supervised/unsupervised learning, risk minimization, overfitting, maximum likelihood, etc.). Correlation methods and other methods for ranking features in order of relevance will be presented.
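As a concrete illustration of the filter idea, the following sketch (our own toy example, not a specific algorithm from the tutorial; the function name and data are invented) ranks features by their absolute Pearson correlation with the target:

    # Minimal sketch of a correlation-based filter (illustrative only).
    import numpy as np

    def rank_features_by_correlation(X, y):
        """Rank features by absolute Pearson correlation with the target."""
        Xc = X - X.mean(axis=0)
        yc = y - y.mean()
        # Correlation of each column of X with y.
        denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
        scores = np.abs(Xc.T @ yc) / denom
        return np.argsort(scores)[::-1], scores  # best features first

    # Toy data: 100 examples, 5 features; only feature 0 is informative.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)
    order, scores = rank_features_by_correlation(X, y)
    print(order)  # feature 0 should come first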
Such methods will be contrasted with wrapper methods, which use a learning machine
to assess the usefulness of subsets of features. A search method is used
to explore feature space since for a large number of features it is infeasible
to try all possible feature subsets. We will review search strategies,
including forward and backward selection, floating search, simulated annealing, and genetic algorithms. One difficult problem with wrappers is avoiding overfitting: by searching feature space intensively, it is easy to find a subset of features that performs well on a limited number of examples by chance alone. We will compare the search methods in terms of their relative computational and statistical complexity to determine which ones are the most promising.
The class will be complemented by a review of cross-validation techniques, which are necessary to master when using wrapper methods.
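The following sketch (again our own illustration, assuming scikit-learn is available; names are hypothetical) combines these ingredients: a greedy forward-selection wrapper that scores each candidate feature subset by k-fold cross-validation:

    # Minimal sketch of a wrapper: greedy forward selection scored by
    # k-fold cross-validation (assumes scikit-learn is available).
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def forward_selection(X, y, n_select, cv=5):
        selected, remaining = [], list(range(X.shape[1]))
        while len(selected) < n_select:
            # Try adding each remaining feature; keep the best one.
            def cv_score(j):
                cols = selected + [j]
                return cross_val_score(LogisticRegression(max_iter=1000),
                                       X[:, cols], y, cv=cv).mean()
            best = max(remaining, key=cv_score)
            selected.append(best)
            remaining.remove(best)
        return selected

    X, y = make_classification(n_samples=200, n_features=20,
                               n_informative=3, random_state=0)
    print(forward_selection(X, y, n_select=3))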
Lecture 2 (45 min.): Embedded methods of feature selection: Learning theory put to work to build feature selection algorithms

Ockham's
Razor is the principle proposed by William of Ockham in the fourteenth
century: "Pluralitas non est ponenda sine neccesitate'', which
translates as ``entities should not be multiplied unnecessarily''.
This principle provides guidance in modeling: of two theories providing
similarly good predictions, prefer the simplest one. Hence, according
to Ockham, we should shave off unnecessary parameters of our models.
In the brain, information is stored in billions of synapses connecting
the brain cells, or neurons. Young children grow synapses in excess. Exposure to enriched environments with extra sensory and social stimulation enhances the connectivity of the synapses, but children and adolescents also lose them, at a rate of up to 20 million synapses per day. This synaptic pruning is believed to play an important role in learning.
In this lecture, we will justify theoretically the role of feature selection in learning. We will connect it to the problems of overfitting, regularization, and model priors. In many models, the number of parameters is directly related to the number of inputs or features; hence, there is an obvious link between overfitting and feature selection. We will make this link explicit.
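To make the connection concrete (the notation here is ours, not necessarily that of the lecture), many embedded methods minimize a regularized empirical risk of the form

    min_w  R_emp(w) + lambda * ||w||_1

where the penalty term drives individual weights w_i, and hence the corresponding input features, to exactly zero as lambda grows.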
This lecture will introduce algorithms that select features in the process of learning, including gradient-based feature scaling; nested subset methods; the use of performance bounds; direct objective optimization; and decision trees. It will provide the necessary basis to develop new custom-made algorithms for domain-specific applications.
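As one standard instance of direct objective optimization (a common textbook example, not necessarily one of the lecture's specific algorithms), an L1-regularized linear classifier selects features as a by-product of training, in line with the objective sketched above:

    # One well-known embedded method (illustrative): L1-regularized
    # logistic regression drives the weights of irrelevant features to
    # exactly zero, so feature selection happens during training itself.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=300, n_features=30,
                               n_informative=4, random_state=0)
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    clf.fit(X, y)
    kept = np.flatnonzero(clf.coef_[0])  # indices of surviving features
    print(f"{kept.size} of {X.shape[1]} features kept:", kept)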
Lecture 3 (45 min.): Causality and feature selection: Limitations of feature selection methods that ignore the data selection process

Determining and exploiting causal relationships is central in human reasoning and decision-making. The goal of determining causal relationships is to predict the consequences of given actions. This is fundamentally different from the classical machine learning setting, which focuses on making predictions from observations. Observations imply no manipulation of the system under study, whereas actions introduce a disruption in its natural functioning. Most of machine learning does not attempt to uncover causal relationships in data, as this is unnecessary for making good predictions in stationary environments. For instance, in medical diagnosis, the abundance of a protein in serum may be used as a predictor of disease no matter whether the protein is a cause of the disease (resulting from a gene mutation), a consequence (an antibody responding to inflammation), or neither. If one is solely interested in a diagnosis, a correlation between protein abundance and symptom is enough.

However, knowledge of causal relationships becomes very useful for making predictions in non-stationary environments and, in particular, under the action of external factors or the manipulations of external agents. Policy making in health care, economics, or ecology provides examples of manipulations whose consequences it is desirable to know ahead of time. For instance, smoking and coughing are both predictive of respiratory disease: one is a cause, the other a symptom. Acting on the cause can change your health status, but acting on the symptom cannot. It is thus extremely important to distinguish between causes and symptoms to predict the consequences of actions, such as the effect of banning smoking in public places.

In this lecture, we will examine situations in which the knowledge of causal relationships benefits feature selection. Such benefits may include: explaining relevance in terms of causal mechanisms, distinguishing between actual features and experimental artifacts, predicting the consequences of actions performed by external agents, and making predictions in non-stationary environments.
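As a toy illustration of the observation/manipulation distinction (the probabilities below are invented for the example), consider a simulation in which smoking causes disease and disease causes coughing; observing the absence of the symptom changes the apparent disease rate, while intervening to suppress the symptom does not:

    # Toy simulation: smoking -> disease -> coughing.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    smoking = rng.random(n) < 0.3
    disease = rng.random(n) < np.where(smoking, 0.4, 0.05)
    cough   = rng.random(n) < np.where(disease, 0.9, 0.1)

    # Observation: among people seen not coughing, disease is rare.
    print("P(disease | cough observed false) =", disease[~cough].mean())

    # Intervention do(cough := false): coughing is forced off for
    # everyone, but the causal mechanism for disease is untouched,
    # so the disease rate stays at its baseline value.
    print("P(disease | do(cough := false))  =", disease.mean())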
Lecture 4 (45 min.): Causation and prediction: Basic causal discovery algorithms. Prediction under manipulation.

Links and resources
Contact information

Lecturer: Isabelle Guyon