KDD 2009 1/2 day tutorial:


Practical Feature Selection
from Correlation to Causality

Isabelle Guyon

July 28, 2009
Paris, France


Variable and feature selection have become the focus of much research in areas of application for which datasets with tens or hundreds of thousands of variables are available. These areas include text processing of internet documents, gene expression array analysis, and combinatorial chemistry. The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data. This tutorial will cover a wide range of aspects of such problems: providing a better definition of the objective function, feature construction, feature ranking, multivariate feature selection, efficient search methods, and feature validity assessment methods.

Most feature selection methods do not attempt to uncover causal relationships between feature and target and focus instead on making best predictions. We will examine situations in which the knowledge of causal relationships benefits feature selection. Such benefits may include: explaining relevance in terms of causal mechanisms, distinguishing between actual features and experimental artifacts, predicting the consequences of actions performed by external agents, and making predictions in non-stationary environments.

The objective of this tutorial is to provide a theoretical framework and practical tools to address the problem of feature selection in a variety of situations:
-    when the number of samples and/or features varies across several orders of magnitude,
-    when the features or the target are binary, categorical, or continuous,
-    when the features are sparse or dense,
-    when the classes are balanced or imbalanced,
-    when the features are independent or correlated,
-    when the data are clean or plagued with noise or experimental artifacts,
-    when the data are i.i.d. or when the distribution changes, possibly as a result of interventions on the system by external agents.

A variety of algorithms will be reviewed and contrasted. Particular attention will be given to feature validity assessment methods, via statistical testing and/or proper use of cross-validation.
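As a small illustration of what a statistical feature validity assessment can look like, here is a minimal sketch (added for this write-up, not part of the tutorial materials; it assumes only NumPy and uses synthetic data). It computes a permutation-test p-value for the null hypothesis that a single feature is irrelevant to the target, using the absolute Pearson correlation as the test statistic.

```python
import numpy as np

def permutation_pvalue(x, y, n_perm=10_000, seed=0):
    """p-value of the null hypothesis 'feature x is irrelevant to target y',
    using the absolute Pearson correlation as the test statistic."""
    rng = np.random.RandomState(seed)
    stat = abs(np.corrcoef(x, y)[0, 1])
    null = [abs(np.corrcoef(x, rng.permutation(y))[0, 1]) for _ in range(n_perm)]
    return (1 + sum(s >= stat for s in null)) / (1 + n_perm)

# Toy usage: one relevant and one irrelevant feature.
rng = np.random.RandomState(1)
y = rng.randn(100)
print(permutation_pvalue(y + rng.randn(100), y))   # small p-value: relevant
print(permutation_pvalue(rng.randn(100), y))       # large p-value: irrelevant
```

When many features are tested this way, the p-values must of course be corrected for multiple testing, which is one of the assessment issues the tutorial addresses.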

Audience

This tutorial will be accessible to a broad audience, including practitioners, researchers, and students, who want to catch up with the latest developments in feature selection. Some previous exposure to machine learning/data mining problems is preferable, but not necessary. 

Outline and slides

Lecture 1 (45 min.): Introduction to feature selection: filters and wrappers

Feature selection is an essential step in machine learning, performed either separately from or jointly with the learning process. We will present some applications that pose challenges to feature selection because of the high dimensionality of the input space (text categorization, drug screening, gene selection). We will also give a quick review of machine learning vocabulary (supervised/unsupervised learning, risk minimization, overfitting, maximum likelihood, etc.). Correlation methods and other methods for ranking features in order of relevance will be presented. Such methods will be contrasted with wrapper methods, which use a learning machine to assess the usefulness of subsets of features. Because it is infeasible to try all possible feature subsets when the number of features is large, wrappers rely on a search method to explore feature space. We will review search strategies, including forward and backward selection, floating search, simulated annealing, and genetic algorithms. One difficult problem with wrappers is avoiding overfitting: by searching feature space intensively, it is easy to find a subset of features that performs well on a limited number of examples. We will compare the search methods in terms of their computational and statistical complexity to determine which ones are the most promising. The class will be complemented by a review of cross-validation techniques, which must be mastered when using wrapper methods.
The goal of this lecture is to provide the students with easy-to-implement methods, which are powerful enough to match or exceed the performance of the best-ranking methods in the feature selection challenge we organized.
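As an illustration of the filter/wrapper distinction discussed above, here is a minimal sketch (added for this write-up, not part of the tutorial materials; it assumes NumPy and scikit-learn and uses a synthetic data matrix X and binary target y). It ranks features by absolute Pearson correlation (a filter) and then runs a greedy forward-selection wrapper scored by cross-validation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: X is (n_samples, n_features), y is a binary target.
rng = np.random.RandomState(0)
X = rng.randn(200, 50)
y = (X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.randn(200) > 0).astype(int)

# --- Filter: rank features by absolute Pearson correlation with the target ---
corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
ranking = np.argsort(-np.abs(corr))          # best features first
print("Top 5 features by |correlation|:", ranking[:5])

# --- Wrapper: greedy forward selection scored by cross-validation ---
def forward_selection(X, y, max_features=5):
    selected, remaining = [], list(range(X.shape[1]))
    clf = LogisticRegression(max_iter=1000)
    while remaining and len(selected) < max_features:
        scores = [(cross_val_score(clf, X[:, selected + [j]], y, cv=5).mean(), j)
                  for j in remaining]
        best_score, best_j = max(scores)
        selected.append(best_j)
        remaining.remove(best_j)
        print(f"added feature {best_j}, CV accuracy = {best_score:.3f}")
    return selected

forward_selection(X, y)
```

Note how the wrapper re-estimates cross-validated performance for every candidate subset; this repeated evaluation is where the overfitting risk mentioned above comes from when the search is intensive and the sample is small.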

Lecture 2 (45 min.): Embedded methods of feature selection: Learning theory put to work to build feature selection algorithms

Ockham's Razor is the principle proposed by William of Ockham in the fourteenth century: "Pluralitas non est ponenda sine necessitate", which translates as "entities should not be multiplied unnecessarily". This principle provides guidance in modeling: of two theories providing similarly good predictions, prefer the simplest one. Hence, according to Ockham, we should shave off unnecessary parameters of our models. In the brain, information is stored in billions of synapses connecting the brain cells, or neurons. Young children grow synapses in excess. Exposure to enriched environments with extra sensory and social stimulation enhances synaptic connectivity, but children and adolescents also lose synapses, at rates of up to 20 million per day. This synaptic pruning is believed to play an important role in learning. In this lecture, we will justify theoretically the role of feature selection in learning and connect it to the problems of overfitting, regularization, and model priors. In many models, the number of parameters is directly related to the number of inputs or features; hence, there is an obvious link between overfitting and feature selection. We will make this link explicit. This lecture will introduce algorithms that select features in the process of learning, including gradient-based feature scaling, nested subset methods, the use of performance bounds, direct objective optimization, and decision trees. It will provide the necessary basis to develop new custom-made algorithms for domain-specific applications.
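As a concrete example of the kinds of embedded methods listed above, here is a minimal sketch (added for this write-up, not part of the tutorial materials; it assumes NumPy and scikit-learn and synthetic data). An L1-regularized linear model drives the weights of irrelevant features to exactly zero, so selection happens inside the training objective; recursive feature elimination, a nested subset method, is shown for comparison.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE

# Synthetic regression data: only the first 3 of 30 features are relevant.
rng = np.random.RandomState(0)
X = rng.randn(300, 30)
y = 2 * X[:, 0] - X[:, 1] + 0.5 * X[:, 2] + 0.1 * rng.randn(300)

# Direct objective optimization: the L1 penalty zeroes out irrelevant weights.
lasso = Lasso(alpha=0.1).fit(X, y)
print("Features kept by the Lasso:", np.flatnonzero(lasso.coef_))

# Nested subset method: recursive feature elimination with a linear classifier.
y_class = (y > 0).astype(int)
rfe = RFE(LinearSVC(dual=False), n_features_to_select=3).fit(X, y_class)
print("Features kept by RFE:", np.flatnonzero(rfe.support_))
```

The Lasso example corresponds to "direct objective optimization" above: selection is a by-product of the regularized training objective, which is what distinguishes embedded methods from filters and wrappers.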

Lecture 3 (45 min.): Causality and feature selection: Limitations of feature selection methods that ignore the data selection process

Determining and exploiting causal relationships is central in human reasoning and decision-making. The goal of determining causal relationships is to predict the consequences of given actions. This is fundamentally different from the classical machine learning setting, which focuses on making predictions from observations. Observations imply no manipulation of the system under study, whereas actions introduce a disruption in its natural functioning. Most of machine learning does not attempt to uncover causal relationships in data, because this is unnecessary for making good predictions in stationary environments. For instance, in medical diagnosis, the abundance of a protein in serum may be used as a predictor of disease, no matter whether the protein is a cause of the disease (resulting from a gene mutation), a consequence (an antibody responding to inflammation), or neither. If one is solely interested in a diagnosis, a correlation between protein abundance and symptom is enough. However, the knowledge of causal relationships becomes very useful for making predictions in non-stationary environments and, in particular, under the action of external factors or manipulations by external agents. Policy making in health care, economics, or ecology provides examples of manipulations whose consequences it is desirable to know ahead of time. For instance, smoking and coughing are both predictive of respiratory disease: one is a cause and the other a symptom. Acting on the cause can change your health status; acting on the symptom cannot. It is therefore extremely important to distinguish between causes and symptoms in order to predict the consequences of actions such as forbidding smoking in public places. In this lecture, we will examine situations in which the knowledge of causal relationships benefits feature selection. Such benefits may include: explaining relevance in terms of causal mechanisms, distinguishing between actual features and experimental artifacts, predicting the consequences of actions performed by external agents, and making predictions in non-stationary environments.
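To make the cause-versus-symptom argument concrete, here is a toy simulation (added for this write-up, not one of the tutorial's examples; it assumes only NumPy). A cause X1 and a symptom X2 are both strongly correlated with the disease variable Y in observational data, but once an external agent manipulates X2, setting it independently of Y, X2 loses all predictive power while X1 keeps it.

```python
import numpy as np

rng = np.random.RandomState(0)
n = 100_000

# Observational regime: X1 causes Y, and Y causes the symptom X2.
x1 = rng.randn(n)                       # cause (e.g. smoking level)
y = x1 + 0.5 * rng.randn(n)             # disease status
x2 = y + 0.5 * rng.randn(n)             # symptom (e.g. coughing)
print("observational: corr(X1,Y) = %.2f, corr(X2,Y) = %.2f"
      % (np.corrcoef(x1, y)[0, 1], np.corrcoef(x2, y)[0, 1]))

# Manipulated regime: an external agent sets X2, severing the link Y -> X2.
x2_do = rng.randn(n)                    # X2 is now fixed by intervention
print("manipulated:   corr(X1,Y) = %.2f, corr(X2,Y) = %.2f"
      % (np.corrcoef(x1, y)[0, 1], np.corrcoef(x2_do, y)[0, 1]))
```

In the observational regime both variables look equally good to a purely predictive feature selector; only the causal feature remains useful once the system is manipulated, which is the point developed in this lecture and the next.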

Lecture 4 (45 min.): Causation and prediction: Basic causal discovery algorithms. Prediction under manipulation.
Research in causal modeling is driven by the growing needs of applications for predicting the consequences of actions and assisting policy making in many domains, including biology, medicine and pharmacology, epidemiology, climatology, social and economic sciences, marketing, neuroscience, psychology, law enforcement and crime prevention, manufacturing, quality control, and fault or security diagnosis. Recording observational data is becoming less and less expensive, so applications in which thousands to millions of variables are recorded are not uncommon. But intervening on a system and proactively changing variable values, which remains the gold standard for establishing causal relationships, or in other words "experimenting", is usually several orders of magnitude more expensive. In two challenges we recently organized, we evaluated the capability of models to learn causal relationships from purely observational data and then make predictions of the consequences (on a target variable) of "would-be" actions performed on a subset of variables. These challenges demonstrated that (i) the causal structure can be identified to a certain extent from observational data, and (ii) there is a significant correlation between correct causal structure identification and target value predictions for data drawn from a "post-manipulation distribution". This lecture will review success stories of algorithms and applications of causal discovery from these recent challenges [Note: the slides of this lecture will be updated with the analysis of the latest challenge].
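To hint at how causal structure can be recovered from observational data alone, here is a minimal illustration of the conditional-independence reasoning used by basic constraint-based causal discovery algorithms (added for this write-up, not one of the challenge datasets; it assumes only NumPy and reuses the toy chain X1 -> Y -> X2 from the previous sketch). X1 and X2 are clearly correlated, but their partial correlation given Y is approximately zero, so an algorithm relying on such tests would conclude that no direct edge between X1 and X2 is needed.

```python
import numpy as np

rng = np.random.RandomState(0)
n = 100_000
x1 = rng.randn(n)
y = x1 + 0.5 * rng.randn(n)             # chain: X1 -> Y -> X2
x2 = y + 0.5 * rng.randn(n)

def partial_corr(a, b, z):
    """Correlation of a and b after removing the linear effect of z."""
    ra = a - np.polyval(np.polyfit(z, a, 1), z)
    rb = b - np.polyval(np.polyfit(z, b, 1), z)
    return np.corrcoef(ra, rb)[0, 1]

print("corr(X1, X2)     = %.2f" % np.corrcoef(x1, x2)[0, 1])   # clearly non-zero
print("corr(X1, X2 | Y) = %.2f" % partial_corr(x1, x2, y))     # approximately zero
```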

Links and resources

Contact information

Lecturer:
Isabelle Guyon
Clopinet Enterprises
955, Creston Road,
Berkeley, CA 94708, U.S.A.
Tel/Fax: (510) 524 6211