Learning discriminative feature transforms may be an easier problem than feature selection

Kari Torkkola
Motorola Labs

Optimal feature selection coupled with a classifier leads to a combinatorial problem, since ideally every combination of available features would be evaluated. Greedy algorithms based on sequential feature selection using any filter criterion are suboptimal, as they fail to find a feature set that would jointly maximize the criterion. For this reason, finding a transformation to a lower-dimensional space may be easier than selecting features, given an appropriate differentiable objective function. We discuss the mutual information between class labels and transformed features as such a criterion. Instead of Shannon's definition, we use measures based on Renyi entropy, which lends itself to an efficient implementation and to an interpretation in terms of "information potentials" and "information forces" induced by samples of data.
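To make the "information potential" notion concrete, here is a minimal sketch (not the paper's implementation) of the standard Parzen-window estimate behind quadratic Renyi entropy, H2(X) = -log ∫ p(x)^2 dx. With Gaussian kernels, the integral of the squared density estimate reduces to a double sum of pairwise Gaussian interactions between samples; that sum is the information potential V, and its gradient with respect to each sample yields the corresponding information force. Function names and the kernel width sigma are illustrative choices.

```python
import numpy as np

def gaussian(d2, sigma2, dim):
    # Value of a dim-dimensional spherical Gaussian with variance sigma2,
    # evaluated at squared distance d2 from its center.
    return np.exp(-d2 / (2.0 * sigma2)) / (2.0 * np.pi * sigma2) ** (dim / 2.0)

def information_potential(X, sigma=1.0):
    """Parzen estimate of the integral of p(x)^2 (the 'information potential' V).

    Convolving two Gaussian kernels of variance sigma^2 yields a Gaussian of
    variance 2*sigma^2, so the integral collapses to a pairwise sum over
    samples -- no numerical integration is needed.
    """
    n, dim = X.shape
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # squared distances
    return gaussian(d2, 2.0 * sigma ** 2, dim).sum() / n ** 2

def renyi_quadratic_entropy(X, sigma=1.0):
    # Quadratic Renyi entropy: H2 = -log V.
    return -np.log(information_potential(X, sigma))
```

Because V is a differentiable function of the samples, and the samples are differentiable functions of the transform parameters, this kind of estimate can be plugged into gradient-based optimization of the transform; tightly clustered samples produce a large potential (low entropy), spread-out samples a small one (high entropy).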