Optimal feature selection coupled with a classifier leads to a combinatorial
problem, since ideally all combinations of the available features would be
evaluated. Greedy algorithms based on sequential feature selection using any
filter criterion are suboptimal, as they can fail to find a feature set that
jointly maximizes the criterion.
For this very reason, finding a transformation to lower dimensions might
be easier than selecting features, given an appropriate differentiable
objective function. We discuss mutual information between class labels
and transformed features as such a criterion. Instead of Shannon's definition
we use measures based on Rényi entropy, which lends itself to an efficient
implementation and to an interpretation in terms of "information potentials"
and "information forces" induced by the data samples.
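As a minimal sketch of the kind of quantity involved, the Rényi quadratic entropy H2(X) = -log ∫ p(x)² dx can be estimated non-parametrically from samples with a Parzen window of Gaussian kernels: the integral reduces to a pairwise sum over the data (the "information potential"), because the convolution of two Gaussians of variance σ² is a Gaussian of variance 2σ². The function below is an illustrative implementation under these assumptions, not the paper's exact code; the kernel width `sigma` is a free parameter.

```python
import numpy as np

def renyi_quadratic_entropy(X, sigma=1.0):
    """Estimate the Renyi quadratic entropy H2 = -log integral p(x)^2 dx
    using a Parzen density estimate with Gaussian kernels of width sigma.

    The pairwise double sum V below is the 'information potential'; each
    term can be read as an interaction between two data samples.
    """
    X = np.atleast_2d(np.asarray(X, dtype=float))
    n, d = X.shape
    # Pairwise squared Euclidean distances between all samples.
    diff = X[:, None, :] - X[None, :, :]
    sq_dists = np.sum(diff ** 2, axis=-1)
    # Convolving two Gaussian kernels of variance sigma^2 gives a
    # Gaussian of variance 2*sigma^2, so the plug-in estimate of
    # integral p(x)^2 dx needs kernels of that doubled variance.
    var = 2.0 * sigma ** 2
    norm = (2.0 * np.pi * var) ** (-d / 2.0)
    V = norm * np.mean(np.exp(-sq_dists / (2.0 * var)))
    return -np.log(V)
```

Because this estimator is a smooth function of the sample coordinates, its gradient with respect to each sample exists in closed form, which is what makes gradient-based optimization of a dimension-reducing transformation feasible.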