We consider feature selection in the setting
of many irrelevant features. We show that, using a certain prior, the sample
complexity of Bayesian feature selection---that
is, the number of training examples needed to learn ``well''---grows only
\emph{logarithmically} in the number of
irrelevant features. This means that, from an information-theoretic point of
view, the algorithm can learn even when the number of irrelevant features is
\emph{exponential} in the number of training examples. This result recovers the best
known such rates proved in the frequentist setting (Littlestone, 1988; Kivinen
and Warmuth, 1994; Ng, 1998), and improves on the standard wrapper approach
(John et al., 1994) to feature selection.
Experimental
results also show that the method is highly
tolerant of the presence of many irrelevant features.
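As an illustrative restatement of the scaling claim (a sketch only; the symbols
$m$ for the number of training examples and $N$ for the number of irrelevant
features are notation introduced here, not taken from the analysis itself):
\[
  % notation $m$, $N$ introduced for illustration only
  m \;=\; O(\log N)
  \quad\text{equivalently}\quad
  N \;=\; e^{\Omega(m)},
\]
that is, a sample complexity logarithmic in $N$ is precisely what permits
learning with a number of irrelevant features exponential in $m$.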