Suggested open problems for the NIPS 2002 workshop
-------------------------------------------------------------------------------
Compiled by isabelle@clopinet.com

Vladimir Vapnik

Build theory and algorithms for the transduction principle. Transduction has been very successful in applications; see e.g. J. Weston, F. Perez-Cruz, O. Bousquet, O. Chapelle, A. Elisseeff and B. Schoelkopf, "Feature Selection and Transduction for Prediction of Molecular Bioactivity for Drug Design", Bioinformatics. Yet the development of methods and theory for transduction is in its infancy. Transduction does not aim at building a predictor. Rather, given a set of examples, a subset of which is provided with target values, how does one predict the missing target values? (One transductive method is sketched after Tom Dietterich's problems below.)

Stuart Russell

1) Unify statistical approaches to learning (including kernel methods) with expressive relational, symbolic representations, particularly so that new representational elements (predicates, functions, constants) can be constructed.

2) Related to 1: enable the use of prior knowledge in learning in such a way that the outputs of learning are used cumulatively to facilitate new learning, in a closed cycle. This requires forms of knowledge to be used and constructed that differ from traditional deterministic function representations, conditional distributions, generative and discriminative models, etc.

3) Find representations (and methods of using those representations) for weak/partial knowledge of how to behave so as to achieve goals and/or maximize expected utility in complex environments, such that these can be used to constrain the learning of new tasks.

Chris Watkins

i) Why don't genetic algorithms work? (They should, but they don't.) (Actually, this is no longer so open, as I have a sufficient answer to why current ones don't, but I'm still not sure how to make them work...)

ii) How can a neurally plausible system learn grammar, parse, and attach meaning by, in some sense, analysing grammar?

Alex Smola

a) Project onto random directions in reproducing kernel Hilbert spaces without incurring the possibly infinite complexity, or show that some other set of samples from a manifold can be sufficient for random embeddings.

b) How do we define a "reasonable" norm on the equivalent of bilinear forms on semirings?

c) Can you perform gappy matches in linear time for all subsets? Or, to put it differently, what are the lower bounds for dynamic programming? (The quadratic-time baseline is sketched after Tom Dietterich's problems below.)

Terry Sejnowski

How do brains predict the future? What is the brain's objective function?

Thomas G. Dietterich

1. How can we evaluate anomaly detection systems? What is the loss function? What is the methodology?

2. The primary way that we communicate our prior knowledge to learning algorithms is through the features that we provide. These are typically designed by hand, based on the prior knowledge of the engineer. How can we automate the design of features?

3. What is a principled way of solving machine learning problems in which each instance X can have several labels simultaneously? This is the multiple-label problem. For example, X might be a document, and the labels the categories under which it should be indexed. We would want to assume that each X takes on a relatively small number of possible labels. Assume that the training data consist only of (X, y) pairs, where y is only a SINGLE label. The loss function should probably give separate penalties for each missing or extra label (a sketch of such a loss follows below).
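To make the transductive setting in Vladimir Vapnik's problem concrete, here is a minimal Python sketch of one transductive method, label propagation on a k-nearest-neighbor graph. It is purely illustrative: the method, function names, and parameters are assumptions of this write-up, not the algorithm of the Weston et al. paper.

    import numpy as np

    def propagate_labels(X, y, k=5, n_iter=50):
        """Transduction sketch: X is the full (n, d) sample; y holds
        -1/+1 on the labeled points and 0 where targets are missing.
        Returns predicted signs for every point in the sample."""
        n = len(X)
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
        W = np.zeros((n, n))
        for i in range(n):                       # symmetric kNN affinity graph
            for j in np.argsort(d2[i])[1:k + 1]:
                W[i, j] = W[j, i] = np.exp(-d2[i, j])
        f = y.astype(float)
        labeled = y != 0
        for _ in range(n_iter):
            f = (W @ f) / np.maximum(W.sum(1), 1e-12)  # average over neighbors
            f[labeled] = y[labeled]                    # clamp the known targets
        return np.sign(f)

The point of the sketch is that predictions are only ever produced for the fixed unlabeled points given up front; no global predictor is built.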
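As background for Alex Smola's problem c), here is the standard quadratic-time dynamic program that counts the occurrences of a pattern p as a gappy (non-contiguous) subsequence of a string s; the open question concerns beating bounds of this kind. The function name is an assumption of this write-up.

    def subseq_count(s, p):
        """Number of ways p occurs in s as a (gappy) subsequence,
        in O(|s| * |p|) time and O(|p|) space."""
        dp = [1] + [0] * len(p)          # dp[j]: count of p[:j] seen so far
        for c in s:
            for j in range(len(p), 0, -1):   # backwards so each c is used once
                if c == p[j - 1]:
                    dp[j] += dp[j - 1]
        return dp[-1]

    print(subseq_count("abab", "ab"))  # 3: positions (0,1), (0,3), (2,3)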
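To make the loss function in Tom Dietterich's problem 3 concrete, here is a minimal sketch with separate penalties for missing and extra labels, as suggested above. The names and default weights are illustrative assumptions.

    def multilabel_loss(predicted, true, c_missing=1.0, c_extra=1.0):
        """predicted, true: sets of labels for one instance X."""
        missing = true - predicted     # labels the system failed to assign
        extra = predicted - true       # labels it assigned spuriously
        return c_missing * len(missing) + c_extra * len(extra)

    # A document indexed under {"trade", "grain"}, predicted as {"trade", "oil"}:
    print(multilabel_loss({"trade", "oil"}, {"trade", "grain"}))  # 2.0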
Olvi Mangasarian

Given prior knowledge in the form of polyhedral or nonpolyhedral convex sets, how do you incorporate such knowledge into a nonlinear kernel SVM classifier? The case of polyhedral prior-knowledge sets has been solved for linear kernels and is being presented at NIPS 2002 as paper AA05, "Knowledge-Based Support Vector Machine Classifiers".

Philip Dawid

1) Why should we believe that cross-validation (in any of its variants) gives a reliable assessment of out-of-sample performance? Develop a theoretical foundation for this, and compare with alternatives such as prequential validation. For the latter, see e.g. my encyclopedia article at http://www.ucl.ac.uk/Stats/research/psfiles/141.zip. My own belief is that cross-validation is nothing like the "gold standard" it is usually assumed, without evidence, to be, and that it compares badly with prequential validation --- but part of the problem is to develop ways of performing the comparison that are not themselves question-begging. (A minimal prequential loop is sketched at the end of this compilation.)

2) Can we develop a general theory (of, e.g., transduction) that is not tied to the explicit or implicit assumption that cases are drawn at random from a fixed distribution? Again, I think that the prequential approach, which makes no such assumption, actually makes much of the theory easier, by stripping away irrelevant dead wood.

Tommy Poggio

1) Is square-loss regularization as good as the SVM for binary classification?

2) Is "one vs all" the best multiclass technique? (A sketch touching both questions appears at the end of this compilation.)

Grace Wahba

One problem is the choice of kernel for various unusual data structures, for instance on DNA sequences. Alternatively, find good/best preprocessing via nonlinear transformations on the observations in some systematic way (simple trial and error can be good if you have a big enough test set).

David Lewis

1) What are principled, user-friendly ways of incorporating prior knowledge in supervised learning that do not (in contrast to the Bayesian paradigm) assume that "truth" lies in your hypothesis space?

2) Develop a theoretical framework, preferably one that provides useful empirical guidance, for active learning from finite samples of unlabeled data. For extra credit, develop label-efficient techniques for producing unbiased estimates of the test-set effectiveness of classifiers trained by active learning. (An uncertainty-sampling sketch appears at the end of this compilation.)

3) Develop a theoretical framework for online learning in finite space with drifting concepts defined over an infinite number of predictor features.

Isabelle Guyon

1) What value of K in K-fold cross-validation gives the best bias/variance tradeoff? This problem is not simplified by the recent result of Yoshua Bengio that there is no unbiased estimator of the variance of K-fold cross-validation. It is obviously data dependent. Are there statistics that can be computed from the data to give a PAC answer to that question? (See the K-fold sketch at the end of this compilation.)

2) Greedy methods have been found to overfit less than fancier search techniques. What is the underlying regularization mechanism?

3) Quantum learning. Humans seem to make quantum leaps in learning certain tasks. Is there a simple model that could account for such "quantum learning"? These jumps seem to coincide with the acquisition of higher levels of abstraction and of a language to describe these abstractions. The model may need to merge inductive and deductive learning (see the related problem by Stuart Russell).
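To make the contrast in Philip Dawid's problem 1 concrete, here is a minimal sketch of prequential ("predict, then update") validation: every case is forecast before it is used for training, so the assessment never reuses data and needs no i.i.d. assumption. The running-mean forecaster and squared loss are illustrative choices, not part of the original problem statement.

    def prequential_score(stream):
        """Cumulative one-step-ahead squared error of a running-mean
        forecaster over a sequence of numeric observations."""
        total, n, mean = 0.0, 0, 0.0
        for x in stream:
            if n > 0:
                total += (x - mean) ** 2   # predict the new case first ...
            n += 1
            mean += (x - mean) / n         # ... then update on it
        return total

    print(prequential_score([1.0, 2.0, 3.0, 2.0]))  # 3.25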
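Tommy Poggio's two questions can be combined in one small sketch: binary classifiers trained by square-loss (ridge) regularization on a linear model, reduced to multiclass by "one vs all". The linear model and all names are assumptions of this write-up; kernelized versions behave analogously.

    import numpy as np

    def train_rls(X, y, lam=1e-2):
        """Regularized least squares: w = (X'X + lam*I)^(-1) X'y, y in {-1,+1}."""
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    def one_vs_all(X, labels, lam=1e-2):
        """Train one square-loss scorer per class; predict by highest score."""
        classes = sorted(set(labels))
        labels = np.asarray(labels)
        W = [train_rls(X, np.where(labels == c, 1.0, -1.0), lam) for c in classes]
        return lambda x: classes[int(np.argmax([w @ x for w in W]))]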
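For David Lewis's problem 2, here is a minimal sketch of the pool-based active-learning loop such a theory would have to cover: uncertainty sampling, where the learner repeatedly queries the label of the pool point its current model is least sure about. The class-mean scorer and all names are illustrative assumptions.

    import numpy as np

    def uncertainty_sampling(X_pool, oracle, n_queries=10, seed=0):
        """oracle(i) returns the 0/1 label of pool point i on request."""
        rng = np.random.default_rng(seed)
        labels = {}
        for _ in range(n_queries):
            pos = [i for i in labels if labels[i] == 1]
            neg = [i for i in labels if labels[i] == 0]
            unseen = [i for i in range(len(X_pool)) if i not in labels]
            if not pos or not neg:                 # need one point of each class
                i = int(rng.choice(unseen))
            else:
                w = X_pool[pos].mean(0) - X_pool[neg].mean(0)  # class-mean direction
                scores = np.abs(X_pool[unseen] @ w)            # small = near boundary
                i = unseen[int(np.argmin(scores))]
            labels[i] = oracle(i)                  # pay for exactly one label
        return labels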
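To make Isabelle Guyon's problem 1 concrete, here is a minimal sketch that computes the K-fold cross-validation error estimate for several K on the same data, so that its spread can be inspected empirically. The 1-nearest-neighbor learner and synthetic data are placeholders; nothing here settles the bias/variance question.

    import numpy as np

    def kfold_error(X, y, K, rng):
        """K-fold CV estimate of the error of a 1-nearest-neighbor rule."""
        idx = rng.permutation(len(X))
        errs = []
        for fold in np.array_split(idx, K):
            train = np.setdiff1d(idx, fold)
            wrong = 0
            for i in fold:
                d = ((X[train] - X[i]) ** 2).sum(1)      # squared distances
                wrong += int(y[train[np.argmin(d)]] != y[i])
            errs.append(wrong / len(fold))
        return float(np.mean(errs))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 2))
    y = (X[:, 0] > 0).astype(int)
    for K in (2, 5, 10):
        print(K, kfold_error(X, y, K, rng))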