On Model Selection in Clustering

Volker Roth and Tilman Lange

Institute for Computational Science 
ETH Zentrum
Zürich, Switzerland

One of the fundamental problems of learning theory is model assessment: given a specific data set, how can one practically measure the generalization performance of a model trained on that data? In supervised learning, the standard technique is cross-validation: only a subset of the data is used for training, and the remaining data are held out for testing in order to estimate the expected risk of the predictor. For unsupervised learning, no comparable standard technique exists for estimating how well an algorithm generalizes. Furthermore, the problem of model order selection arises, i.e., estimating the "correct" number of clusters. This number is part of the input for supervised problems, but it is not available for unsupervised problems.
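The cross-validation scheme described above can be sketched in a few lines. The 1-nearest-neighbour predictor and the synthetic two-class data below are illustrative stand-ins of our own choosing, not part of the paper:

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_neighbour_fit(xs, ys):
    """A 1-nearest-neighbour predictor: 'training' just stores the data."""
    def predict(x):
        j = min(range(len(xs)), key=lambda i: dist2(x, xs[i]))
        return ys[j]
    return predict

def k_fold_risk(xs, ys, fit, k=5, seed=0):
    """Estimate the expected 0/1 risk by k-fold cross-validation:
    train on k-1 folds, test on the held-out fold, average the errors."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    errors = []
    for fold in folds:
        held = set(fold)
        train = [i for i in idx if i not in held]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        errors.append(sum(predict(xs[i]) != ys[i] for i in fold) / len(fold))
    return sum(errors) / k

# Two well-separated synthetic classes (illustrative data only).
rng = random.Random(42)
xs = [(rng.gauss(c, 0.5), rng.gauss(c, 0.5)) for c in (0.0, 5.0) for _ in range(40)]
ys = [0] * 40 + [1] * 40
print(k_fold_risk(xs, ys, nearest_neighbour_fit))
```

On such well-separated data the estimated risk is close to zero; the point of the held-out folds is that the error is measured on data the predictor never saw during training.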

We present a common point of view, which provides a unified framework for model assessment in these different areas of machine learning. The main idea is that an algorithm generalizes well if its solution on one data set shows small disagreement with its solution on another data set. This idea is independent of the amount of label information supplied to the problem; the challenge is to define disagreement in a meaningful way. Our main emphasis lies on developing model assessment procedures for unsupervised clustering.
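As a concrete illustration of the disagreement idea for clustering, the following sketch compares k-means solutions on two disjoint halves of a data set: the first solution is transferred to the second half by nearest-centroid assignment, and the disagreement is minimized over all label permutations (cluster labels are arbitrary). The plain Lloyd's k-means with farthest-first seeding and the toy data are our own illustrative choices, not the paper's implementation:

```python
import random
from itertools import permutations

def dist2(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def init_centroids(points, k):
    """Farthest-first traversal: deterministic seeding that spreads
    the initial centroids across well-separated groups."""
    cents = [points[0]]
    while len(cents) < k:
        cents.append(max(points, key=lambda p: min(dist2(p, c) for c in cents)))
    return cents

def kmeans(points, k, iters=50):
    """Plain Lloyd's algorithm; returns centroids and point labels."""
    centroids = init_centroids(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist2(p, centroids[j])) for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

def instability(points, k, seed=0):
    """Cluster two disjoint halves, transfer the first solution to the
    second half via nearest centroids, and return the minimal fraction
    of disagreeing labels over all permutations (feasible for small k)."""
    pts = points[:]
    random.Random(seed).shuffle(pts)
    half = len(pts) // 2
    a, b = pts[:half], pts[half:2 * half]
    cent_a, _ = kmeans(a, k)
    _, labels_b = kmeans(b, k)
    transferred = [min(range(k), key=lambda j: dist2(p, cent_a[j])) for p in b]
    best = min(sum(perm[t] != l for t, l in zip(transferred, labels_b))
               for perm in permutations(range(k)))
    return best / half

# Three well-separated blobs: the true model order k = 3 is highly stable.
rng = random.Random(1)
centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
data = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
        for cx, cy in centers for _ in range(30)]
print(instability(data, 3))
```

Scanning `instability(data, k)` over candidate values of k and preferring small values is the model order selection heuristic in miniature; note that k = 1 is trivially stable, so small k must be treated with care.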

Data clustering denotes a family of techniques, frequently employed in exploratory data analysis, for extracting "natural" group structure from data. Such groupings need to be validated in order to separate true signal in the data from spurious structure. In this context, finding an appropriate number of clusters is a particularly important model selection question.

We demonstrate the model selection capabilities of the stability method both on toy examples and on real-world clustering applications, with particular emphasis on the problem of finding segments in natural images. While stable clusterings do not necessarily correspond to the "desired" segmentation solutions (which probably cannot be expected from any internal model assessment strategy), we demonstrate that the absence of stable solutions is a good indicator of model mismatch.