Super-supervised learning for large database cleaning.

N. Matic, I. Guyon, L. Bottou, J. Denker, and V. Vapnik.
Technical Report 11359-920723-17TM, AT&T Bell Laboratories.

A method for computer-aided removal of undesirable patterns from large databases has been developed. The method uses the trainable classifier itself to point out the patterns that should be checked by the human supervisor. Suspicious patterns that are meaningless or mislabeled are considered garbage and removed, while the remaining suspicious patterns, such as ambiguous or atypical ones, are kept in the database. Combined with an emphasizing training scheme, our pattern-cleaning method halves the error rate on the test set for a database of handwritten lowercase characters entered on a touch terminal.
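
The cleaning loop described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the classifier ranks patterns, so the scoring rule used here (negative log of the probability assigned to the recorded label) is an assumption. The function names `flag_suspicious` and `apply_review`, the verdict strings, and the toy data are also hypothetical.

```python
import math

def flag_suspicious(probs, labels, top_k):
    """Return indices of the top_k patterns the classifier finds most surprising.

    Assumption: "surprise" is -log p(recorded label); the report may use
    a different criterion.
    """
    scores = []
    for i, (p, y) in enumerate(zip(probs, labels)):
        p_true = max(p[y], 1e-12)  # guard against log(0)
        scores.append((-math.log(p_true), i))
    # Highest surprise first; ties broken by index for determinism.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scores[:top_k]]

def apply_review(n_patterns, verdicts):
    """Drop only patterns the human supervisor marked as garbage.

    Ambiguous or atypical patterns stay in the database.
    """
    garbage = {i for i, v in verdicts.items()
               if v in ("meaningless", "mislabeled")}
    return [i for i in range(n_patterns) if i not in garbage]

# Toy classifier outputs for four patterns over two classes (made-up values).
probs = [
    [0.90, 0.10],
    [0.20, 0.80],
    [0.05, 0.95],
    [0.55, 0.45],
]
labels = [0, 1, 0, 1]

flagged = flag_suspicious(probs, labels, top_k=2)

# Hypothetical supervisor verdicts for the flagged patterns.
verdicts = {2: "mislabeled", 3: "ambiguous"}
kept = apply_review(len(labels), verdicts)
```

In this toy run the supervisor only inspects the two flagged patterns rather than the whole database. The mislabeled one is removed and the ambiguous one is retained.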

Keywords: database cleaning, outliers, on-line character recognition, neural networks, handwritten characters, touch terminal, robust statistics.
