Computer aided cleaning of large databases for character recognition.

N. Matic, I. Guyon, L. Bottou, J. Denker, and V. Vapnik.
In 11th International Conference on Pattern Recognition, volume II, pages 330--333.

A method for computer-aided cleaning of undesirable patterns in large training databases has been developed. The method uses the trainable classifier itself to point out suspicious patterns, which should then be checked by a human supervisor. Suspicious patterns that are meaningless or mislabeled are considered garbage and are removed from the database. The remaining suspicious patterns, such as ambiguous or atypical ones, are valid patterns that are hard to learn and should be kept in the database. By combining this cleaning method with an emphasizing scheme applied to the hard-to-learn patterns (increasing their frequency of presentation), the error rate on the test set was reduced by half on a database of handwritten lowercase characters entered on a touch terminal. Our classifier is based on a Time Delay Neural Network (TDNN) and was trained on a large database of 11,500 characters from approximately 250 writers.
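
The cleaning loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the suspicion score (negative log-probability the classifier assigns to the given label), the function names, and the `boost` repetition factor are all assumptions chosen for illustration, and the human supervisor's verdicts are passed in as a plain dictionary.

```python
import math

def suspicion_scores(probs, labels):
    """Score each pattern by how much the classifier disagrees with its label.

    Assumption: the score is -log p(label); the abstract does not specify one.
    """
    eps = 1e-12  # guard against log(0)
    return [-math.log(max(p[y], eps)) for p, y in zip(probs, labels)]

def flag_for_review(scores, k):
    """Return the indices of the k most suspicious patterns, worst first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

def clean_and_emphasize(n, flagged, is_garbage, boost=2):
    """Build a presentation schedule over n patterns.

    Flagged patterns the supervisor marked as garbage are dropped; flagged
    patterns judged valid (hard to learn) are presented `boost` times.
    `boost` is a hypothetical parameter, not a value from the paper.
    """
    garbage = {i for i in flagged if is_garbage[i]}
    hard = {i for i in flagged if not is_garbage[i]}
    schedule = []
    for i in range(n):
        if i in garbage:
            continue  # meaningless or mislabeled: removed from the database
        schedule.extend([i] * (boost if i in hard else 1))
    return schedule
```

In this sketch, `probs` holds the classifier's per-class output for each pattern and `is_garbage` maps each flagged index to the supervisor's decision. Only the flagged patterns need human inspection, which is what makes the procedure practical for a large database.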
