Data sets for OCR and document image understanding.

I. Guyon, R.M. Haralick, J. Hull, and I. Phillips.
In H. Bunke and P.S.P. Wang, editors, Handbook on Optical Character Recognition and Document Image Analysis. World Scientific Publishing Company, in press.

This chapter surveys several significant sets of labeled samples that can be used to develop algorithms for offline and online handwriting recognition as well as for machine-printed text recognition. For each data set, the collection method, the number of samples, and the associated truth data are discussed. In the domain of offline handwriting, the CEDAR, NIST, and CENPARMI data sets are presented. These contain primarily isolated digits and alphabetic characters. The UNIPEN data set of online handwriting was collected from a number of independent sources and contains individual characters as well as handwritten phrases. The University of Washington document image databases are also discussed. They contain a large number of English and Japanese document images selected from a range of publications.

Keywords: data sets, databases, text images, online handwriting, offline handwriting, machine-printed text, CEDAR, NIST, CENPARMI, UNIPEN, University of Washington.
