Design of a linguistic postprocessor using variable memory length Markov models.

I. Guyon and F. Pereira.
Unpublished Technical Report, AT&T Bell Laboratories.

We present the design of a linguistic postprocessor for character recognizers. The central module of our system is a trainable variable memory length Markov model (VLMM), which predicts the next character given a variable-length window of past characters. The overall system is composed of several finite state automata, including the main VLMM and a proper noun VLMM. The best model reported in the literature (Brown et al., 1992) achieves 1.75 bits per character on the Brown corpus. On that same corpus, our model, trained on one tenth as much data, reaches 2.19 bits per character and is 200 times smaller (~160,000 parameters). The model was designed for handwriting recognition applications but can also be used for other OCR problems and for speech recognition.
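To illustrate what predicting the next character from a variable-length window means, and how a bits-per-character figure is computed, here is a minimal sketch. It is not the model described in the report: it is a simple count-based predictor that backs off to the longest context seen often enough in training, with add-one smoothing. The class name VariableContextModel, the parameters max_order and min_count, and the backoff rule are all assumptions made for illustration. Bits per character is taken to be the average of -log2 p(c | context) over the characters of a test text.

```python
import math
from collections import defaultdict


class VariableContextModel:
    """Toy variable-length context predictor (hypothetical, for illustration only)."""

    def __init__(self, max_order=3, min_count=2):
        self.max_order = max_order    # longest context length considered
        self.min_count = min_count    # a context is trusted only if seen this often
        self.counts = defaultdict(lambda: defaultdict(int))
        self.alphabet = set()

    def train(self, text):
        """Count next-character occurrences for every context of length 0..max_order."""
        self.alphabet.update(text)
        for i, ch in enumerate(text):
            for k in range(self.max_order + 1):
                if i - k < 0:
                    break
                self.counts[text[i - k:i]][ch] += 1

    def context_for(self, history):
        """Return the longest suffix of history observed at least min_count times."""
        for k in range(min(self.max_order, len(history)), 0, -1):
            ctx = history[-k:]
            if ctx in self.counts and sum(self.counts[ctx].values()) >= self.min_count:
                return ctx
        return ""  # fall back to the empty context (unigram counts)

    def prob(self, history, ch):
        """Add-one smoothed probability of ch given the selected context.

        Assumes ch belongs to the training alphabet; otherwise the values
        returned for one context no longer sum to 1.
        """
        following = self.counts[self.context_for(history)]
        total = sum(following.values())
        return (following.get(ch, 0) + 1) / (total + len(self.alphabet))

    def bits_per_char(self, text):
        """Average of -log2 p(c | context) over the characters of text."""
        total_bits = 0.0
        for i, ch in enumerate(text):
            history = text[max(0, i - self.max_order):i]
            total_bits -= math.log2(self.prob(history, ch))
        return total_bits / len(text)


if __name__ == "__main__":
    model = VariableContextModel(max_order=2, min_count=2)
    model.train("abababab")
    print(model.context_for("zzba"))   # longest trusted suffix of the history
    print(model.prob("ba", "b"))       # probability that "b" follows "ba"
    print(model.bits_per_char("abab"))
```

The window length here varies per prediction: a long context is used only when it has been seen at least min_count times, which keeps the number of stored contexts, and hence the parameter count, small compared with a fixed-order n-gram table.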

Keywords: Linguistics, finite state automata, probabilities, statistics, statistical languages, statistical grammars, grammar inference, regular languages, handwriting recognition, speech recognition, Markov models, hidden Markov models, n-grams.
