JISE




Journal of Information Science and Engineering, Vol. 16 No. 6, pp. 903-922




U. Pal, P. K. Kundu and B. B. Chaudhuri
Computer Vision and Pattern Recognition Unit 
Indian Statistical Institute 
203, B. T. Rd., Calcutta, 
700035 India 
E-mail: {umapada, bbc}@isical.ac.in


    This paper deals with an OCR (Optical Character Recognition) error detection and correction technique for a highly inflectional Indian language, Bangla, the second-most popular language in India and the fifth-most popular language in the world. The technique is based on morphological parsing: using two separate lexicons of root words and suffixes, the candidate root-suffix pairs of each input string are detected, their grammatical agreement is tested, and the part (root or suffix) in which the error has occurred is noted. The correction is made to the corresponding erroneous part of the input string by means of a fast dictionary access technique. To do so, information about the error patterns generated by the OCR system is examined, and alternative strings are generated for each erroneous word. Among the alternative strings, those satisfying grammatical agreement between root and suffix are finally chosen as suggested words. The desired word appears in the list of suggested words generated by the system in 84.22% of cases.
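
The pipeline summarized above (split into root and suffix, test agreement, generate alternatives from OCR error patterns, keep those that agree) can be illustrated with a minimal sketch. This is not the authors' implementation. The lexicons, the romanized placeholder words, the part-of-speech tags, the confusion table, and the names `agrees`, `parse`, `variants` and `suggest` are all hypothetical. The sketch assumes single-character substitution errors only and uses plain dictionary lookups in place of the paper's fast dictionary access technique.

```python
# Hypothetical toy lexicons (romanized placeholders, not data from the paper).
ROOTS = {"kor": "verb", "bol": "verb", "boi": "noun"}   # root -> category
SUFFIXES = {                                            # suffix -> categories it may attach to
    "": {"verb", "noun"},
    "chi": {"verb"},
    "be": {"verb"},
    "ta": {"noun"},
}
# Hypothetical OCR confusion table: recognized character -> likely intended characters.
CONFUSIONS = {"l": ["i"], "i": ["l"], "d": ["b"], "e": ["c"]}


def agrees(root, suffix):
    """True if both parts are in the lexicons and the suffix may attach to the root."""
    return root in ROOTS and suffix in SUFFIXES and ROOTS[root] in SUFFIXES[suffix]


def parse(word):
    """Return every (root, suffix) split of the word that passes the agreement test."""
    return [(word[:i], word[i:]) for i in range(1, len(word) + 1)
            if agrees(word[:i], word[i:])]


def variants(part):
    """Alternative strings obtained by one substitution from the confusion table."""
    out = []
    for i, ch in enumerate(part):
        for alt in CONFUSIONS.get(ch, []):
            out.append(part[:i] + alt + part[i + 1:])
    return out


def suggest(word):
    """Return the word itself if it parses, otherwise a sorted list of suggestions."""
    if parse(word):
        return [word]
    found = set()
    for i in range(1, len(word) + 1):
        root, suffix = word[:i], word[i:]
        if root in ROOTS:
            # Root looks valid: treat the suffix as the erroneous part.
            found.update(root + s for s in variants(suffix) if agrees(root, s))
        if suffix in SUFFIXES:
            # Suffix looks valid: treat the root as the erroneous part.
            found.update(r + suffix for r in variants(root) if agrees(r, suffix))
    return sorted(found)


if __name__ == "__main__":
    for w in ["korchi", "korehi", "dolbe", "boichi"]:
        print(w, "->", suggest(w))
```

In this toy setup, `boichi` splits into a known root and a known suffix that fail the agreement test, so the sketch also tries alternatives for such pairs. How the paper itself decides which part is in error is not detailed in the abstract.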


Keywords: OCR (Optical Character Recognition), error detection, error correction, Indian language, morphological parsing, suffix, inflectional language
