
A systematic study of offline recognition of Thai printed and handwritten characters

Thai characters pose some unique problems that differ from those of English and other oriental scripts. Thai characters are built from small loops combined with curves, and there are no spaces between words or sentences. Moreover, within each line, Thai characters can be composed on four levels, depending on the type of character being written. This research focuses on OCR for the Thai language: printed and offline handwritten character recognition. The main consideration is to overcome these problems with simple but effective methods. A printed OCR developed by the National Electronics and Computer Technology Center (NECTEC) uses Kohonen self-organising maps (SOMs) for rough classification and back-propagation neural networks for fine classification. An evaluation of the NECTEC OCR is performed on a printed dataset that contains over 0.6 million tokens. Comparisons of the classifier, with and without the aspect ratio, and with and without SOMs, yield small but statistically significant differences in recognition rate. A very straightforward classifier, the nearest neighbour, was examined to evaluate overall recognition performance and to compare with the NECTEC classifier. It shows a significant improvement in recognition rate (about 98%) over the NECTEC classifier (about 96%) on both the original and distorted data (rotated and noisy), but at the expense of longer recognition times.

For offline handwritten character recognition, three different classifiers are evaluated on three different datasets that contain, on average, approximately 10,000 tokens each. The neural network and HMMs are more effective and give higher recognition rates than the nearest neighbour classifier on the three datasets. The best result obtained from the HMMs is 91.1% on the ThaiCAM dataset. However, when evaluated on a different dataset, the recognition rates drop drastically, due to differences in many aspects of online and offline handwritten data. Classification rates improved when the stroke width of characters in the online handwritten dataset was adjusted (12 percentage points) and when the training sets from the three datasets were combined (7.6 percentage points). A boosting algorithm called AdaBoost yields a slight improvement in recognition rate (1.2 percentage points) over the original classifiers (without applying the AdaBoost algorithm).
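
To illustrate the nearest neighbour classifier mentioned in the abstract, the following is a minimal sketch only. The abstract does not state which features or distance measure the thesis uses, so the flattened binary bitmaps, the squared Euclidean distance, the toy 3x3 glyphs, and the names `squared_distance`, `nearest_neighbour`, and `TRAINING` are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Assumption: each character image is a flattened binary bitmap
# (here a toy 3x3 grid, i.e. 9 values). Real features may differ.
Bitmap = Sequence[int]


def squared_distance(a: Bitmap, b: Bitmap) -> int:
    """Squared Euclidean distance between two equal-length bitmaps."""
    if len(a) != len(b):
        raise ValueError("bitmaps must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbour(query: Bitmap, training: List[Tuple[Bitmap, str]]) -> str:
    """Return the label of the training sample closest to the query."""
    if not training:
        raise ValueError("training set is empty")
    # Every query is compared against every stored sample.
    _, best_label = min(training, key=lambda item: squared_distance(query, item[0]))
    return best_label


# Hypothetical toy training set: a closed loop and a vertical bar.
TRAINING = [
    ((1, 1, 1, 1, 0, 1, 1, 1, 1), "loop"),
    ((0, 1, 0, 0, 1, 0, 0, 1, 0), "bar"),
]

# A vertical bar with one pixel missing, standing in for a noisy sample.
noisy_bar = (0, 1, 0, 0, 1, 0, 0, 0, 0)
print(nearest_neighbour(noisy_bar, TRAINING))
```

Because a brute-force search of this kind scans the whole training set for each query, its cost grows with the number of stored samples, which is consistent with the longer recognition times reported above.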

Identifier: oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:548263
Date: January 2011
Creators: Sae-Tang, Sutat
Contributors: Carter, John ; Damper, Robert
Publisher: University of Southampton
Source Sets: Ethos UK
Detected Language: English
Type: Electronic Thesis or Dissertation
Source: https://eprints.soton.ac.uk/206079/
