Document layout analysis techniques are used to improve the accuracy of optical character recognition (OCR) on document images. Type-specific document layout analysis localizes and segments specific zones in an image so that they can be recognized by specialized OCR modules. Zones of interest include titles, headers and footers, paragraphs, images, mathematical expressions, chemical equations, musical notation, tables, and circuit diagrams, among others. False positive and false negative detections, oversegmentations, and undersegmentations made during the detection and segmentation stage will confuse a specialized OCR system and may result in garbled, incoherent output. In this work, a mathematical expression detection and segmentation (MEDS) module is implemented and then thoroughly evaluated. The module is fully integrated with the open-source OCR software Tesseract and is designed to function as a component of it. Evaluation is carried out on freely available public-domain images so that existing and future techniques may be objectively compared. / Master of Science
Identifier | oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/46724 |
Date | 19 March 2014 |
Creators | Bruce, Jacob Robert |
Contributors | Electrical and Computer Engineering, Abbott, A. Lynn, Hsiao, Michael S., Xuan, Jianhua |
Publisher | Virginia Tech |
Source Sets | Virginia Tech Theses and Dissertation |
Detected Language | English |
Type | Thesis |
Format | ETD, application/pdf, application/pdf |
Rights | In Copyright, http://rightsstatements.org/vocab/InC/1.0/ |