Global ETD Search

Improvement of Optical Character Recognition on Scanned Historical Documents Using Image Processing

To improve access to historical documents, many institutions have been digitizing their archives ever since Optical Character Recognition (OCR) emerged. Old scanned documents can show deterioration acquired over time or caused by old printing methods. Common visual attributes of such documents include variations in style and font, broken characters, uneven ink intensity, noise, and damage caused by folding or ripping. Many of these attributes are unfavorable for modern OCR tools and can lead to failed character recognition. This study addresses the problem by using image processing methods to improve the result of character recognition. It also analyzes common image quality characteristics of scanned historical documents with unidentifiable text. The OCR tool used in this research was the open-source Tesseract software. Image processing methods such as Gaussian lowpass filtering, Otsu’s optimum thresholding method, and morphological operations were used to prepare the historical documents for Tesseract. The OCR output was evaluated using the precision and recall classification measures: recall improved by 63 percentage points and precision by 18 percentage points. This shows that image pre-processing is an effective way to increase the readability of historical documents for OCR tools. The study also found that the characteristics especially disadvantageous for Tesseract are font deviations, the occurrence of objects that do not belong to the text, character fading, broken characters, and Poisson noise.
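The abstract names Otsu's thresholding and a precision and recall evaluation without showing either, so the following is a minimal sketch of both in pure Python, not the thesis's implementation. The function names `otsu_threshold` and `precision_recall` are hypothetical, and matching OCR output to the ground truth as a multiset of whitespace-separated words is an assumption; the thesis may define a match differently. Gaussian filtering, morphological operations, and the Tesseract call itself are left out.

```python
from collections import Counter


def otsu_threshold(pixels):
    """Return the gray level t (0-255) that maximizes between-class variance.

    Pixels with value <= t form the dark class, the rest the bright class.
    `pixels` is a flat list of 8-bit grayscale values.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # pixel count of the dark class
    sum0 = 0  # intensity sum of the dark class
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue  # dark class still empty
        if w1 == 0:
            break     # bright class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def precision_recall(ocr_text, truth_text):
    """Word-level precision and recall (assumed matching rule: word multisets)."""
    ocr = Counter(ocr_text.split())
    truth = Counter(truth_text.split())
    true_pos = sum((ocr & truth).values())  # words found in both, with counts
    precision = true_pos / sum(ocr.values()) if ocr else 0.0
    recall = true_pos / sum(truth.values()) if truth else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy image: half dark ink (10), half bright paper (200).
    pixels = [10] * 50 + [200] * 50
    t = otsu_threshold(pixels)
    binary = [255 if p > t else 0 for p in pixels]
    print("threshold:", t, "white pixels:", binary.count(255))
    print(precision_recall("the old map of Gavle", "the old map of the town"))
```

In this sketch, precision is the share of words in the OCR output that also appear in the ground truth, and recall is the share of ground-truth words that the OCR output recovered.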

http://urn.kb.se/resolve?urn=urn:nbn:se:hig:diva-36244

Image pre-processing

Tesseract

Optical Character Recognition

Historical documents

Precision and Recall

Engineering and Technology

Teknik och teknologier

Computer Systems

Datorsystem

Identifier	oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:hig-36244
Date	January 2021
Creators	Aula, Lara
Publisher	Högskolan i Gävle, Datavetenskap
Source Sets	DiVA Archive at Uppsala University
Language	English
Detected Language	English
Type	Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format	application/pdf
Rights	info:eu-repo/semantics/openAccess
