Improving the quality of the text, a pilot project to assess and correct the OCR in a multilingual environment

Users of a digitized collection expect that a full-text search will retrieve all relevant results. In reality, errors introduced during Optical Character Recognition (OCR) degrade the results significantly, and users do not get what they expect. The National Library of Luxembourg started its digitization program in 2000 and began performing OCR on the scanned images in 2005. The OCR was always performed by the scanning suppliers, so over the years many different OCR programs, in different versions, have been used. The manual parts of the digitization chain (handling, scanning, zoning, …) are difficult and costly, and their cost can hardly be reduced, so the library decided that the supplier should focus on high quality for these parts. OCR, by contrast, is an automated process, and since OCR software improves over the years, the library believed that the recognized text could later be improved automatically. This is why the library has never asked the supplier for a minimum recognition rate.
The author proposes to test this assumption by first evaluating the baseline quality of the text extracted by the original supplier, then running a contemporary OCR program, and finally comparing the quality of its output with that of the first extraction. The corpus is the collection of digitized newspapers from Luxembourg, published from the 18th to the 20th century. A complicating factor is that the corpus contains three main languages, German, French and Luxembourgish, which often appear together on a single newspaper page. A preliminary step is therefore added to detect the language of each block of text so that the correct dictionaries and OCR engines can be used.

full text, OCR, quality

Volltext, OCR, Qualität

info:eu-repo/classification/ddc/004

ddc:004

Identifier	oai:union.ndltd.org:DRESDEN/oai:qucosa:de:qucosa:16445
Date	16 October 2017
Creators	Maurer, Yves
Publisher	Sächsische Landesbibliothek - Staats- und Universitätsbibliothek Dresden
Source Sets	Hochschulschriftenserver (HSSS) der SLUB Dresden
Language	English
Detected Language	English
Type	doc-type:conferenceObject, info:eu-repo/semantics/conferenceObject, doc-type:Text
Rights	info:eu-repo/semantics/openAccess
Relation	urn:nbn:de:bsz:14-qucosa2-163412, qucosa:16341
