Global ETD Search

New Approaches to OCR for Early Printed Books

Books printed before 1800 present major problems for OCR. One of the main obstacles is the lack of diversity of historical fonts in training data. The OCR-D project, consisting of book historians and computer scientists, aims to address this deficiency by focussing on three major issues. Our first target was to create a tool that identifies font groups automatically in images of historical documents. We concentrated on Gothic font groups that were commonly used in German texts printed in the 15th and 16th centuries: the well-known Fraktur and the lesser-known Bastarda, Rotunda, Textura and Schwabacher. The tool was trained with 35,000 images and reaches an accuracy level of 98%. It can differentiate not only between the above-mentioned font groups but also Hebrew, Greek, Antiqua and Italic. It can also identify woodcut images and irrelevant data (book covers, empty pages, etc.). In a second step, we created an online training infrastructure (okralact), which allows for the use of various open source OCR engines such as Tesseract, OCRopus, Kraken and Calamari. At the same time, it facilitates training for specific models of font groups. The high accuracy of the recognition tool paves the way for the unprecedented opportunity to differentiate between the fonts used by individual printers. With more training data and further adjustments, the tool could help to fill a major gap in historical research.

info:eu-repo/classification/ddc/002

ddc:002

info:eu-repo/classification/ddc/006

ddc:006

Identifier	oai:union.ndltd.org:DRESDEN/oai:qucosa:de:qucosa:91685
Date	29 May 2024
Creators	Weichselbaumer, Nikolaus; Seuret, Mathias; Limbach, Saskia; Dong, Rui; Burghardt, Manuel; Christlein, Vincent
Publisher	ICCU
Source Sets	Hochschulschriftenserver (HSSS) der SLUB Dresden
Language	English
Detected Language	English
Type	info:eu-repo/semantics/publishedVersion, doc-type:article, info:eu-repo/semantics/article, doc-type:Text
Rights	info:eu-repo/semantics/openAccess
Relation	1972-621X
