New Approaches to OCR for Early Printed Books

Books printed before 1800 present major problems for OCR. One of the main obstacles is the lack of diversity of historical fonts in training data. The OCR-D project, consisting of book historians and computer scientists, aims to address this deficiency by focussing on three major issues. Our first target was to create a tool that identifies font groups automatically in images of historical documents. We concentrated on Gothic font groups that were commonly used in German texts printed in the 15th and 16th centuries: the well-known Fraktur and the lesser-known Bastarda, Rotunda, Textura and Schwabacher. The tool was trained with 35,000 images and reaches an accuracy level of 98%. It can distinguish not only the above-mentioned font groups but also Hebrew, Greek, Antiqua and Italic. It can also identify woodcut images and irrelevant data (book covers, empty pages, etc.). In a second step, we created an online training infrastructure (okralact), which allows for the use of various open source OCR engines such as Tesseract, OCRopus, Kraken and Calamari. At the same time, it facilitates the training of models for specific font groups. The high accuracy of the recognition tool paves the way for the unprecedented opportunity to differentiate between the fonts used by individual printers. With more training data and further adjustments, the tool could help to fill a major gap in historical research.

Identifier: oai:union.ndltd.org:DRESDEN/oai:qucosa:de:qucosa:91685
Date: 29 May 2024
Creators: Weichselbaumer, Nikolaus; Seuret, Mathias; Limbach, Saskia; Dong, Rui; Burghardt, Manuel; Christlein, Vincent
Publisher: ICCU
Source Sets: Hochschulschriftenserver (HSSS) der SLUB Dresden
Language: English
Detected Language: English
Type: info:eu-repo/semantics/publishedVersion, doc-type:article, info:eu-repo/semantics/article, doc-type:Text
Rights: info:eu-repo/semantics/openAccess
Relation: 1972-621X
