From Historical Newspapers to Machine-Readable Data: The Origami OCR Pipeline

Although historical newspapers have recently gained a lot of attention in the digital humanities, transforming them into machine-readable data by means of OCR poses some major challenges. To address these challenges, we have developed an end-to-end OCR pipeline named Origami. This pipeline is part of a current project on the digitization and quantitative analysis of the German newspaper “Berliner Börsen-Zeitung” (BBZ), from 1872 to 1931. The Origami pipeline reuses existing open source OCR components and, on top of these, offers a new configurable architecture for layout detection, simple table recognition, a two-stage X-Y cut for reading order detection, and a new robust implementation for document dewarping. In this paper we describe the different stages of the workflow and discuss how they meet the above-mentioned challenges posed by historical newspapers.

info:eu-repo/classification/ddc/006

ddc:006

info:eu-repo/classification/ddc/800

ddc:800

Identifier	oai:union.ndltd.org:DRESDEN/oai:qucosa:de:qucosa:92168
Date	20 June 2024
Creators	Liebl, Bernhard; Burghardt, Manuel
Publisher	CEUR-WS.org
Source Sets	Hochschulschriftenserver (HSSS) der SLUB Dresden
Language	English
Detected Language	English
Type	info:eu-repo/semantics/publishedVersion, doc-type:conferenceObject, info:eu-repo/semantics/conferenceObject, doc-type:Text
Rights	info:eu-repo/semantics/openAccess
