Global ETD Search

1	Extracting Textual Data from Historical Newspaper Scans and its Challenges for 'Guerilla-Projects'
	Wehrheim, Lino; Liebl, Bernhard; Burghardt, Manuel
	11 July 2024
	In 2022, it is commonplace that digital historical newspapers (DHN) have become increasingly available. Despite the undeniable progress in the supply of DHN and in the methods for rigorous quantitative analysis, working with DHN still poses various pitfalls, especially when scholars use data provided by third parties such as libraries or commercial providers. Reporting from a current project, we want to share our experiences and communicate the problems we faced while working with DHN. After a short project summary, we present the main problems that we faced in our project and that we think might also be relevant for other scholars, particularly those who work in small research groups. We arrange these problems according to an archetypal workflow, which is divided into three steps: corpus acquisition, corpus evaluation, and corpus preparation. By raising some red flags, we want to call attention to what we think are common DHN-related problems, to raise awareness of potential pitfalls, and, in this way, to provide some guidelines for scholars who consider using DHN for their research.
	info:eu-repo/classification/ddc/000 ddc:000
2	From Historical Newspapers to Machine-Readable Data: The Origami OCR Pipeline
	Liebl, Bernhard; Burghardt, Manuel
	20 June 2024
	While historical newspapers have recently gained a lot of attention in the digital humanities, transforming them into machine-readable data by means of OCR poses some major challenges. To address these challenges, we have developed an end-to-end OCR pipeline named Origami. This pipeline is part of a current project on the digitization and quantitative analysis of the German newspaper “Berliner Börsen-Zeitung” (BBZ) from 1872 to 1931. The Origami pipeline reuses existing open-source OCR components and, on top of them, offers a new configurable architecture for layout detection, simple table recognition, a two-stage X-Y cut for reading order detection, and a new robust implementation of document dewarping. In this paper we describe the different stages of the workflow and discuss how they meet the above-mentioned challenges posed by historical newspapers.
	info:eu-repo/classification/ddc/006 ddc:006 info:eu-repo/classification/ddc/800 ddc:800
