1. Extracting Textual Data from Historical Newspaper Scans and its Challenges for 'Guerilla-Projects'
Wehrheim, Lino; Liebl, Bernhard; Burghardt, Manuel. 11 July 2024
In 2022, it is a commonplace that digital historical newspapers (DHN) have become increasingly available. Despite the undeniable progress in the supply of DHN and in the methods for rigorous quantitative analysis, working with DHN still poses various pitfalls, especially when scholars use data provided by third parties such as libraries or commercial providers. Reporting from a current project, we share our experiences and communicate the various problems we faced while working with DHN. After a short project summary, we present the main problems we encountered that we think may also be relevant for other scholars, particularly those who work in small research groups. We arrange these problems along an archetypal workflow divided into three steps: corpus acquisition, corpus evaluation, and corpus preparation. By raising some red flags, we want to call attention to what we think are common DHN-related problems, to raise awareness of potential pitfalls, and thereby to provide some guidelines for scholars who consider using DHN for their research.
2. Named-entity recognition in Czech historical texts: Using a CNN-BiLSTM neural network model
Hubková, Helena. January 2019
The thesis presents named-entity recognition in Czech historical newspapers from the Modern Access to Historical Sources project. Our goal was to create a specific corpus and annotation manual for the project and to evaluate neural network methods for named-entity recognition within this task. We created the corpus from scanned Czech historical newspapers, whose pages were converted to digital text by optical character recognition (OCR). The data were preprocessed by removing some OCR errors. We also defined named entity types specific to our task and created an annotation manual with examples for the project, on the basis of which we annotated the final corpus. To find the most suitable neural network model for our task, we experimented with different architectures, namely long short-term memory (LSTM), bidirectional LSTM, and CNN-BiLSTM models. Moreover, we experimented with randomly initialized word embeddings trained during the training process and with pretrained word embeddings for contemporary Czech, published as open source by fastText. We achieved the best result, an F1 score of 0.444, using the CNN-BiLSTM model with the pretrained fastText embeddings. We found that, when using the neural network model, we do not need to normalize the spelling of our historical texts to bring it closer to the contemporary language. We also provide a qualitative analysis of the observed linguistic phenomena. We found that some word forms and word pairs that were infrequent in our training data were mistagged or not tagged at all; based on this, we can say that larger data sets could improve the results.
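The abstract names the CNN-BiLSTM architecture without implementation detail. As a rough sketch, a token tagger of this kind, with character-level CNN features concatenated with word embeddings and fed through a bidirectional LSTM, might look as follows in PyTorch; all dimensions and vocabulary sizes are illustrative assumptions, not values from the thesis:

```python
import torch
import torch.nn as nn

class CNNBiLSTMTagger(nn.Module):
    """CNN over characters + word embeddings -> BiLSTM -> per-token tag logits."""

    def __init__(self, word_vocab, char_vocab, num_tags,
                 word_dim=300, char_dim=30, char_filters=30, hidden=256):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, word_dim, padding_idx=0)
        self.char_emb = nn.Embedding(char_vocab, char_dim, padding_idx=0)
        # 1D convolution over each word's characters, max-pooled to one vector
        self.char_cnn = nn.Conv1d(char_dim, char_filters, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(word_dim + char_filters, hidden,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_tags)

    def forward(self, words, chars):
        # words: (batch, seq); chars: (batch, seq, max_word_len)
        b, s, w = chars.shape
        c = self.char_emb(chars).view(b * s, w, -1).transpose(1, 2)
        c = torch.relu(self.char_cnn(c)).max(dim=2).values.view(b, s, -1)
        x = torch.cat([self.word_emb(words), c], dim=-1)
        h, _ = self.lstm(x)          # (batch, seq, 2 * hidden)
        return self.out(h)           # (batch, seq, num_tags)
```

In a setup like the one described, the word embedding matrix would be initialized from the pretrained fastText vectors for contemporary Czech, and training would minimize token-level cross-entropy over the annotated corpus.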
3. News article segmentation using multimodal input: Using Mask R-CNN and sentence transformers / Artikelsegmentering med multimodala artificiella neuronnätverk: Med hjälp av Mask R-CNN och sentence transformers
Henning, Gustav. January 2022
In this century and the last, serious efforts have been made to digitize the content housed by libraries across the world. In order to open these volumes up to content-based information retrieval, independent elements such as headlines, body text, bylines, images, and captions ideally need to be connected semantically as article-level units. Querying on facets such as author, section, content type, or other metadata requires further processing of these documents. Even though humans have shown an exceptional ability to segment different types of elements into related components, even in languages foreign to them, this task has proven difficult for computers. The challenge of semantic segmentation in newspapers lies in the diversity of the medium: newspapers have vastly different layouts covering diverse content, from news articles to ads to weather reports. State-of-the-art object detection and segmentation models have been trained to detect and segment real-world objects, and it is not clear whether these architectures perform equally well when applied to scanned images of printed text. In the domain of newspapers, in addition to the images themselves, we have access to textual information through optical character recognition. The recent progress in instance segmentation of real-world objects using deep learning techniques begs the question: can the same methodology be applied in the domain of newspaper articles? In this thesis we investigate one possible approach to encoding the textual signal into the image in an attempt to improve performance. Based on newspapers from the National Library of Sweden, we investigate the predictive power of visual and textual features and their capacity to generalize across different typographic designs. Results show impressive mean Average Precision scores (>0.9) for test sets sampled from the same newspaper designs as the training data when using only the image modality. / In this and the previous century, great efforts have been made to digitize media content that was traditionally available only in print. Supporting search and faceting in this content requires processing at the semantic level, that is, dividing the content up at the article level rather than per page. Although humans find it easy to divide content at the semantic level, even in a foreign language, work on automating this task continues. The challenge in segmenting news articles lies in the diversity of appearance and format. The content is equally diverse, ranging from factual articles to debates, lists of facts and announcements, advertisements, and weather reports, among other things. Great strides have been made in deep learning for object detection and semantic segmentation in just the last decade. The question we ask is: can the same methodology be applied in the domain of newspaper articles? These models were created to classify real-world objects, while in this domain we have access to the text and its coordinates via a potentially flawed optical character recognition. We examine one way of exploiting this textual information in an attempt to improve the results in this specific domain. Based on data from Kungliga Biblioteket, we investigate how well this method lends itself to dividing up newspaper content across time periods in which the design changes markedly. The results show that Mask R-CNN is well suited for use in the domain of news article segmentation, even without the text as input to the model.
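The segmentation backbone described here is a standard instance segmentation model; a minimal sketch of adapting a COCO-pretrained Mask R-CNN to newspaper regions with torchvision might look as follows (the two-class setup, background plus a single article class, is an assumed placeholder, and the thesis's encoding of the textual signal into the image is not shown):

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

def article_segmentation_model(num_classes=2):
    """Mask R-CNN pretrained on COCO, with its heads replaced for
    newspaper article instances (num_classes includes the background)."""
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
    # Swap the box classification head for our class count
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    # Swap the mask prediction head likewise
    in_channels = model.roi_heads.mask_predictor.conv5_mask.in_channels
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels, 256, num_classes)
    return model
```

Fine-tuning then proceeds as for any torchvision detection model, with page scans and article-level masks as training targets.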
4. From Historical Newspapers to Machine-Readable Data: The Origami OCR Pipeline
Liebl, Bernhard; Burghardt, Manuel. 20 June 2024
While historical newspapers have recently gained a lot of attention in the digital humanities, transforming them into machine-readable data by means of OCR poses some major challenges. In order to address these challenges, we have developed an end-to-end OCR pipeline named Origami. This pipeline is part of a current project on the digitization and quantitative analysis of the German newspaper "Berliner Börsen-Zeitung" (BBZ) from 1872 to 1931. The Origami pipeline reuses existing open-source OCR components and, on top of them, offers a new configurable architecture for layout detection, simple table recognition, a two-stage X-Y cut for reading order detection, and a new robust implementation of document dewarping. In this paper we describe the different stages of the workflow and discuss how they meet the above-mentioned challenges posed by historical newspapers.
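The abstract names a two-stage X-Y cut for reading order but does not spell it out. For orientation, here is a minimal sketch of the classic recursive X-Y cut that this family of methods builds on; the Block type, the min_gap threshold, and the gap heuristic are illustrative assumptions, not Origami's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Block:
    # bounding box of one detected layout region, in page coordinates
    x0: float
    y0: float
    x1: float
    y1: float

def xy_cut(blocks, vertical_first=True, min_gap=5.0):
    """Order layout blocks for reading by recursively splitting the page
    along the widest empty gap, alternating between column (vertical)
    and row (horizontal) cuts."""
    if len(blocks) <= 1:
        return list(blocks)
    for vertical in (vertical_first, not vertical_first):
        lo = (lambda b: b.x0) if vertical else (lambda b: b.y0)
        hi = (lambda b: b.x1) if vertical else (lambda b: b.y1)
        bs = sorted(blocks, key=lo)
        best, cut = min_gap, None
        reach = hi(bs[0])  # rightmost/bottommost edge seen so far
        for i in range(1, len(bs)):
            gap = lo(bs[i]) - reach  # empty space before the next block
            if gap >= best:
                best, cut = gap, i
            reach = max(reach, hi(bs[i]))
        if cut is not None:  # split at the widest gap, flip cut direction
            return (xy_cut(bs[:cut], not vertical_first, min_gap)
                    + xy_cut(bs[cut:], not vertical_first, min_gap))
    return sorted(blocks, key=lambda b: (b.y0, b.x0))  # fallback: raster order
```

Applied to the region boxes produced by layout detection, this yields a linear sequence that downstream text export can follow.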