• Refine Query
  • Source
  • Publication year
  • to
  • Language
  • 1
  • Tagged with
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Curating news sections in a historical Swedish news corpus

Rekathati, Faton January 2020 (has links)
The National Library of Sweden uses optical character recognition software to digitize their collections of historical newspapers. The purpose of such software is first to automatically segment text and images from scanned newspaper pages, and second to read the contents of the identified text regions. While the raw text is often digitized successfully, important contextual information regarding whether the text constitutes for example a header, a section title or the body text of an article is not captured. These characteristics are easy for a human to distinguish, yet they remain difficult for a machine to recognize. The main purpose of this thesis is to investigate how well section titles in the newspaper Svenska Dagbladet can be classified by using so called image embeddings as features. A secondary aim is to examine whether section titles become harder to classify in older newspaper data. Lastly, we explore if manual annotation work can be reduced using the predictions of a semi-supervised classifier to help in the labeling process.  Results indicate the use of image embeddings help quite substantially in classifying section titles. Datasets from three different time periods: 1990-1997, 2004-2013, and 2017 and onwards were sampled and annotated. The best performing model (Xgboost) achieved macro F1 scores of 0.886, 0.936 and 0.980 for the respective time periods. The results also showed classification became more difficult on older newspapers. Furthermore, a semi-supervised classifier managed an average precision of 83% with only single section title examples, showing promise as way to speed up manual annotation of data.

Page generated in 1.4773 seconds