1

Unsupervised Normalisation of Historical Spelling: A Multilingual Evaluation

Bergman, Nicklas. January 2018.
Historical texts are an important resource for researchers in the humanities. However, standard NLP tools typically perform poorly on them, mainly due to the spelling variations present in such texts. One possible solution is to normalise the spelling variations to equivalent contemporary word forms before using standard tools. Weighted edit distance has previously been used for such normalisation, improving over the results of algorithms based on standard edit distance. Aligned training data is needed to extract the weights, but such data is scarce, so an unsupervised method for extracting edit distance weights is desirable. This thesis presents a multilingual evaluation of an unsupervised method for extracting edit distance weights for normalisation of historical spelling variations. The model is evaluated for English, German, Hungarian, Icelandic and Swedish. The results are mixed and show high variance across data sets. The method generally performs better than normalisation based on standard edit distance but, as expected, does not quite reach the results of a model trained on aligned data. Normalisation accuracy increases compared to standard edit distance normalisation for all languages except German, which shows slightly reduced accuracy, and Swedish, which performs on par with standard edit distance normalisation.
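The weighted edit distance approach described in this abstract can be illustrated with a small sketch. This is not the thesis's implementation: the substitution weights and the toy lexicon below are invented for illustration, whereas the thesis derives such weights from data (and, in the unsupervised setting, without aligned training pairs).

```python
def weighted_edit_distance(source, target, sub_weights, default_cost=1.0):
    """Dynamic-programming edit distance with character-pair substitution
    weights; a lower weight means a more plausible historical variant."""
    m, n = len(source), len(target)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * default_cost          # deletions only
    for j in range(1, n + 1):
        dp[0][j] = j * default_cost          # insertions only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if source[i - 1] == target[j - 1]:
                sub = 0.0                    # match costs nothing
            else:
                # weight for this character pair, falling back to the default
                sub = sub_weights.get((source[i - 1], target[j - 1]), default_cost)
            dp[i][j] = min(
                dp[i - 1][j] + default_cost,     # deletion
                dp[i][j - 1] + default_cost,     # insertion
                dp[i - 1][j - 1] + sub,          # substitution or match
            )
    return dp[m][n]


def normalise(word, lexicon, sub_weights):
    """Map a historical word form to the closest contemporary lexicon entry."""
    return min(lexicon, key=lambda cand: weighted_edit_distance(word, cand, sub_weights))


# Hypothetical weights: cheap substitutions common in older Swedish spelling.
weights = {("w", "v"): 0.1, ("f", "v"): 0.1, ("e", "ä"): 0.2}
lexicon = {"vara", "över", "ofta"}
print(normalise("wara", lexicon, weights))   # -> "vara" (w -> v is a cheap substitution)
print(normalise("öfver", lexicon, weights))  # -> "över"
```

In the unsupervised setting evaluated in the thesis, the weight table would be estimated from the corpora themselves rather than specified by hand as above.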
2

Spelling Normalisation and Linguistic Analysis of Historical Text for Information Extraction

Pettersson, Eva. January 2016.
Historical text constitutes a rich source of information for historians and other researchers in the humanities. Many texts are, however, not available in electronic format, and even when they are, there is a lack of NLP tools designed to handle historical text. In my thesis, I aim to provide a generic workflow for automatic linguistic analysis and information extraction from historical text, with spelling normalisation as a core component in the pipeline. In the spelling normalisation step, the historical input text is automatically normalised to a more modern spelling, enabling the use of existing taggers and parsers trained on modern language data in the succeeding linguistic analysis step. In the final information extraction step, certain linguistic structures are identified based on the annotation labels given by the NLP tools, and ranked in accordance with the specific information need expressed by the user. An important consideration in my implementation is that the pipeline should be applicable to different languages, time periods, genres, and information needs by simply substituting the language resources used in each module. Furthermore, the reuse of existing NLP tools developed for the modern language is crucial, given the lack of linguistically annotated historical data combined with the high variability in historical text, which makes it hard to train NLP tools specifically aimed at analysing historical text. In my evaluation, I show that spelling normalisation can be a very useful technique for easy access to historical information content, even in cases where there is little (or no) annotated historical training data available. For the specific information extraction task of automatically identifying verb phrases describing work in Early Modern Swedish text, 91 out of the 100 top-ranked instances are true positives in the best setting.
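A minimal sketch of how such a modular pipeline could be wired together is given below. The component functions are hypothetical placeholders, not the tools used in the thesis; the point is only to show how language-specific resources could be swapped per module while the overall workflow stays fixed.

```python
from typing import Callable, Dict, List


def run_pipeline(
    historical_text: str,
    normalise_spelling: Callable[[str], str],
    tag_and_parse: Callable[[str], List[Dict]],
    extract_structures: Callable[[List[Dict]], List[str]],
    rank_candidates: Callable[[List[str]], List[str]],
) -> List[str]:
    """Three-step workflow: spelling normalisation, linguistic analysis with
    modern-language tools, then information extraction and ranking."""
    modernised = normalise_spelling(historical_text)   # step 1: normalise historical spelling
    annotations = tag_and_parse(modernised)            # step 2: tag/parse with tools trained on modern data
    candidates = extract_structures(annotations)       # step 3: identify target structures,
                                                       #         e.g. verb phrases describing work
    return rank_candidates(candidates)                 # rank by the user's information need
```

Substituting a different normaliser, tagger, or extraction component changes the language, period, or information need without altering the pipeline itself, which is the modularity the abstract describes.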
