
A Pipeline for Automatic Lexical Normalization of Swedish Student Writings

In this thesis, we explore the combination of different lexical normalization methods and provide a practical lexical normalization pipeline for Swedish student writings within the framework of SWEGRAM (Näsman et al., 2017). A key feature of our implementation is that the pipeline design takes into account the morphological and phonological characteristics of the Swedish language. This localization makes the system more robust for Swedish, at the cost of being less applicable to similar tasks in other languages. The core of the localization is a phonetic algorithm that we designed specifically for Swedish, together with a compound processing step that handles Swedish compounding. The proposed pipeline consists of four steps: preprocessing, identification of out-of-vocabulary words, generation of normalization candidates, and candidate selection. For each step we use different approaches. We perform experiments on the Uppsala Corpus of Student Writings (UCSW) (Megyesi et al., 2016) and evaluate the results in terms of precision, recall, and accuracy. We present the techniques applied to the raw data and their impact on the final result. Our evaluation shows that the pipeline is useful for the lexical normalization task and that our phonetic algorithm is effective for the Swedish language.
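The four steps named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the toy lexicon, its frequencies, and all function names are hypothetical. Candidate generation here uses plain one-edit string distance and selection uses word frequency, standing in for the Swedish phonetic algorithm and compound processing, which the abstract does not specify.

```python
import re

# Hypothetical toy lexicon with made-up frequencies (not from the thesis).
LEXICON = {"jag": 50, "tycker": 20, "om": 40, "skolan": 15, "skola": 10}
ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"


def preprocess(text):
    """Step 1: lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def is_oov(token):
    """Step 2: a token is out-of-vocabulary if the lexicon lacks it."""
    return token not in LEXICON


def generate_candidates(token):
    """Step 3: lexicon words reachable by one delete, replace,
    transpose, or insert (a stand-in for the thesis's methods)."""
    splits = [(token[:i], token[i:]) for i in range(len(token) + 1)]
    edits = set()
    for left, right in splits:
        if right:
            edits.add(left + right[1:])  # delete one character
            edits.update(left + c + right[1:] for c in ALPHABET)  # replace
        if len(right) > 1:
            edits.add(left + right[1] + right[0] + right[2:])  # transpose
        edits.update(left + c + right for c in ALPHABET)  # insert
    return sorted(e for e in edits if e in LEXICON)


def select_candidate(token, candidates):
    """Step 4: choose the most frequent candidate; keep the token if none."""
    if not candidates:
        return token
    return max(candidates, key=lambda w: LEXICON[w])


def normalize(text):
    """Run the four steps over a text and return normalized tokens."""
    out = []
    for tok in preprocess(text):
        if is_oov(tok):
            tok = select_candidate(tok, generate_candidates(tok))
        out.append(tok)
    return out
```

With this toy lexicon, `normalize("Jag tyker om skolan")` returns `['jag', 'tycker', 'om', 'skolan']`. Tokens with no candidate within one edit are left unchanged.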

Identifier oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-352450
Date January 2018
Creators Liu, Yuhan
Publisher Uppsala universitet, Institutionen för lingvistik och filologi
Source Sets DiVA Archive at Uppsala University
Language English
Detected Language English
Type Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format application/pdf
Rights info:eu-repo/semantics/openAccess
