In this thesis, we explore the combination of different lexical normalization methods and provide a practical lexical normalization pipeline for Swedish student writing within the framework of SWEGRAM (Näsman et al., 2017). An important feature of our implementation is that the pipeline design takes into account the unique morphological and phonological characteristics of the Swedish language. This localization makes the system more robust for Swedish at the cost of being less applicable to similar tasks in other languages. The core of the localization is a phonetic algorithm that we designed specifically for Swedish, together with a processing step for Swedish compounds. The proposed pipeline consists of four steps: preprocessing, identification of out-of-vocabulary words, generation of normalization candidates, and candidate selection. We use a different approach for each step. We perform experiments on the Uppsala Corpus of Student Writings (UCSW) (Megyesi et al., 2016) and evaluate the results in terms of precision, recall and accuracy. We present the techniques applied to the raw data and their impact on the final result. Our evaluation shows that the pipeline is useful for the lexical normalization task and that our phonetic algorithm is effective for Swedish.
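The four steps named in the abstract can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy lexicon and its frequency counts are invented, the function names are hypothetical, and candidate generation uses plain string similarity from the standard library's `difflib` rather than the thesis's Swedish phonetic algorithm or its compound processing step.

```python
from difflib import get_close_matches

# Hypothetical toy lexicon with invented frequency counts (not from the thesis).
LEXICON = {"jag": 100, "tycker": 40, "om": 80, "skolan": 30, "mycket": 60}


def preprocess(text):
    """Step 1: lowercase, split on whitespace, strip surrounding punctuation."""
    tokens = (tok.strip(".,!?;:\"'()") for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def is_oov(token):
    """Step 2: a token is out-of-vocabulary if the lexicon does not contain it."""
    return token not in LEXICON


def generate_candidates(token, n=3, cutoff=0.6):
    """Step 3: propose lexicon words similar to the token.

    String similarity stands in here for the phonetic matching
    described in the abstract; it is not the author's method.
    """
    return get_close_matches(token, list(LEXICON), n=n, cutoff=cutoff)


def select_candidate(token, candidates):
    """Step 4: pick the most frequent candidate, or keep the token unchanged."""
    if not candidates:
        return token
    return max(candidates, key=lambda cand: LEXICON[cand])


def normalize(text):
    """Run the four steps in order over a piece of text."""
    normalized = []
    for token in preprocess(text):
        if is_oov(token):
            token = select_candidate(token, generate_candidates(token))
        normalized.append(token)
    return normalized


print(normalize("Jag tyker om skolan."))  # ['jag', 'tycker', 'om', 'skolan']
```

In this sketch the misspelling "tyker" is the only out-of-vocabulary token, and "tycker" is the only lexicon entry similar enough to pass the cutoff, so it is selected. A token with no candidate above the cutoff is left unchanged.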
Identifier | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-352450 |
Date | January 2018 |
Creators | Liu, Yuhan |
Publisher | Uppsala universitet, Institutionen för lingvistik och filologi |
Source Sets | DiVA Archive at Uppsala University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |