Complex Word Identification (CWI) is a task of identifying complex words in text data and it is often viewed as a subtask of Automatic Text Simplification (ATS) where the main task is making a complex text simpler. The ways in which a text should be simplified depend on the target readers such as second language learners or people with reading disabilities. In this thesis, we focus on Complex Word Identification for Swedish. First, in addition to exploring existing resources, we collect a new dataset for Swedish CWI. We continue by building several classifiers of Swedish simple and complex words. We then use the findings to analyze the characteristics of lexical complexity in Swedish and English. Our method for collecting training data based on second language learning material has shown positive evaluation scores and resulted in a new dataset for Swedish CWI. Additionally, the built complex word classifiers have an accuracy at least as good as similar systems for English. Finally, the analysis of the selected features confirms the findings of previous studies and reveals some interesting characteristics of lexical complexity.
Identifer | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-352349 |
Date | January 2018 |
Creators | Smolenska, Greta |
Publisher | Uppsala universitet, Institutionen för lingvistik och filologi |
Source Sets | DiVA Archive at Upsalla University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |
Page generated in 0.0118 seconds