The ever-growing usage of social media platforms generates daily vast amounts of textual data which could potentially serve as a great source of information. Therefore, mining user-generated data for commercial, academic, or other purposes has already attracted the interest of the research community. However, the informal writing which often characterizes online user-generated texts poses a challenge for automatic text processing with Natural Language Processing (NLP) tools. To mitigate the effect of noise in these texts, lexical normalization has been proposed as a preprocessing method which in short is the task of converting non-standard word forms into a canonical one. The present work aims to contribute to this field by developing a rule-based normalization system for Greek Tweets. We perform an analysis of the categories of the out-of-vocabulary (OOV) word forms identified in the dataset and define hand-crafted rules which we combine with edit distance (Levenshtein distance approach) to tackle noise in the cases under scope. To evaluate the performance of the system we perform both an intrinsic and an extrinsic evaluation in order to explore the effect of normalization on the part-of-speech-tagging. The results of the intrinsic evaluation suggest that our system has an accuracy of approx. 95% compared to approx. 81% for the baseline. In the extrinsic evaluation, it is observed a boost of approx. 8% in the tagging performance when the text has been preprocessed through lexical normalization.
Identifer | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-424777 |
Date | January 2020 |
Creators | Toska, Marsida |
Publisher | Uppsala universitet, Institutionen för lingvistik och filologi |
Source Sets | DiVA Archive at Upsalla University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |
Page generated in 0.002 seconds