In today's society, a large number of social media users are free to express their opinions on shared platforms. The socio-cultural differences between the people behind those accounts (in terms of ethnicity, gender, sexual orientation, religion, politics, etc.) give rise to a significant share of online discussions that use offensive language, which often harms the psychological well-being of the victims. The endless stream of user-generated content creates a need for an accurate and scalable way to detect offensive language with automated methods. This thesis explores different approaches to the offensiveness detection task, focusing on five languages: Arabic, Danish, English, Greek and Turkish. The results obtained using Support Vector Machines (SVM), Convolutional Neural Networks (CNN) and Bidirectional Encoder Representations from Transformers (BERT) are compared, with some of the tested methods achieving state-of-the-art results. The effects of the embeddings used, the dataset size, the degree of class imbalance and the addition of sentiment features are studied and analysed, as well as the cross-lingual capabilities of pre-trained multilingual models.
Identifier | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:liu-167381 |
Date | January 2020 |
Creators | Pàmies Massip, Marc |
Publisher | Linköpings universitet, Artificiell intelligens och integrerade datorsystem |
Source Sets | DiVA Archive at Uppsala University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |