Global ETD Search

Return to search

Multilingual identification of offensive content in social media

In today’s society there is a large number of social media users that are free to express their opinion on shared platforms. The socio-cultural differences between the people behind those accounts (in terms of ethnicity, gender, sexual orientation, religion, politics, . . . ) give rise to an important percentage of online discussions that make use of offensive language, which often affects in a negative way the psychological well-being of the victims. In order to address the problem, the endless stream of user-generated content engenders a need to find an accurate and scalable solution to detect offensive language using automated methods. This thesis explores different approaches to the offensiveness detection task focusing on five different languages: Arabic, Danish, English, Greek and Turkish. The results obtained using Support Vector Machines (SVM), Convolutional Neural Networks (CNN) and the Bidirectional Encoder Representations from Transformers (BERT) are compared, achieving state-of-the-art results with some of the methods tested. The effect of the embeddings used, the dataset size, the class imbalance percentage and the addition of sentiment features are studied and analysed, as well as the cross-lingual capabilities of pre-trained multilingual models.

http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-167381

natural language processing

artificial intelligence

machine learning

bert

text classification

Engineering and Technology

Teknik och teknologier

Identifer	oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:liu-167381
Date	January 2020
Creators	Pàmies Massip, Marc
Publisher	Linköpings universitet, Artificiell intelligens och integrerade datorsystem
Source Sets	DiVA Archive at Upsalla University
Language	English
Detected Language	English
Type	Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format	application/pdf
Rights	info:eu-repo/semantics/openAccess

Page generated in 0.0021 seconds

Multilingual identification of offensive content in social media

Description

Links & Downloads

Tags

Additional Fields