
In-Domain and Cross-Domain Classification of Patronizing and Condescending Language in Social Media and News Texts: A Study in Implicitly Aggressive Language Detection and Methods

The field of aggressive language detection is developing quickly in Natural Language Processing. However, most work in this field centers on explicitly aggressive language, while forms of implicitly aggressive language remain far less explored. Moreover, the broader category of implicitly aggressive language encompasses many subcategories, such as condescending and patronizing language. This thesis focuses on the relatively new field of patronizing and condescending language (PCL) detection, specifically on expanding beyond in-domain tasks that focus on either news or social media texts. Cross-domain PCL detection is, as of today, not a widely explored subfield of Natural Language Processing. In this project, we aim to answer three main research questions. First, to what extent do models trained to detect patronizing and condescending language in one domain, in this case social media texts or news publications, generalize to other domains? Second, we aim to make advances toward a baseline for balanced PCL datasets and to compare performance across label distribution ratios. Third, we address the impact of a common feature of PCL datasets: the significant imbalance between negative and positive labels in the binary classification task. To this end, we ask to what extent the proportion between labels affects the in-domain PCL classification task. We find that the best-performing model for the in-domain classification task is a Gradient Boosting classifier trained on an imbalanced dataset harvested from Reddit, which includes both the post and the reply, with a 1:2 ratio between positive and negative labels.
In the cross-domain task, we find that the best-performing model is an SVM trained on the balanced news dataset and evaluated on the balanced Reddit post-and-reply dataset. In the latter study, we show that it is possible to achieve competitive results with classical machine learning models on a nuanced, context-dependent binary classification task.
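The classical-model setup described above can be illustrated with a minimal sketch. This is not the thesis's actual code: the TF-IDF features, scikit-learn `LinearSVC`, and the toy example texts and labels are all assumptions chosen to show the general shape of a cross-domain experiment (train on one domain, evaluate on another).

```python
# Minimal sketch of a classical binary PCL classifier, assuming
# TF-IDF features and scikit-learn's linear SVM. The example texts
# and labels are invented for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical "news domain" training data:
# label 1 = patronizing/condescending, label 0 = neutral
train_texts = [
    "These poor souls just need someone smarter to guide them.",
    "Bless their hearts, they try so hard despite their limitations.",
    "The city council approved the new budget on Tuesday.",
    "Residents discussed the proposed transit changes at the meeting.",
]
train_labels = [1, 1, 0, 0]

# TF-IDF unigrams and bigrams feeding a linear SVM, a common
# classical baseline for text classification
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(train_texts, train_labels)

# A cross-domain evaluation would predict on held-out texts from a
# different domain (e.g. Reddit posts and replies) and score with
# precision/recall/F1 on the positive (PCL) class
preds = model.predict(["They try so hard, bless them."])
print(preds[0])
```

In a real experiment the evaluation set would come from the other domain's annotated corpus, and class-imbalance handling (e.g. resampling to a fixed positive-to-negative ratio) would be applied before training.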

Identifier: oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-488274
Date: January 2022
Creators: Ortiz, Flor
Publisher: Uppsala universitet, Institutionen för lingvistik och filologi
Source Sets: DiVA Archive at Upsalla University
Language: English
Detected Language: English
Type: Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format: application/pdf
Rights: info:eu-repo/semantics/openAccess
Relation: Zoon. Suppl., 0346-9123
