In-Domain and Cross-Domain Classification of Patronizing and Condescending Language in Social Media and News Texts: A Study in Implicitly Aggressive Language Detection and Methods

Ortiz, Flor (January 2022)
The field of aggressive language detection is developing quickly in Natural Language Processing. However, most work in this field centers on explicitly aggressive language, while work exploring forms of implicitly aggressive language remains far less prolific. Moreover, the broader category of implicitly aggressive language encompasses many subcategories, for example, condescending and patronizing language. This thesis focuses on the relatively new field of patronizing and condescending language (PCL) detection, specifically on expanding beyond in-domain tasks that focus on either news or social media texts. Cross-domain patronizing and condescending language detection is, as of today, not a widely explored subfield of Natural Language Processing.

In this project, we aim to answer three main research questions. The first is to what extent models trained to detect patronizing and condescending language in one domain, in this case social media texts and news publications, generalize to other domains. Secondly, we aim to make advances toward a baseline for balanced PCL datasets and to compare performance across label distribution ratios. Thirdly, we aim to address the impact of a common feature of patronizing and condescending language datasets: the significant imbalance between negative and positive labels in the binary classification task. To this end, we ask to what extent the proportion between labels affects the in-domain PCL classification task.

We find that the best-performing model for the in-domain classification task is a Gradient Boosting classifier trained on an imbalanced dataset harvested from Reddit, which includes both the post and the reply, with a 1:2 ratio between positive and negative labels. In the cross-domain task, we find that the best-performing model is an SVM trained on the balanced news dataset and evaluated on the balanced Reddit post-and-reply dataset. In the latter study, we show that it is possible to achieve competitive results with classical machine learning models on a nuanced, context-dependent binary classification task.
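As a rough illustration of the cross-domain setup the abstract describes, the sketch below trains a classical classifier on one domain (news) and evaluates it on another (Reddit post and reply). It assumes scikit-learn; the TF-IDF features, the LinearSVC variant, and the toy placeholder data are all illustrative assumptions, since the abstract does not specify the thesis's feature pipeline or release its datasets.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.metrics import classification_report

    # Toy placeholder data standing in for the balanced news (training) and
    # Reddit post+reply (evaluation) datasets; labels: 1 = PCL, 0 = not PCL.
    news_texts = [
        "These poor souls simply cannot manage without our guidance.",
        "The council approved the new housing budget on Tuesday.",
    ]
    news_labels = [1, 0]
    reddit_texts = [
        "Bless your heart, you tried your best, sweetie.",
        "The patch notes list the bug fixes for this release.",
    ]
    reddit_labels = [1, 0]

    # Fit the vectorizer on the training (news) domain only, so the
    # evaluation domain (Reddit) stays genuinely unseen at training time.
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    X_train = vectorizer.fit_transform(news_texts)
    X_test = vectorizer.transform(reddit_texts)

    # An SVM, matching the model family the abstract reports as the
    # best performer in the cross-domain task.
    clf = LinearSVC()
    clf.fit(X_train, news_labels)

    print(classification_report(reddit_labels, clf.predict(X_test)))

Fitting the vectorizer on the training domain alone is the key design choice here: the evaluation domain contributes nothing to the feature space, which is precisely what makes the measurement a test of cross-domain generalization.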
