  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Comparing Different Transformer Models’ Performance for Identifying Toxic Language Online

Sundelin, Carl January 2023 (has links)
There is a growing use of the internet and, alongside it, an increase in toxic language directed at other people that can harm those it targets. The usefulness of artificial intelligence has exploded in recent years with advances in natural language processing, especially transformers. One of the first transformer models was BERT, which has spawned many variants, including some designed to be more lightweight than the original. The goal of this project was to train three kinds of transformer models, RoBERTa, ALBERT, and DistilBERT, and determine which was best at identifying toxic language online. The models were trained on a handful of existing datasets in which data was labelled as abusive, hateful, harassing, or other kinds of toxic language. These datasets were combined into a single dataset used to train and test all of the models. When tested on data from these datasets, there was very little difference in the models' overall performance. The biggest difference was training time: ALBERT took approximately 2 hours, RoBERTa around 1 hour, and DistilBERT just over half an hour. To understand how well the models worked in a real-world scenario, they were evaluated by labelling text as toxic or non-toxic on three different subreddits. Here, a larger performance difference emerged: DistilBERT labelled significantly fewer instances as toxic than the other models did. A sample of the classified data was manually annotated, and it showed that the RoBERTa and DistilBERT models still performed similarly to each other. A second evaluation was done on the Reddit data, this time requiring at least 80% certainty for a classification to count as toxic. Under this threshold, RoBERTa classified an average of 28% of instances as toxic, whereas ALBERT and DistilBERT classified an average of 14% and 11% as toxic respectively.
When the results from the RoBERTa and DistilBERT models were manually annotated, a significant improvement in the models' performance could be seen. This led to the conclusion that, of the lightweight models tested in this work, DistilBERT was the most suitable for training and classifying toxic language.
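The 80%-certainty rule described above can be sketched in a few lines. This is a minimal, hypothetical illustration (the function name and inputs are invented, not taken from the thesis): given the model's toxic-class probabilities, an instance is counted as toxic only when its probability reaches the threshold.

```python
def classify_with_threshold(toxic_probs, threshold=0.8):
    """Label each instance 1 (toxic) if its toxic-class
    probability meets the threshold, else 0 (non-toxic)."""
    return [1 if p >= threshold else 0 for p in toxic_probs]

# Example: only the instances at or above 0.8 certainty count as toxic.
probs = [0.95, 0.60, 0.81, 0.20]
print(classify_with_threshold(probs))  # [1, 0, 1, 0]
```

Raising the threshold in this way trades recall for precision, which is consistent with the drop from the models' unthresholded labelling rates to the 28%/14%/11% figures reported above.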
2

Skadligt innehåll på nätet - Toxiskt språk på TikTok (Harmful content online: Toxic language on TikTok)

Wester, Linn, Stenvall, Elin January 2024 (has links)
Toxic language on the internet, often referred to in everyday terms as cyberbullying, includes insults, threats, and offensive language. It is particularly noticeable on social media. Toxic language can be detected with the help of machine learning, among other things Natural Language Processing (NLP) techniques that automatically recognize its typical characteristics.
Previous Swedish research has investigated the presence of toxic language on social media using machine learning, but there is still a lack of research on the increasingly popular platform TikTok. This study investigates the prevalence and characteristics of toxic comments on TikTok using both a machine learning technique and manual methods, and is meant to provide a better understanding of what young people encounter in the comments on TikTok. The study applies a mixed method in a document survey of 69,895 comments. The machine learning model Hatescan was used to automatically classify the likelihood of toxic language appearing in each comment. Based on this probability, a sample of the comments was manually analysed using theory, leading to both quantitative and qualitative findings. The results showed that the prevalence of toxic language was relatively small: 0.24% of the 69,895 comments were considered toxic based on the combined automatic and manual analysis. The most common type of toxic language in the study was obscene language, the majority of which contained swear words.
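The headline prevalence figure is a simple proportion, and a quick sketch makes the arithmetic concrete. The comment count of 168 below is an assumption derived from the reported 0.24% of 69,895 comments, not a number stated in the abstract:

```python
def toxic_prevalence(n_toxic, n_total):
    """Share of comments judged toxic, as a percentage."""
    return 100.0 * n_toxic / n_total

# 0.24% of 69,895 comments corresponds to roughly 168 comments (assumed).
rate = toxic_prevalence(168, 69895)
print(round(rate, 2))  # 0.24
```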
