Global ETD Search

1	Mitigating Unintended Bias in Toxic Comment Detection using Entropy-based Attention Regularization / Att mildra oavsiktlig bias i detektering av giftiga kommentarer med hjälp av entropibaserad uppmärksamhetsreglering
Camerota, Fabio. January 2023
The proliferation of hate speech is a growing challenge for social media platforms, as toxic online comments can also have dangerous consequences in real life. There is a need for tools that can automatically and reliably detect hateful comments, and deep learning models have proven effective for this task. However, these models have been shown to exhibit unintended bias against some categories of people. Specifically, they may classify comments that reference certain frequently attacked identities (such as gay, black, or Muslim) as toxic even when the comments themselves are not toxic (e.g. "I am Muslim"). To address this bias, previous authors introduced an Entropy-based Attention Regularization (EAR) method which, when applied to BERT, has been shown to reduce its unintended bias. In this study, the EAR method was applied not only to BERT but also to XLNet. The investigation compared four models: BERT, BERT+EAR, XLNet, and XLNet+EAR. Several experiments were performed, and the associated code is available on GitHub. The classification performance of these models was measured using the F1-score on a public data set containing comments collected from Wikipedia forums, while their unintended bias was evaluated using AUC-based metrics on a synthetic data set consisting of 50 identities grouped into four macro categories: Gender & Sexual orientation, Ethnicity, Religion, and Age & Physical disability. The AUC-based metrics showed that EAR performs well on both BERT and XLNet, reducing their unintended bias towards the 50 identity terms considered in the experiments. Conversely, the F1-score results showed that EAR has a negative impact on the classification performance of both BERT and XLNet.
Keywords: XLNet; BERT; Toxic Comment Classification; Entropy-based Attention Regularization
Subjects: Computer Sciences; Computer Engineering
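The abstract above does not spell out how EAR is computed. As a rough illustration of the idea the name suggests, the sketch below measures the entropy of attention distributions and subtracts it from a classification loss, so that attention concentrated on a single token is penalized. This is a minimal sketch under stated assumptions, not the thesis's code: the function names, the `alpha` strength, and the toy attention matrices are all hypothetical.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def mean_attention_entropy(attention):
    """Average entropy over the rows of one attention matrix.

    attention[i][j] is the weight token i places on token j; each row
    is assumed to sum to 1 (for example, the output of a softmax).
    """
    return sum(attention_entropy(row) for row in attention) / len(attention)


def regularized_loss(classification_loss, attention, alpha=0.01):
    """Classification loss minus alpha times the mean attention entropy.

    Subtracting the entropy term rewards spread-out attention, so
    minimizing the total discourages focusing on a single token.
    alpha is a hypothetical regularization strength.
    """
    return classification_loss - alpha * mean_attention_entropy(attention)


# Toy data (hypothetical): four tokens, one attention row per token.
peaked = [[0.97, 0.01, 0.01, 0.01]] * 4   # attention on a single token
uniform = [[0.25, 0.25, 0.25, 0.25]] * 4  # attention spread evenly

print(mean_attention_entropy(peaked))
print(mean_attention_entropy(uniform))
print(regularized_loss(1.0, peaked), regularized_loss(1.0, uniform))
```

With the same classification loss, the uniform matrix yields the lower total, which is the behavior such a regularizer is meant to encourage. In a real model the attention matrices would come from the transformer's layers and the term would be averaged across layers and heads; that wiring is left out here.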
2	Evaluation of Approaches for Representation and Sentiment of Customer Reviews / Utvärdering av tillvägagångssätt för representation och uppfattning om kundrecensioner
Giorgis, Stavros. January 2021
Sentiment classification of customer reviews is a real-world application for many companies that offer text analytics and opinion extraction on reviews in domains such as consumer electronics, hotels, restaurants, and car rental agencies. Recent progress in Natural Language Processing has produced many new state-of-the-art approaches for representing the meaning of sentences, phrases, and words in text using vector space models, so-called embeddings. In this thesis, we evaluated the most current and most popular text representation techniques against traditional methods as a baseline. The evaluation dataset consists of customer reviews of different lengths from different domains, used by a text analysis company. By exploring training datasets, we evaluated which datasets were the most suitable for this specific task. Furthermore, we explored different techniques that could be used to alter a language model's decisions without retraining it. Finally, all the methods were evaluated for their time performance and resource requirements, to present an overall experimental assessment that could help the company decide which technique is the most appropriate to replace its system in a production environment.
Swedish abstract (translated): Classifying attitude and sentiment in customer reviews is an application of practical value to several companies in the market-analysis industry. Current research in language technology has established vector spaces, so-called embeddings, as the standard representation for words, phrases, and utterances. This thesis evaluates the most successful recent text representation models against more traditional vector spaces. The evaluation compares automatic analyses with human judgments on customer reviews of varying length from different domains, provided by a text analysis company. Within the study, different test sets were compared, as were different ways of modifying a language model's classification without retraining. All models were also compared with respect to the resources and time required for training, to help the client decide which technique is the most suitable development path for its deployed system.
Keywords: machine learning; nlp; text analytics; sentiment analysis; transformers; tfidf; bow; fasttext; word2vec; bert; xlnet; roberta
Subjects: Computer and Information Sciences
3	Klasifikace vztahů mezi pojmenovanými entitami v textu / Classification of Relations between Named Entities in Text
Ondřej, Karel. January 2020
This master's thesis deals with the extraction of relationships between named entities in text. The theoretical part discusses how natural language is represented for machine processing. It then defines two subtasks of relationship extraction, namely named entity recognition and classification of the relationships between entities, and summarizes state-of-the-art solutions for both. In the practical part, a system for automatic extraction of relationships between named entities from downloaded pages is designed. The classification of relationships between entities is based on pre-trained transformers. Four pre-trained transformers are compared: BERT, XLNet, RoBERTa, and ALBERT.
4	Filtrování spamových zpráv pomocí metod umělé inteligence / Email spam filtering using artificial intelligence
Safonov, Yehor. January 2020
In the modern world, email has established itself as the most widely used technology for exchanging messages between users. Its popularity and rapid growth rest on three pillars: free availability, efficiency, and intuitive exchange of information. All of them constitute a significant advantage in the provision of communication services. On the other hand, the growing popularity of email technologies poses considerable security risks and turns them into a universal tool for spreading unsolicited content. Potential attacks may be aimed at either specific endpoints or whole computer infrastructures. Despite achieving high accuracy in spam filtering, traditional techniques often fail to keep up with the rapid growth and evolution of spam techniques. These approaches suffer from overfitting, convergence to poor local minima, inefficiency in processing high-dimensional data, and long-term maintainability issues. One of the main goals of this master's thesis is to develop and train deep neural networks using the latest machine learning techniques to solve the text-based spam classification problem, which belongs to the Natural Language Processing (NLP) domain. From a theoretical point of view, the thesis focuses on email communication with an emphasis on spam filtering. The following parts turn to machine learning and artificial neural networks, discussing their principles of operation and basic properties. The theoretical part also covers possible ways of applying the described techniques to text analysis and NLP tasks.
One of the key aspects of the study is a detailed comparison of current machine learning methods, their specifics, and their accuracy when applied to spam filtering. The practical part begins with the processing of the email dataset. This phase was divided into five stages in order to preserve key features of the raw data and increase the final quality of the dataset. The resulting dataset was used for training, testing, and validation of the chosen types of deep neural networks. The selected models, ULMFiT, BERT, and XLNet, were successfully implemented. The thesis describes the final data adaptation, the training of the neural networks, and their testing and validation. At the end of the work, the implemented models are compared using a confusion matrix, and possible improvements and a concise conclusion are outlined.
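The abstract says the models are compared using a confusion matrix but gives no numbers. To make that comparison step concrete, the sketch below counts the four confusion-matrix cells for a binary spam classifier and derives precision, recall, and F1 from them. It is a minimal sketch, not the thesis's evaluation code: the label convention (1 for spam, 0 for legitimate mail) and the toy labels are hypothetical.

```python
def confusion_matrix(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means spam."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Derive precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy labels (hypothetical): 1 = spam, 0 = legitimate mail.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

tp, fp, fn, tn = confusion_matrix(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, fp, fn, tn)
print(precision, recall, f1)
```

For spam filtering, the false-positive cell usually deserves the closest look, since a legitimate message sent to the spam folder tends to cost more than a missed spam message. Running the same counts for each model on a shared test set gives a like-for-like comparison.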
