1

Analysing the possibilities of a needs-based house configurator

Ermolaev, Roman January 2023
A needs-based configurator is a system or tool that assists users in customizing products based on their specific needs. This thesis investigates the challenges of obtaining data for a needs-based machine learning house configurator and identifies suitable models for its implementation. The study consists of two parts: first, an analysis of how to obtain data, and second, an evaluation of three models for implementing the needs-based solution. The analysis shows that collecting house review data for a needs-based configurator is challenging due to several factors, including how the housing market operates compared to other markets, privacy concerns, and the complexity of the buying process. To address this, future studies could consider alternative data sources, adding contextual data, and creating surveys or questionnaires. The evaluation of three models (DistilBERT, a BERT model fine-tuned for Swedish, and a CNN with a Swedish word-embedding layer) shows that both BERT models perform well on the generated dataset, while the CNN underperforms. The Swedish BERT model performed best, achieving high recall and precision for k between 2 and 5. The thesis suggests that further research on needs-based configurators should focus on alternative data sources and more extensive datasets to improve performance.
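
To make the ranking metrics referenced above concrete, the following is a minimal, illustrative Python sketch of precision and recall at cutoff k for a configurator that ranks candidate houses against a user's relevant houses; the function names and example data are assumptions for illustration, not taken from the thesis.

# Minimal sketch (assumed data): precision and recall at cutoff k for a
# configurator that ranks candidate houses for a user.
def precision_at_k(ranked_ids, relevant_ids, k):
    hits = sum(1 for item in ranked_ids[:k] if item in relevant_ids)
    return hits / k

def recall_at_k(ranked_ids, relevant_ids, k):
    hits = sum(1 for item in ranked_ids[:k] if item in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

# Hypothetical ranking produced by the configurator and the houses that
# actually match the user's stated needs.
ranked = ["house_3", "house_7", "house_1", "house_9", "house_2"]
relevant = {"house_7", "house_2"}
for k in range(2, 6):   # k between 2 and 5, as in the abstract
    print(k, precision_at_k(ranked, relevant, k), recall_at_k(ranked, relevant, k))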
2

Cyberbullying Detection Using Weakly Supervised and Fully Supervised Learning

Abhishek, Abhinav 22 September 2022
No description available.
3

Automatic Analysis of Peer Feedback using Machine Learning and Explainable Artificial Intelligence / Automatisk analys av Peer feedback med hjälp av maskininlärning och förklarig artificiell Intelligence

Huang, Kevin January 2023
Peer assessment is a process where learners evaluate and provide feedback on one another's performance, which is critical to the student learning process. Earlier research has shown that it can improve student learning outcomes in various settings, including engineering education, in which collaborative teaching and learning activities are common. Peer assessment activities in computer-supported collaborative learning (CSCL) settings are becoming more and more common. When digital technologies are used for these activities, large amounts of student data (e.g., peer feedback text entries) are generated automatically. These large data sets can be analyzed (e.g., with computational methods) and used to improve our understanding of how students regulate their learning in CSCL settings, and thereby to improve their conditions for learning by, for example, providing in-time feedback. Yet there is currently a need to automate the coding of these large volumes of student text data, since it is a very time- and resource-consuming task. In this regard, recent developments in machine learning could prove beneficial. To understand how the affordances of machine learning technologies can be harnessed to classify student text data, this thesis examines the application of five models on a data set containing peer feedback from 231 students in a large technical university course. The models evaluated on the dataset are the traditional models Multilayer Perceptron (MLP) and Decision Tree, and the transformer-based models BERT, RoBERTa, and DistilBERT. To evaluate each model's performance, Cohen's κ, accuracy, and F1-score were used as metrics. The data was preprocessed by removing stopwords, and it was then examined whether removing them improved the performance of the models. The results showed that this preprocessing improved only the Decision Tree's performance, while it decreased performance for all the other models. RoBERTa was the best-performing model on the dataset across all metrics used. Explainable artificial intelligence (XAI) was applied to RoBERTa, as it was the best-performing model, and it was found that the words considered stopwords made a difference in the predictions.
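As a rough illustration of the preprocessing and evaluation described in the abstract, the sketch below removes English stopwords and computes Cohen's κ, accuracy, and weighted F1 with scikit-learn; the labels and example texts are hypothetical placeholders, and the snippet is not the thesis implementation.

# Minimal sketch (assumed labels and texts, not the thesis code): stopword
# removal and the evaluation metrics named in the abstract.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.metrics import cohen_kappa_score, accuracy_score, f1_score

def remove_stopwords(text):
    # Drop English stopwords; the thesis examines whether this preprocessing helps.
    return " ".join(w for w in text.split() if w.lower() not in ENGLISH_STOP_WORDS)

print(remove_stopwords("I think that the structure of the report could be improved"))

y_true = ["praise", "suggestion", "criticism", "suggestion"]   # hypothetical gold labels
y_pred = ["praise", "suggestion", "suggestion", "suggestion"]  # hypothetical model output
print("Cohen's kappa:", cohen_kappa_score(y_true, y_pred))
print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1 (weighted):", f1_score(y_true, y_pred, average="weighted"))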
4

Evaluating the robustness of DistilBERT to data shift in toxicity detection / Evaluering av DistilBERTs robusthet till dataskifte i en kontext av identifiering av kränkande språk

Larsen, Caroline January 2022
With the rise of social media, cyberbullying and the online spread of hate have become serious problems with devastating consequences. Mentimeter is an interactive presentation tool that enables the presentation audience to participate by typing their own answers to questions asked by the presenter. As the Mentimeter product is commonly used in schools, there is a need for a strong toxicity detection program that filters out offensive and profane language. This thesis focuses on text pre-processing and robustness to data shift within the problem domain of toxicity detection for English text. Initially, it is investigated whether lemmatization, spelling correction, and removal of stop words are suitable pre-processing strategies for toxicity detection. The pre-trained DistilBERT model was fine-tuned using an English Twitter dataset that had been pre-processed with a number of different techniques. The results indicate that none of the above-mentioned strategies has a positive impact on model performance. Lastly, modern methods are applied to train a toxicity detection model adjusted to anonymous Mentimeter user text data. For this purpose, a balanced Mentimeter dataset with 3654 data points was created and annotated by the thesis author. The best-performing model from the pre-processing experiment was iteratively fine-tuned and evaluated with an increasing amount of Mentimeter data. Based on the results, it is concluded that state-of-the-art performance can be achieved even when using relatively few data points for fine-tuning: with around 500 to 2500 training data points, F1-scores between 0.90 and 0.94 were obtained on a Mentimeter test set. These results show that it is possible to create a customized, high-performing toxicity detection program using just a small dataset.
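The fine-tuning setup described above can be sketched with the Hugging Face transformers library roughly as follows; the example texts, labels, and hyperparameters are illustrative assumptions rather than the configuration used in the thesis.

# Minimal sketch (assumed data and hyperparameters): fine-tuning DistilBERT
# as a binary toxicity classifier on a small labelled dataset.
import torch
from torch.utils.data import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

texts = ["you are awful", "great presentation today"]   # hypothetical examples
labels = [1, 0]                                         # 1 = toxic, 0 = non-toxic

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

class ToxicityDataset(Dataset):
    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

args = TrainingArguments(output_dir="toxicity-model",
                         num_train_epochs=3,
                         per_device_train_batch_size=8)
trainer = Trainer(model=model, args=args,
                  train_dataset=ToxicityDataset(texts, labels))
trainer.train()   # with more labelled data, F1 can then be measured on a held-out test set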
5

Comparing Different Transformer Models’ Performance for Identifying Toxic Language Online

Sundelin, Carl January 2023
Internet use is growing, and alongside it there has been an increase in the use of toxic language towards other people, which can be harmful to those it targets. The usefulness of artificial intelligence has exploded in recent years with the development of natural language processing, especially with the use of transformers. One of the first transformer models was BERT, which has spawned many variants, including ones that aim to be more lightweight than the original. The goal of this project was to train three different kinds of transformer models, RoBERTa, ALBERT, and DistilBERT, and find out which one was best at identifying toxic language online. The models were trained on a handful of existing datasets with data labelled as abusive, hateful, harassing, or other kinds of toxic language. These datasets were combined into a single dataset that was used to train and test all of the models. When tested against data from these datasets, there was very little difference in the overall performance of the models. The biggest difference was training time, with ALBERT taking approximately 2 hours, RoBERTa around 1 hour, and DistilBERT just over half an hour. To understand how well the models worked in a real-world scenario, they were evaluated by labelling text as toxic or non-toxic on three different subreddits. Here, a larger difference in performance emerged: DistilBERT labelled significantly fewer instances as toxic than the other models. A sample of the classified data was manually annotated, which showed that the RoBERTa and DistilBERT models still performed similarly to each other. A second evaluation was done on the Reddit data, requiring a threshold of 80% certainty for a classification to be considered toxic. This led to an average of 28% of instances being classified as toxic by RoBERTa, whereas ALBERT and DistilBERT classified an average of 14% and 11% as toxic, respectively. When the results from the RoBERTa and DistilBERT models were manually annotated, a significant improvement could be seen in the models' performance. This led to the conclusion that DistilBERT was the most suitable of the lightweight models tested in this work for training on and classifying toxic language.
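
The 80% certainty threshold used in the second evaluation can be illustrated with a small sketch; the prediction format and labels are assumed for illustration and do not reproduce the exact setup of the thesis.

# Minimal sketch (assumed prediction format): only count a comment as toxic
# when the classifier's confidence is at least 0.8.
THRESHOLD = 0.80

def is_toxic(prediction, threshold=THRESHOLD):
    # prediction is assumed to look like {"label": "toxic", "score": 0.93}
    return prediction["label"] == "toxic" and prediction["score"] >= threshold

# Hypothetical classifier outputs for three Reddit comments.
predictions = [
    {"label": "toxic", "score": 0.93},       # flagged: above the threshold
    {"label": "toxic", "score": 0.55},       # would be flagged by plain argmax, but not at 0.8
    {"label": "non-toxic", "score": 0.97},
]
flagged = [p for p in predictions if is_toxic(p)]
print(f"{len(flagged)} of {len(predictions)} comments classified as toxic at the 0.8 threshold")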
6

Distillation or loss of information? : The effects of distillation on model redundancy

Sventickaite, Eva Elzbieta January 2022
The necessity of billions of parameters in large language models has lately been questioned, as there are still unanswered questions regarding how information is captured in the networks. It could be argued that without this knowledge, there may be a tendency to overparameterize the models. In turn, the investigation of model redundancy, and of the methods which minimize it, is important to both academic and commercial entities. As such, the two main goals of this project were, firstly, to discover whether one such method, namely distillation, reduces the redundancy of language models without losing linguistic capabilities and, secondly, to determine whether model architecture or multilingualism has a bigger effect on said reduction. To do so, ten models, comprising monolingual and multilingual models and their distilled counterparts, were evaluated layer-wise and neuron-wise. At the layer level, the layer correlation of all models was evaluated by visualising heatmaps and calculating the average per-layer similarity. To establish the neuron-level redundancy, a classifier probe was applied to the model neurons, both for the whole model and for a reduced set obtained with a clustering algorithm, and its performance was assessed on two tasks: part-of-speech (POS) and dependency (DEP) tagging. To determine the effects of distillation on the multilingualism of the models, cross-lingual transfer was investigated for the same tasks, comparing the results of the classifier applied to multilingual models and one distilled variant in ten languages, nine Indo-European and one non-Indo-European. The results show that distillation reduces the number of redundant neurons at the cost of losing some linguistic knowledge. In addition, the redundancy in the distilled models is mainly attributed to the architecture on which they are based, with the multilingualism aspect having only a mild impact. Finally, the cross-lingual transfer experiments show that after distillation the model loses the ability to capture some languages more than others. In turn, the outcome of the project suggests that distillation could be applied to reduce the size of billion-parameter models and is a promising method for reducing the redundancy in current language models.
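
As a rough sketch of the layer-level analysis described above, the following computes an average cosine-similarity matrix between the layers of one distilled model from mean-pooled hidden states; the model choice, pooling, and similarity measure are illustrative assumptions, not necessarily the procedure used in the thesis.

# Minimal sketch (assumed procedure): pairwise layer similarity for one
# distilled model, the kind of matrix that would be visualised as a heatmap.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "distilbert-base-uncased"   # an example distilled model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)

sentences = ["The cat sat on the mat.", "Distillation compresses large models."]
inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    hidden_states = model(**inputs).hidden_states   # embeddings plus one tensor per layer

# Mean-pool tokens so each layer yields one vector per sentence
# (padding tokens are included here for simplicity).
pooled = [h.mean(dim=1) for h in hidden_states]      # each: (batch, hidden_dim)

n_layers = len(pooled)
similarity = torch.zeros(n_layers, n_layers)
for i in range(n_layers):
    for j in range(n_layers):
        # Average cosine similarity between the two layers' sentence vectors.
        similarity[i, j] = torch.nn.functional.cosine_similarity(
            pooled[i], pooled[j], dim=-1).mean()

print(similarity)   # high off-diagonal values would suggest highly correlated, redundant layers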
