Global ETD Search

91	Maskininlärning för automatisk extrahering av citat från recensioner : Med användning av BERT, Inter-Sentence Transformer och artificiella neuronnätverk / Machine learning for automatic extraction of quotes from reviews : Using BERT, Inter-Sentence Transformer, and artificial neural networks Hällgren, Clara, Kristiansson, Alexander January 2021 Manually selecting one or more sentences from a film review to use as a quote can be a time-consuming task. This report evaluates supervised machine learning models in order to build a prototype that can automatically select suitable quotes from reviews. Based on the results of a literature study, two models were chosen for implementation and evaluation on data consisting of film reviews and their associated manually selected quotes. Of the two implemented models, BERT with Inter-Sentence Transformer and BERT with an artificial neural network, the latter showed marginally better results. The models were evaluated with ROUGE and compared with the top results of earlier studies in automatic text summarization. The conclusion is that the evaluated models do not perform well enough in the problem domain to justify deployment without further development work. However, the results show that the evaluated approaches have the potential to partly replace manual quote selection in the future. / Choosing a number of sentences from a movie review to use as a quote can be time-consuming if done manually. This thesis evaluates supervised machine learning models to create a prototype that can automatically choose such quotes. Based on a literature study, two models were chosen for implementation and evaluation on data consisting of movie reviews and their corresponding manually chosen quotes. Of the thesis’s two implemented models, BERT with Inter-Sentence Transformer and BERT with an artificial neural network, the latter showed marginally better results. 
The models were evaluated with ROUGE and compared with state-of-the-art models for automatic text summarization. The conclusion is that the evaluated models do not perform well enough on the problem to justify full deployment without further development effort. However, the results show that the evaluated methods have the potential to partially replace manual labour when choosing quotes. Machine Learning BERT ATS Automatic Text Summary Extractive Summary Inter-Sentence Transformer Artificial Neural Network. Computer Engineering
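The abstract above reports that the models were evaluated with ROUGE. As a minimal sketch of what such a score measures, assuming whitespace tokenization and a single reference quote (neither is stated in the abstract), ROUGE-N can be computed from clipped n-gram overlap. The function name `rouge_n` and the sample strings are illustrative, not the thesis's code:

```python
from collections import Counter


def rouge_n(candidate: str, reference: str, n: int = 1) -> dict:
    """Return ROUGE-N precision, recall and F1 for one candidate/reference pair."""

    def ngram_counts(text: str) -> Counter:
        # Assumption: lowercase whitespace tokenization, no stemming.
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngram_counts(candidate), ngram_counts(reference)
    # Counter intersection keeps the minimum count per n-gram (clipped overlap).
    overlap = sum((cand & ref).values())
    cand_total, ref_total = sum(cand.values()), sum(ref.values())
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Hypothetical model output compared with a hypothetical hand-picked quote.
    print(rouge_n("a gripping and moving film", "a gripping film"))
```

Published ROUGE implementations usually add stemming and support for multiple references, so scores from this sketch are not directly comparable with reported results.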
92	Evaluation of Approaches for Representation and Sentiment of Customer Reviews / Utvärdering av tillvägagångssätt för representation och uppfattning om kundrecensioner Giorgis, Stavros January 2021 Classification of sentiment in customer reviews is a real-world application for many companies that offer text analytics and opinion extraction on customer reviews in domains such as consumer electronics, hotels, restaurants, and car rental agencies. Recent progress in Natural Language Processing has produced many new state-of-the-art approaches for representing the meaning of sentences, phrases, and words in text using vector space models, so-called embeddings. In this thesis, we evaluated the most current and most popular text representation techniques against traditional methods as a baseline. The evaluation dataset consists of customer reviews of different lengths from different domains, used by a text analysis company. By exploring the training datasets, we evaluated which were the most suitable for this specific task. Furthermore, we explored different techniques that could be used to alter a language model’s decisions without retraining it. Finally, all the methods were evaluated for their time performance and resource requirements to present an overall experimental assessment that could help the company decide which technique is the most appropriate to replace its system in a production environment. / Classifying attitude and emotional tone in customer reviews is an application of practical value to several companies in the market analysis industry. Current research in language technology has established vector spaces, so-called embeddings, as the standard representation for words, phrases, and utterances. This thesis evaluates the most successful recent text representation models against more traditional vector spaces. 
The evaluation compares automatic analyses with human judgments for customer reviews of varying length from different domains, provided by a text analysis company. Within the study, different test sets were compared, as were different ways of modifying a language model’s classification without retraining. All models were also compared with respect to the resources and time required for training, to help the client decide which technique offers the most suitable development path for its production system. machine learning nlp text analytics sentiment analysis transformers tfidf bow fasttext word2vec bert xlnet roberta Computer and Information Sciences
93	Speech Classification using Acoustic embedding and Large Language Models Applied on Alzheimer’s Disease Prediction Task Kheirkhahzadeh, Maryam January 2023 Alzheimer’s disease is a neurodegenerative disease that leads to dementia. It can begin silently in the early stages and continue over the years to a severe and incurable phase. Language impairments often appear as one of the early symptoms and can eventually lead to complete mutism in the advanced stages of the disease. Speech- and language-based analysis is therefore a promising, non-invasive method for detecting Alzheimer’s disease in its early stages. Our goal is to use machine learning to compare the amount of information in linguistic representations from large language models and in pre-trained acoustic representations. As far as we know, this is the first time GPT-3 and wav2vec2.0 have been used together for classification of Alzheimer’s disease. In addition, we used a combination of two large language models, GPT-3 and BERT, for the first time on this specific task. By evaluating our method on two datasets, in English and Swedish, we can also shed light on the differences between these two languages. / Alzheimer’s disease is a neurodegenerative disease that leads to dementia. It can begin silently in the early stages and progress over the years to a severe and incurable stage. Language impairment often emerges as one of the early symptoms and can eventually progress to complete mutism in advanced stages of the disease. As a result, speech processing is a promising and non-invasive approach for detecting Alzheimer’s disease in its early stages. Our objective is to compare the informativeness of linguistic embeddings derived from large language models and pre-trained acoustic embeddings extracted using wav2vec2.0, in a machine learning-based approach. 
To the best of our knowledge, this is the first time that fusing GPT-3 text embeddings and wav2vec2.0 acoustic embeddings has been explored for Alzheimer’s disease classification. In addition, we utilized a combination of two large language models, GPT-3 and BERT, for the first time on this specific task. By evaluating our method on two datasets, in English and Swedish, we can also highlight the differences between these two languages. Speech classification Alzheimer’s disease detection GPT-3 BERT Text embedding Dementia wav2vec2.0 Computer and Information Sciences
94	Performance Benchmarking and Cost Analysis of Machine Learning Techniques : An Investigation into Traditional and State-Of-The-Art Models in Business Operations / Prestandajämförelse och kostnadsanalys av maskininlärningstekniker : en undersökning av traditionella och toppmoderna modeller inom affärsverksamhet Lundgren, Jacob, Taheri, Sam January 2023 As society becomes increasingly data-driven, the use of AI and machine learning is revolutionizing how companies operate and develop. This study explores the use of AI, Big Data, and Natural Language Processing (NLP) to improve business operations and intelligence in companies. The main purpose of this thesis is to examine whether the host organization’s current classification process can be maintained at reduced operating costs, in particular lower cloud GPU costs. This has the potential to improve the classification method, improve the product the company offers its customers through increased classification accuracy, and strengthen its value proposition. Furthermore, three approaches are evaluated against each other, and the implementations show the development within the field. The models compared in this study include traditional machine learning methods such as Support Vector Machine (SVM) and Logistic Regression, together with state-of-the-art transformer models such as BERT, both Pre-Trained and Fine-Tuned. The paper shows that there is a trade-off between performance and cost, which illustrates the problem that many companies, such as Valu8, face when evaluating which approach to implement. This trade-off is then discussed and analyzed in more detail to explore possible compromises from each perspective, in an attempt to find a balanced solution that combines performance efficiency and cost-effectiveness. / As society becomes more data-driven, Artificial Intelligence (AI) and Machine Learning are revolutionizing how companies operate and evolve. 
This study explores the use of AI, Big Data, and Natural Language Processing (NLP) in improving business operations and intelligence in enterprises. The primary objective of this thesis is to examine whether the current classification process at the host company can be maintained with reduced operating costs, specifically lower cloud GPU costs. This can improve the classification method, enhance the product the company offers its customers through increased classification accuracy, and strengthen its value proposition. Furthermore, three approaches are evaluated against each other, and the implementations showcase the evolution within the field. The models compared in this study include traditional machine learning methods such as Support Vector Machine (SVM) and Logistic Regression, alongside state-of-the-art transformer models like BERT, both Pre-Trained and Fine-Tuned. The paper shows a trade-off between performance and cost, illustrating the problem that many companies like Valu8 face when evaluating which approach to implement. This trade-off is discussed and analyzed in further detail to explore possible compromises from each perspective and to strike a balanced solution that combines performance efficiency and cost-effectiveness. Artificial Intelligence (AI) Machine Learning Big Data Natural Language Processing (NLP) Pre-Trained BERT Fine-Tuned BERT TF-IDF Logistic Regression Support Vector Machine (SVM) Cloud GPU Operating Costs Performance Efficiency Business Intelligence Computer and Information Sciences
95	L’utilité des médias sociaux pour la surveillance épidémiologique : une étude de cas de Twitter pour la surveillance de la maladie de Lyme Laison, Elda Kokoe Elolo 12 1900 Lyme disease is the most widespread tick-borne disease in the Northern Hemisphere. Surveillance of human cases of Lyme disease relies on passive reporting of cases by health professionals, which has several flaws that make the surveillance incomplete. With the growing use of the internet and social networks, researchers have proposed using social network data as a surveillance tool, an approach called infodemiology. This approach has been tested successfully in several studies. The aim of this thesis is to build a database from self-reported tweets, classified and labeled as potential Lyme cases or not using transformer-based classifier models such as BERTweet, DistilBERT, and ALBERT. To this end, a total of 20,000 English-language tweets related to Lyme disease, without geographical restriction, from 2010 to 2022 were collected through the Twitter API. We cleaned the database. The cleaned data were then classified in binary fashion as potential cases of Lyme disease or not, using the symptoms of the disease as keywords. Using transformer-based classification models, automatic classification of the data was evaluated first without, and then with, emojis converted into words. We found that transformer-based classification models perform better than classical classification models such as TF-IDF, Naive Bayes, and others; in particular, the BERTweet model outperformed all evaluated models, with an average F1 score of 89.3%, a precision of 97%, an accuracy of 90%, and a recall of 82.6%. 
Incorporating emojis into our database also improved the performance of all models by at least 5%, and BERTweet again performed best, with an increase in all evaluated metrics. English-language tweets come mostly from the United States; to counter this predominance, future work should collect tweets related to Lyme disease in all languages, especially because the European countries where Lyme disease is emerging are not English-speaking. / Lyme disease is the most common tick-borne disease in the Northern Hemisphere. The surveillance system for human cases of Lyme disease has several flaws that make the surveillance incomplete. With the extensive use of the internet and social networks, researchers have proposed using data from social networks as a surveillance tool, an approach called infodemiology. This approach has been successfully tested in several studies. The aim of this thesis is to build a database of self-reported tweets and to classify each tweet as a potential case of Lyme disease or not using BERT transformer-based classifier models. A total of 20,000 English tweets related to Lyme disease, without geographical restriction, from 2010 to 2022 were collected with the Twitter API. These data were then cleaned and manually labeled, as a binary classification, as potential Lyme cases or not, using the symptoms of Lyme disease as keywords; emojis were also converted into words and integrated. Using classification models based on BERT transformers, the labeling of data as disease-related or non-disease-related is evaluated first without, and then with, emojis. Transformer-based classification models performed better than conventional classification models; in particular, the BERTweet model outperformed all evaluated models with an average F1 score of 89.3%, precision of 97%, accuracy of 90%, and recall of 82.6%. 
Also, incorporating emojis into our database improved the performance of all models by at least 5%, and BERTweet once again performed best, with an increase in all metrics evaluated. Tweets in English are mostly from the United States; to counteract this predominance, future work should collect tweets related to Lyme disease in all languages, especially because the European countries where Lyme disease is emerging are not English-speaking countries. Lyme disease Social networks Twitter Machine learning Infodemiology BERT Emojis Classification models
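The metrics reported in the abstract above are linked by the standard definition of F1 as the harmonic mean of precision and recall. A short sketch, assuming that standard definition (the helper name `f1_from_precision_recall` is illustrative), shows how the reported precision and recall relate to the reported F1:

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Precision and recall as reported for BERTweet in the abstract.
    f1 = f1_from_precision_recall(0.97, 0.826)
    print(f"F1 = {f1:.4f}")
```

With these inputs the harmonic mean comes to roughly 0.892, close to the reported average of 89.3%. A small gap is expected if the reported F1 is an average over several runs or folds rather than the harmonic mean of the averaged precision and recall; the abstract does not say which.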
96	[en] ON THE PROCESSING OF COURSE SURVEY COMMENTS IN HIGHER EDUCATION INSTITUTIONS / [pt] PROCESSAMENTO DE COMENTÁRIOS DE PESQUISAS DE CURSOS EM INSTITUIÇÕES DE ENSINO SUPERIOR HAYDÉE GUILLOT JIMÉNEZ 10 January 2022 [pt] The systematic evaluation of a Higher Education Institution (HEI) gives its administration valuable feedback on various aspects of academic life, such as the institution’s reputation and the individual performance of the faculty. In particular, student surveys are a first-hand source of information that helps assess teacher performance and course adequacy. The main goals of this thesis are to create and evaluate sentiment analysis models for student comments and strategies for summarizing student comments. The thesis first describes two approaches to classifying the polarity of student comments, that is, whether they are positive, negative, or neutral. The first approach relies on a manually created dictionary that lists the terms representing the sentiment to be detected in student comments. The second approach adopts a language representation model, which does not depend on a manually created dictionary but requires a manually annotated test set. The results indicated that the first approach outperformed a baseline tool and that the second approach performed very well, even when the set of manually annotated comments is small. The thesis then explores several strategies for summarizing a set of comments with similar interpretations. The challenge lies in summarizing a set of short sentences, written by different people, that may convey repeated ideas. 
As strategies, the thesis tested Market Basket Analysis, Topic Models, Text Similarity, TextRank, and Entailment, using human inspection to evaluate the results, since traditional text summarization metrics proved inadequate. The results suggest that clustering combined with the centroid-based strategy achieves the best results. / [en] The systematic evaluation of a Higher Education Institution (HEI) provides its administration with valuable feedback about several aspects of academic life, such as the reputation of the institution and the individual performance of teachers. In particular, student surveys are a first-hand source of information that helps assess teacher performance and course adequacy. The primary goals of this thesis are to create and evaluate sentiment analysis models of students’ comments, and strategies to summarize students’ comments. The thesis first describes two approaches to classify the polarity of students’ comments, that is, whether they are positive, negative, or neutral. The first approach depends on a manually created dictionary that lists terms that represent the sentiment to be detected in the students’ comments. The second approach adopts a language representation model, which does not depend on a manually created dictionary, but requires some manually annotated test set. The results indicated that the first approach outperformed a baseline tool, and that the second approach achieved very good performance, even when the set of manually annotated comments is small. The thesis then explores several strategies to summarize a set of comments with similar interpretations. The challenge lies in summarizing a set of short sentences, written by different people, which may convey repeated ideas. 
As strategies, the thesis tested Market Basket Analysis, Topic Models, Text Similarity, TextRank, and Entailment, adopting a human inspection method to evaluate the results obtained, since traditional text summarization metrics proved inadequate. The results suggest that clustering combined with the centroid-based strategy achieves the best results. [en] SIMILARITY [en] TEXTRANK [en] ENTAILMENT [en] COMMENT SUMMARIZATION [en] EDUCATIONAL DATA MINING [en] BERT [en] SENTIMENT ANALYSIS [en] DATA VISUALIZATION
97	Recommendation of Text Properties for Short Texts with the Use of Machine Learning : A Comparative Study of State-of-the-Art Techniques Including BERT and GPT-2 / Rekommendation av textegenskaper för korta texter med hjälp av maskininlärning : En jämförande studie av de senaste teknikerna inklusive BERT och GPT-2 Zapata, Luciano January 2023 Text mining has gained considerable attention due to the extensive usage of electronic documents. The significant increase in electronic document usage has created a necessity to process and analyze them effectively. Rule-based systems have traditionally been used to evaluate short pieces of text, but they have limitations, including the need for significant manual effort to create and maintain rules and a high risk of complex bugs. As a result, text classification has emerged as a promising solution for extracting meaning from short texts, which are defined as texts limited by a specific character count or word count. This study investigates the feasibility and effectiveness of text classification in classifying short pieces of text according to their appropriate text properties, based on users’ intentions in the text. The study focuses on comparing two transformer models, GPT-2 and BERT, in their ability to classify short texts. While other studies have compared these models in intention classification of text, this study is unique in its examination of their performance on short pieces of text in this specific context. This study uses user-labelled data to fine-tune the models, which are then tested on a test dataset from the same source. The comparative analysis of the models indicates that BERT generally outperforms GPT-2 in classifying users’ intentions based on the appropriate text properties, with an F1-score of 0.68 compared to GPT-2’s F1-score of 0.51. However, GPT-2 performed better on certain closely related classes, suggesting that both models capture interesting features of these classes. 
Furthermore, the results demonstrated that some classes were accurately classified despite being context-dependent and positioned within longer sentences, indicating that the models likely capture features of these classes and facilitate their classification. Both models show promising potential as classification models for short texts based on users’ intentions and their associated text properties. However, further research may be necessary to improve their accuracy. Suggestions for enhancing their performance include utilizing more recent versions of GPT, such as GPT-3 or GPT-4, optimizing hyperparameters, adjusting preprocessing methods, and adopting alternative approaches to handle data imbalance. Additionally, testing the models on datasets from diverse domains with more intricate contexts could provide greater insight into their limitations. / Text mining has received considerable attention owing to the extensive use of electronic documents. The significant increase in the use of electronic documents has created a need to process and analyze them efficiently. Rule-based systems have traditionally been used to evaluate short pieces of text, but they have limitations, including the substantial manual work needed to create and maintain rules and a high risk of complex errors. As a result, text classification has emerged as a promising solution for extracting meaning from short texts, defined as texts limited to a certain number of characters or words. This study examines whether text classification is feasible and effective for classifying short pieces of text according to their appropriate text properties, based on the users’ intentions in the text. The study focuses on comparing two transformer models, GPT-2 and BERT, in their ability to classify short texts. 
Although other studies have compared these models for intent classification of text, this study is unique in examining their performance on short pieces of text in this specific context. The study uses user-labelled data to fine-tune the models, which are then tested on a test dataset from the same source. The comparative analysis shows that BERT generally performs better than GPT-2 at classifying users’ intentions based on appropriate text properties, with an F1 score of 0.68 compared with GPT-2’s 0.51. GPT-2 did, however, perform better on certain closely related classes, which suggests that both models capture interesting properties of these classes. The results also showed that some classes were classified correctly despite being context-dependent and located in longer sentences, which suggests that the models probably capture properties of these classes that facilitate their classification. Both models show promising potential as classification models for short texts based on users’ intentions and their associated text properties. Further research may nevertheless be needed to improve their accuracy. Suggestions for improving their performance include using newer versions of GPT, such as GPT-3 or GPT-4, optimizing hyperparameters, adjusting preprocessing methods, and adopting alternative methods for handling data imbalance. Testing the models on datasets from different domains with more complex contexts could also give better insight into their limitations. Text classification Short texts Deep Learning BERT GPT GPT-2 Transformers Natural Language Processing Computer and Information Sciences
98	Multi-Class Emotion Classification for Interactive Presentations : A case study on how emotional sentiment analysis can help end users better convey intended emotion Andersson, Charlotte January 2022 Mentimeter is one of the fastest-growing startups in Sweden. It is an audience engagement platform that lets users create interactive presentations and engage an audience. As online information spreads ever faster, methods of analyzing, understanding, and categorizing information are developing and improving rapidly. Natural Language Processing (NLP) is the ability to break down input, for instance text or audio, and process it using technologies such as computational linguistics and statistical learning, machine learning, and deep learning models. This thesis aimed to investigate whether a tool that applies multi-class emotion classification of text could benefit end users when they create presentations using Mentimeter. A case study was conducted in which a pre-trained BERT base model that had been fine-tuned on the GoEmotions data set was applied as a tool in Mentimeter’s presentation software and then evaluated by end users. The results showed that the tool was accurate but, overall, not helpful for end users. For future research, improvements such as including emotions/tones that are more related to presentations would make the tool more applicable to presentations and more helpful, according to end users. / Mentimeter is one of Sweden’s fastest-growing startups and offers a service in which users can create interactive presentations and engage their audience. As information online spreads ever faster, methods for analyzing, understanding, and categorizing information are being developed and improved. 
Natural Language Processing (NLP) is the ability to break down input, such as text and audio, and process it using technologies such as computational linguistics and statistical learning, machine learning, and deep learning models. The aim of this thesis was to investigate whether a tool that applies multi-class emotion classification to text would benefit users when they create presentations with Mentimeter. A case study was carried out in which a pre-trained BERT model that had been fine-tuned and trained on the GoEmotions dataset was applied as a tool in Mentimeter’s software and then evaluated by users. The results show that the tool was accurate, but overall the users did not find it helpful. For future research, improvements such as using emotions/tones that are more related to presentations would make the tool more helpful, according to users. Interactive Presentations Audience Engagement Platform Emotion Prediction Natural Language Processing Text Classification Sentiment Analysis BERT Case Study Computer and Information Sciences
99	DistillaBSE: Task-agnostic distillation of multilingual sentence embeddings : Exploring deep self-attention distillation with switch transformers Bubla, Boris January 2021 The recent development of massive multilingual transformer networks has resulted in drastic improvements in model performance. These models, however, are so large that they suffer from high inference latency and consume vast computing resources. Such characteristics hinder widespread adoption of the models in industry and some academic settings. Thus there is growing research into reducing their parameter count and increasing their inference speed, with significant interest in the use of knowledge distillation techniques. This thesis uses the existing approach of deep self-attention distillation to develop a task-agnostic distillation of the language-agnostic BERT sentence embedding (LaBSE) model. It also explores the use of the Switch Transformer architecture in distillation contexts. The result is DistilLaBSE, a task-agnostic distillation of LaBSE used to create a 10 times faster version of LaBSE, whilst retaining over 99% cosine similarity of its sentence embeddings on a holdout test from the same domain as the training samples, namely the OpenSubtitles dataset. It is also shown that DistilLaBSE achieves similar scores when embedding data from two other domains, namely English tweets and customer support banking data. This faster version of LaBSE allows industry practitioners and resource-limited academic groups to apply a more convenient version of LaBSE to their various applications and research tasks. / The recent development of massive multilingual transformer networks has resulted in drastic improvements in model performance. These models are, however, so large that they suffer from high inference latency and consume large computing resources. Such characteristics hinder broad adoption of the models in industry and in some academic settings. 
Consequently, there is growing research into reducing their parameter count and increasing their inference speed, with strong interest in knowledge distillation techniques. This thesis uses the existing approach of deep self-attention distillation to develop a task-agnostic distillation of the language-agnostic BERT sentence embedding model. It also explores the use of the Switch Transformer architecture in distillation contexts. The result is DistilLaBSE, a task-agnostic distillation of LaBSE used to create a 10x faster version of LaBSE while retaining more than 99% cosine similarity of its sentence embeddings on a holdout test from the same domain as the training samples, namely the OpenSubtitles dataset. It is also shown that DistilLaBSE achieves similar scores when embedding data from two other domains, namely English tweets and customer support banking data. This faster version of LaBSE allows industry practitioners and resource-limited academic groups to apply a more convenient version of LaBSE to their various applications and research tasks. Transformers Knowledge Distillation Natural Language Processing Switch Transformers Computer and Information Sciences
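The "over 99% cosine similarity" figure in the abstract above compares the distilled model's sentence embeddings with those of the original model. A minimal sketch of that kind of comparison follows, assuming both models embed the same sentences into vectors of equal dimension and that the score is averaged over sentences; the abstract does not describe the exact procedure, and the function names and toy vectors are illustrative:

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def mean_cosine_similarity(teacher, student) -> float:
    """Average per-sentence similarity between teacher and student embeddings."""
    pairs = list(zip(teacher, student))
    return sum(cosine_similarity(t, s) for t, s in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy 3-dimensional embeddings standing in for real model outputs.
    teacher = [[0.2, 0.9, 0.1], [0.7, 0.1, 0.6]]
    student = [[0.21, 0.88, 0.12], [0.69, 0.12, 0.61]]
    print(f"mean cosine similarity: {mean_cosine_similarity(teacher, student):.4f}")
```

Because cosine similarity ignores vector length, a high score indicates that the student's embeddings point in nearly the same directions as the teacher's, which is what matters for similarity search and retrieval.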
100	Textbrytning av mäklartexter och slutpris : Med BERT, OLS och Elman regressionsnätverk / Text mining of broker texts and sold price : Using BERT, OLS and Elman regression network Fjellström, Emil, Challita, Johan January 2021 Estimating the final price of a home sale is a complex task in which the broker texts describing the homes are a vital part of the sale. This report investigates whether broker texts can be used to generate more accurate estimates with machine learning models. Two machine learning models were implemented as a result of a literature study and evaluated against Booli’s existing OLS model. The implemented models are OLS-BERT and an Elman regression network. OLS-BERT showed a general improvement over Booli’s OLS model, particularly in the F-statistic, where the value fell by 99.8 percent. The p-value of the T-statistic for “vista” (the view) fell from 37.7 to 1 percent. The Elman regression network lowered the MAPE of Booli’s OLS model from 58.5 to 6.62 percent. The models were evaluated with eight different measures, of which the most important for the study are MAPE, MAE, the F-statistic, and the T-statistic. By extracting attributes from broker texts, the model can explain the significance of the input data and produce somewhat more accurate estimates of the final price of a home sale. The results show that this is an interesting method that merits further exploration. / Estimating the price of home sales is a complex task, where the broker texts describing the housing are a vital part of the sale. This study explores whether broker texts can be used to generate more accurate estimates using machine learning models. Two machine learning models were implemented as a result of a literature study and evaluated against Booli’s existing OLS-model. The implemented machine learning models are OLS-BERT and an Elman regression network. 
OLS-BERT showed a general improvement compared to Booli’s OLS-model; in particular, the F-statistic was 99.8 percent lower than with Booli’s OLS-model. The p-value in the T-statistic for “vista” was 37.7 percent with Booli’s OLS-model and 1 percent with OLS-BERT. The Elman regression network lowered the MAPE of Booli’s OLS-model from 58.5 to 6.62 percent. All models were evaluated using eight different measures, of which the most important for this study are MAPE, MAE, F-statistic, and T-statistic. The conclusion is that by mining attributes from broker texts the models can explain the significance of the input and generate somewhat more accurate estimates of the final sale price of a home. The results show that this is an interesting method that should be further explored. BERT OLS Elman regression machine learning supervised models unsupervised models broker texts attributes
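MAPE and MAE, two of the measures named in the abstract above, follow standard definitions. A brief sketch under those definitions is given here; the function names and the sample prices are illustrative and are not data from the thesis:

```python
from typing import Sequence


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent. Actual values must be non-zero."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error, in the same unit as the inputs."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


if __name__ == "__main__":
    # Hypothetical final prices and model estimates, in millions of SEK.
    sold = [3.0, 4.5, 2.2]
    estimated = [3.3, 4.2, 2.0]
    print(f"MAPE: {mape(sold, estimated):.2f}%  MAE: {mae(sold, estimated):.3f}")
```

MAPE expresses each error relative to the sale price, which makes it comparable across cheap and expensive homes, whereas MAE reports the average error in currency units.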
