1 |
What the BERT? : Fine-tuning KB-BERT for Question Classification / Vad i BERT? : Finjustering av KB-BERT för frågeklassificering
Cervall, Jonatan, January 2021
This work explores the capabilities of KB-BERT on the downstream task of question classification. The TREC data set for question classification with the Li and Roth taxonomy was translated to Swedish by manually correcting the output of Google's Neural Machine Translation, and 500 new data points were added. The fine-tuned model was compared with a similarly trained model based on Multilingual BERT, a human evaluation, and a simple rule-based baseline. Of the four methods in this work, the Swedish BERT model (SwEAT-BERT) performed best, achieving 91.2% accuracy on TREC-50 and 96.2% accuracy on TREC-6. The human evaluation performed worse than both BERT models, but doubt is cast on how fair this comparison is. SwEAT-BERT's results are competitive even when compared to similar models based on English BERT. This furthers the notion that the only roadblock in training language models for smaller languages is the amount of readily available training data. / This work explores how well the Swedish BERT model, KB-BERT, performs on question classification. BERT is a transformer model that creates contextual, bidirectional word embeddings. The English data set for question classification, TREC, was translated to Swedish and extended with 500 new data points. Two BERT models were fine-tuned on this new TREC data set, one based on KB-BERT and one based on Multilingual BERT, a multilingual variant of BERT trained on data from 104 languages (including Swedish). A rule-based model was built as a lower bound on the problem, and a human classification study was carried out for comparison. The BERT model based on KB-BERT (SwEAT-BERT) achieved 96.2% accuracy on TREC with 6 categories and 91.2% accuracy on TREC with 50 categories. The human classification achieved worse results than both BERT models, but it is doubtful how fair this comparison is.
SwEAT-BERT performed best of the methods tested in this study, and competitively in comparison with English BERT models fine-tuned on the English TREC data set. This result strengthens the view that the availability of training data is the only thing standing in the way of stronger language models for smaller languages.
2 |
Re-ranking search results with KB-BERT / Omrankning av sökresultat med KB-BERT
Viðar Kristjánsson, Bjarki, January 2022
This master's thesis aims to determine whether a Swedish BERT model can improve a BM25 search by re-ranking the top search results. We compared a standard BM25 search algorithm with a more complex algorithm composed of a BM25 search followed by re-ranking of the top 10 results by a BERT model. The BERT model used is KB-BERT, a publicly available neural network model built by the National Library of Sweden. We fine-tuned this model to solve the specific task of evaluating the relevance of search results. A new Swedish search evaluation dataset was automatically generated from Wikipedia text to compare the algorithms. The search evaluation dataset is a standalone product and can be useful for evaluating other search algorithms on Swedish text in the future. The comparison of the two algorithms resulted in a slightly better ranking for the BERT re-ranking algorithm. These results align with similar studies using an English BERT and an English search evaluation dataset. / This master's thesis aims to determine whether a Swedish BERT model can improve a BM25 search by re-ranking the top search results. We compared a standard BM25 search algorithm with a more complex algorithm consisting of a BM25 search followed by re-ranking of the 10 best results with a BERT model. The BERT model used is KB-BERT, a publicly available neural network model built by the National Library of Sweden. We fine-tuned this model to solve the specific task of evaluating the relevance of search results. A new Swedish data set for evaluating search results was generated automatically from Wikipedia text in order to compare the algorithms. The data set is a standalone product and can be useful for evaluating other search algorithms on Swedish text in the future. The comparison of the two algorithms resulted in a slightly better ranking for the BERT re-ranking algorithm.
These results are consistent with similar studies that use an English BERT and an English data set for evaluating search results.
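The two-stage search described in this abstract (BM25 retrieval, then re-ranking of the top 10 hits) can be sketched roughly as follows. This is a minimal illustration, not the thesis implementation: the BM25 variant and the parameters `k1` and `b` are common defaults assumed here, and `relevance` is a hypothetical stand-in for the fine-tuned KB-BERT relevance scorer.

```python
import math
from collections import Counter
from typing import Callable, List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenized document against the query with Okapi BM25.

    k1 and b are commonly used defaults (an assumption, not taken from the thesis).
    """
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def search_and_rerank(query: List[str], docs: List[List[str]],
                      relevance: Callable[[List[str], List[str]], float],
                      top_k: int = 10) -> List[int]:
    """Return document indices: BM25 order, with only the top_k hits re-ranked.

    `relevance` is a hypothetical placeholder for a fine-tuned model that
    scores a (query, document) pair; higher means more relevant.
    """
    scores = bm25_scores(query, docs)
    # Stage 1: rank all documents by BM25 score.
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    head, tail = ranked[:top_k], ranked[top_k:]
    # Stage 2: reorder only the BM25 top_k by the model's relevance score.
    head = sorted(head, key=lambda i: relevance(query, docs[i]), reverse=True)
    return head + tail
```

Because only the first `top_k` hits are rescored, the expensive model is called a fixed number of times per query regardless of collection size; documents that BM25 ranks below the cut-off cannot be promoted.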
3 |
Period Drama : Punctuation restoration in Swedish through fine-tuned KB-BERT / Dags att sätta punkt : Återställning av skiljetecken genom finjusterad KB-BERT
Sinderwing, John, January 2021
Presented here is a method for automatic punctuation restoration in Swedish using a BERT model. The method is based on KB-BERT, a publicly available neural network language model pre-trained on a Swedish corpus by the National Library of Sweden. This model was then fine-tuned for this specific task using a corpus of government texts. Given lower-case, unpunctuated Swedish text as input, the model is supposed to return a grammatically correct, punctuated copy of the text as output. A successful solution to this problem brings benefits for an array of NLP domains, such as speech-to-text and automated text generation. Only the punctuation marks period, comma and question mark were considered for the project, due to a lack of data for rarer marks such as the semicolon. Additionally, some marks are somewhat interchangeable with the more common ones, such as exclamation points and periods. Thus, the data set had all exclamation points replaced with periods. The fine-tuned Swedish BERT model, dubbed prestoBERT, achieved an overall F1-score of 78.9. The proposed model scored similarly to international counterparts, with Hungarian and Chinese models obtaining F1-scores of 82.2 and 75.6 respectively. As a further comparison, a human evaluation case study was carried out. The human test group achieved an overall F1-score of 81.7, but scored substantially worse than prestoBERT on both period and comma. Inspecting output sentences from the model and the humans shows satisfactory results, despite the difference in F1-score. The disconnect seems to stem from an unnecessary focus on replicating exactly the punctuation used in the test set, rather than providing any one of the several correct interpretations. If the loss function could be rewritten to reward all grammatically correct outputs, rather than only the one original example, the performance could improve significantly for both prestoBERT and the human group.
/ Presented here is a method for automatic restoration of punctuation marks in Swedish using a neural network in the form of a BERT model. The method builds on KB-BERT, a publicly available language model trained on a Swedish corpus by the National Library of Sweden. This model was then fine-tuned for this specific task using a corpus of public texts from county councils and the like. Given Swedish text without capital letters and punctuation as input, the model should return a copy of the text in which correct punctuation marks have been placed in the right positions. A successful model brings benefits for a range of domains within natural language processing, such as speech-to-text transcription and automated text generation. Only the punctuation marks period, comma and question mark are considered in the project, owing to a lack of data for rarer marks such as the semicolon. In addition, some punctuation marks are more or less interchangeable with the three most common ones, such as exclamation point with period. All exclamation points in the data set have therefore been replaced with periods. The fine-tuned Swedish BERT model, called prestoBERT, obtained an overall F1-score of 78.9. The corresponding international models for Hungarian and Chinese obtained overall F1-scores of 82.2 and 75.6 respectively. This suggests that prestoBERT is on a level similar to state-of-the-art counterparts. As a further comparison, a case study with human evaluation was carried out. The test group achieved an overall F1-score of 81.7, but performed substantially worse than prestoBERT on both period and comma. Inspection of the output from the model and the humans shows satisfactory results from both, despite the difference in F1-score. The difference appears to stem from an unnecessary focus on replicating exactly the punctuation used in the original data, rather than producing any one of the many correct interpretations that often exist.
If the loss function could be rewritten to reward all grammatically correct output, rather than only the original example, performance could improve considerably for both prestoBERT and the human group.
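The data preparation implied by this abstract (lower-casing, folding exclamation points into periods, keeping only period, comma and question mark) might look roughly like the sketch below. The label names, the function names and the tokenization rule are assumptions made for illustration; the abstract does not specify the label scheme used for prestoBERT.

```python
import re
from typing import List, Tuple

# Label names are illustrative only; "O" means "no mark after this word".
LABELS = {".": "PERIOD", ",": "COMMA", "?": "QUESTION"}


def make_examples(text: str) -> List[Tuple[str, str]]:
    """Turn punctuated text into (lower-cased word, following-mark label) pairs."""
    text = text.replace("!", ".")  # exclamation points are treated as periods
    pairs: List[Tuple[str, str]] = []
    # A token is either a run of word characters or a single non-space symbol.
    for token in re.findall(r"\w+|[^\w\s]", text):
        if re.fullmatch(r"\w+", token):
            pairs.append((token.lower(), "O"))
        elif token in LABELS and pairs and pairs[-1][1] == "O":
            word, _ = pairs[-1]
            pairs[-1] = (word, LABELS[token])
        # Every other mark (semicolon, colon, quotes, hyphen, ...) is dropped.
    return pairs


def restore(pairs: List[Tuple[str, str]]) -> str:
    """Rebuild a punctuated (still lower-case) string from predicted labels."""
    marks = {label: mark for mark, label in LABELS.items()}
    return " ".join(word + marks.get(label, "") for word, label in pairs)
```

The words alone form the model input, and the labels are the targets. Scoring predicted labels against these targets rewards only the original punctuation, which is the limitation the abstract points out: an equally grammatical alternative counts as an error.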
4 |
Aspektbaserad Sentimentanalys för Business Intelligence inom E-handeln / Aspect-Based Sentiment Analysis for Business Intelligence in E-commerce
Eriksson, Albin; Mauritzon, Anton, January 2022
Many companies strive to make data-driven decisions. To achieve this, they need to explore new tools for Business Intelligence. The aim of this study was to examine the performance and usability of aspect-based sentiment analysis as a tool for Business Intelligence in e-commerce. The study was conducted in collaboration with Ellos Group AB, which supplied anonymous customer feedback data. The implementation consists of two parts, aspect extraction and sentiment classification. The first part, aspect extraction, was implemented using dependency parsing and various aspect grouping techniques. The second part, sentiment classification, was implemented using the language model KB-BERT, a Swedish version of the BERT model. The method for aspect extraction achieved a satisfactory precision of 79.5% but a recall of only 27.2%. Moreover, the result for sentiment classification was unsatisfactory, with an accuracy of 68.2%. Although the results fall short of expectations, we conclude that aspect-based sentiment analysis in general is a great tool for Business Intelligence, both as a means of generating customer insights from previously unused data and as a way to increase productivity. However, it should only be used as a supportive tool and not to replace existing processes for decision-making. / Many companies strive to make data-driven decisions. To achieve this, they need to explore new methods for Business Intelligence. The aim of this study was to examine the performance and usability of aspect-based sentiment analysis as a tool for Business Intelligence in e-commerce. The study was carried out in collaboration with Ellos Group AB, which provided data consisting of anonymous customer feedback. The implementation consists of two parts, aspect extraction and sentiment classification. Aspect extraction was implemented using dependency parsing and various aspect grouping techniques. Sentiment classification was implemented using the language model KB-BERT, a Swedish version of BERT.
The method for aspect extraction achieved a satisfactory precision of 79.5% but a recall of only 27.2%. The result for sentiment classification was unsatisfactory, with an accuracy of 68.2%. Although the results fall short of expectations, we conclude that aspect-based sentiment analysis is in general a good tool for Business Intelligence, both as a way of generating customer insights from previously unused data and as a way of increasing productivity. However, it should only be used as a supporting tool and should not replace existing decision-making processes.
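To see how strongly the low recall weighs against the high precision, the two reported figures can be combined into an F1-score. The abstract itself does not report an F1-score for aspect extraction; the value computed below is derived here from the stated precision and recall using the standard harmonic-mean definition, and the function name is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract for aspect extraction.
aspect_f1 = f1_score(0.795, 0.272)
print(f"{aspect_f1:.3f}")  # prints 0.405
```

The harmonic mean is pulled toward the smaller of the two values, so a method that finds only about a quarter of the aspects lands at roughly 0.41 even though four out of five extracted aspects are correct.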
5 |
Punctuation Restoration as Post-processing Step for Swedish Language Automatic Speech Recognition
Gupta, Ishika, January 2023
This thesis focuses on the Swedish language, where punctuation restoration, especially as a post-processing step for the output of Automatic Speech Recognition (ASR) applications, needs further research. I have collaborated with NewsMachine AB, a company that provides large-scale media monitoring services for its clients, for which it employs ASR technology to convert spoken content into text. This thesis follows an approach initially designed for high-resource languages such as English. The method is based on KB-BERT, a pre-trained Swedish neural network language model developed by the National Library of Sweden. The project uses KB-BERT with a Bidirectional Long Short-Term Memory (BiLSTM) layer on top for the task of punctuation restoration. The model is fine-tuned using the TED Talk 2020 dataset in Swedish, which is acquired from OPUS (an open-source parallel corpus). The punctuation marks comma, period, question mark, and colon are considered for this project. A comparative analysis is conducted between two KB-BERT models: bert-base-swedish-cased and albert-base-swedish-cased-alpha. The fine-tuned Swedish BERT-BiLSTM model, trained on 5 classes, achieved an overall F1-score of 81.6%, surpassing the performance of the ALBERT-BiLSTM model, which was also trained on 5 classes and obtained an overall F1-score of 66.6%. Additionally, the BERT-BiLSTM model, trained on 4 classes (excluding colon), outperformed prestoBERT, an existing model designed for the same task in Swedish, with an overall F1-score of 82.8%. In contrast, prestoBERT achieved an overall F1-score of 78.9%. As a further evaluation of the model's performance on ASR transcribed text, noise was injected based on four probabilities (0.05, 0.1, 0.15, 0.2) into a copy of the test data in the form of three word-level errors (deletion, substitution, and insertion). The performance of the BERT-BiLSTM model substantially decreased for all the errors as the probability of noise injected increased.
In contrast, the model still performed comparatively better when dealing with deletion errors as compared to substitution and insertion errors. Lastly, the data resources received from NewsMachine AB were used to perform a qualitative assessment of how the model performs in punctuating real transcribed data as compared to human judgment.
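The noise-injection evaluation described in this abstract can be sketched as below. The three error types and the four probabilities come from the abstract; everything else is assumed for illustration, including the function name, the choice to draw replacement and inserted words uniformly from a vocabulary list, and the choice to insert the extra word after the current one.

```python
import random
from typing import List, Sequence

ERROR_TYPES = ("deletion", "substitution", "insertion")
NOISE_PROBABILITIES = (0.05, 0.1, 0.15, 0.2)  # the four levels named in the abstract


def inject_noise(words: Sequence[str], error: str, p: float,
                 vocab: Sequence[str], rng: random.Random) -> List[str]:
    """Apply one word-level error type to each word independently with probability p."""
    if error not in ERROR_TYPES:
        raise ValueError(f"unknown error type: {error!r}")
    noisy: List[str] = []
    for word in words:
        if rng.random() >= p:
            noisy.append(word)                       # leave the word untouched
        elif error == "deletion":
            continue                                 # drop the word
        elif error == "substitution":
            noisy.append(rng.choice(vocab))          # replace with a random word (assumption)
        else:  # insertion
            noisy.extend([word, rng.choice(vocab)])  # add a random word after it (assumption)
    return noisy
```

Running the same test sentences through each error type at each probability, with a fixed seed such as `random.Random(0)`, gives twelve noisy copies whose scores can be compared against the clean baseline.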
6 |
Predictive maintenance using NLP and clustering support messages
Yilmaz, Ugur, January 2022
Communication with customers is a major part of the customer experience as well as a rich source of data for mining. More businesses are engaging with consumers via text messages. Before 2020, 39% of businesses already used some form of text messaging to communicate with their consumers, and many more were expected to adopt the technology after 2020 [1]. Email response rates are merely 8%, compared to a response rate of 45% for text messaging [2]. A significant portion of this communication involves customer enquiries or support messages sent in both directions. According to estimates, more than 80% of today's data is stored in an unstructured format (such as text, image, audio, or video) [3], with a significant portion of it expressed in ambiguous natural language. When analyzing such data, qualitative data analysis techniques are usually employed. To facilitate the automated examination of huge corpora of textual material, researchers have turned to natural language processing techniques [4]. In light of the statistics above, Billogram [5] has decided that support messages between creditors and recipients can be mined for predictive maintenance purposes, such as early identification of an outlier like a bug, defect, or wrongly built feature. Stated in one sentence, Billogram is looking for an answer to "why are people reaching out to begin with?" This thesis project discusses implementing unsupervised clustering of support messages using natural language processing methods, together with performance metrics for the results, to answer Billogram's question. The research also covers intent recognition for the clustered messages in two different ways, one automatic and one semi-manual; the results are discussed and compared. The first approach, LDA with manual intent assignment, produced 100 topics and a coherence score of 0.293.
The second approach, on the other hand, produced 158 clusters with UMAP and HDBSCAN, with automatic intent recognition. Creating clusters will help identify issues that can be subjects of increased focus, automation, or even down-prioritizing. Therefore, this research lands in the predictive maintenance [9] area. This study, which will improve over time with more iterations in the company, also contains the preliminary work for "labeling" or "describing" clusters and their intents.
7 |
Miljöpartiet and the never-ending nuclear energy debate : A computational rhetorical analysis of Swedish climate policy
Dickerson, Claire, January 2022
The domain of rhetoric has changed dramatically since its inception as the art of persuasion. It has adapted to encompass many forms of digital media, including, for example, data visualization and coding as a form of literature, but the approach has frequently been that of an outsider looking in. The use of comprehensive computational tools as a part of rhetorical analysis has largely been lacking. In this report, we attempt to address this lack by means of three case studies in natural language processing tasks, all of which can be used as part of a computational approach to rhetoric. At the same time, it is becoming all the more important to transition to renewable energy in order to keep global warming under 1.5 degrees Celsius and ensure that countries meet the conditions of the Paris Agreement. Thus, we make use of speech data on climate policy from the Swedish parliament to ground these three analyses in semantic textual similarity, topic modeling, and political party attribution. We find that speeches are, to a certain extent, consistent within parties, given that a slight majority of the most semantically similar speeches come from the same party. We also find that some of the most common topics discussed in these speeches are nuclear energy and the Swedish Green party, purported environmental risks due to renewable energy sources, and the job market. Finally, we find that though pairs of speeches are semantically similar, party rhetoric on the whole is generally not unique enough for speeches to be distinguishable by party. These results then open the door for a broader exploration of computational rhetoric for Swedish political science in the future.