Global ETD Search

61	Generating Terraform Configuration Files with Large Language Models / Att skapa Terraform-konfigurationsfiler med stora språkmodeller Bonde, Oskar January 2022 This thesis explores how large language models can be used to generate configuration files for Terraform from natural language descriptions. Few-shot and fine-tuning paradigms are evaluated on decoder-only models of varying size, including the state-of-the-art Codex model. The generated configuration files are evaluated with regard to functional correctness on a custom dataset using Terraform, to account for the large space of functionally equivalent configuration files. Results show that the largest model, Codex, is highly capable of generating configuration files given an English description of network infrastructure, even without fine-tuning. The result could be a useful tool for engineers who know Terraform fundamentals and have experience with the cloud platforms AWS, GCP, or Azure. A future study could fine-tune Codex for Terraform using OpenAI's API or create an open-source Codex replication by fine-tuning the GPT-3 replication OPT, which in turn can be fine-tuned for Terraform. / Denna avhandling undersöker hur stora språkmodeller kan användas till att generera konfigurationsfiler för Terraform med hjälp av språkbeskrivningar. Både few-shot och fine-tuning paradigm utvärderas på decoder-only modeller i olika storlekar, inklusive Codex. För att ta hänsyn till konfigurationsfiler som i utseende ser olika ut men som är funktionellt ekvivalenta utvärderas konfigurationsfilerna utifrån deras funktion. Resultaten visar att Codex, som är den största modellen, har förmågan att generera konfigurationsfiler givet en engelsk beskrivning av nätverksinfrastruktur, trots att Codex inte har undergått fine-tuning. Resultatet kan vara ett användbart verktyg för ingenjörer som har grundläggande kunskap om Terraform och erfarenhet av molnplattformarna: AWS, GCP eller Azure. 
En framtida studie skulle kunna träna Codex för Terraform med OpenAI:s API eller skapa en Codex-kopia genom att träna GPT-3 kopian OPT som i sin tur kan bli tränad för Terraform. Terraform Transformer models Generating configuration files Large Language Models Codex Terraform Transformer-modeller Generera konfigurationsfiler Stora språkmodeller Codex Computer Sciences Datavetenskap (datalogi)
62	Language Models as Evaluators : A Novel Framework for Automatic Evaluation of News Article Summaries / Språkmodeller som Utvärderare : Ett Nytt Ramverk för Automatiserad Utvärdering av Nyhetssammanfattningar Helgesson Hallström, Celine January 2023 The advancements in abstractive summarization using Large Language Models (LLMs) have brought with them new challenges in evaluating the quality and faithfulness of generated summaries. This thesis explores a human-like automated method for evaluating news article summaries. By leveraging two LLMs with instruction-following capabilities (GPT-4 and Claude), the aim is to examine to what extent the quality of summaries can be measured by the predictions of an LLM. The proposed framework involves defining specific attributes of desired summaries, which are used to design generation prompts and evaluation questions. These questions are presented to the LLMs in natural language during evaluation to assess various summary qualities. To validate the effectiveness of the evaluation method, an adversarial approach is employed, in which a dataset comprising summaries with distortions related to various summary attributes is generated. In an experiment, the two LLMs evaluate the adversarial dataset, and their ability to detect known distortions is measured and analyzed. The findings suggest that the LLM-based evaluations demonstrate promise in detecting binary qualitative issues, such as incorrect facts. However, the reliability of the zero-shot evaluation varies depending on the evaluating LLM and the specific questions used. Further research is required to validate the accuracy and generalizability of the results, particularly in subjective dimensions where the results of this thesis are inconclusive. Nonetheless, this thesis provides insights that can serve as a foundation for future advancements in the field of automatic text evaluation. 
/ De framsteg som gjorts inom abstrakt sammanfattning med hjälp av stora språkmodeller (LLM) har medfört nya utmaningar när det gäller att utvärdera kvaliteten och sanningshalten hos genererade sammanfattningar. Detta examensarbete utforskar en mänskligt inspirerad automatiserad metod för att utvärdera sammanfattningar av nyhetsartiklar. Genom att dra nytta av två LLM:er med instruktionsföljande förmågor (GPT-4 och Claude) är målet att undersöka i vilken utsträckning kvaliteten av sammanfattningar kan bestämmas med hjälp av språkmodeller som utvärderare. Det föreslagna ramverket innefattar att definiera specifika egenskaper hos önskade sammanfattningar, vilka används för att utforma genereringsuppmaningar (prompts) och utvärderingsfrågor. Dessa frågor presenteras för språkmodellerna i naturligt språk under utvärderingen för att bedöma olika kvaliteter hos sammanfattningar. För att validera utvärderingsmetoden används ett kontradiktoriskt tillvägagångssätt där ett dataset som innefattar sammanfattningar med förvrängningar relaterade till olika sammanfattningsattribut genereras. I ett experiment utvärderar de två språkmodellerna de motstridiga sammanfattningar, och deras förmåga att upptäcka kända förvrängningar mäts och analyseras. Resultaten tyder på att språkmodellerna visar lovande resultat vid upptäckt av binära kvalitativa problem, såsom faktafel. Dock varierar tillförlitligheten hos utvärderingen beroende på vilken språkmodell som används och de specifika frågorna som ställs. Ytterligare forskning krävs för att validera tillförlitligheten och generaliserbarheten hos resultaten, särskilt när det gäller subjektiva dimensioner där resultaten är osäkra. Trots detta ger detta arbete insikter som kan utgöra en grund för framtida framsteg inom området för automatisk textutvärdering. 
Natural Language Processing Large Language Models Automatic Text Evaluation Text Summarization Multilingualism Naturlig Språkbehandling Stora Språkmodeller Automatisk Textutvärdering Textsammanfattning Flerspråkighet Computer and Information Sciences Data- och informationsvetenskap
63	Cookie Monsters : Using Large Language Models to Measure GDPR Compliance in Cookie Banners Automatically Otterström, Marcus, Palonkorpi, Oliver January 2023 There is a widespread problem of cookie banners not being compliant with the General Data Protection Regulation (GDPR), which negatively impacts user experience and violates personal data rights. To mitigate this issue, strides need to be made in violation detection to assist developers, designers, lawyers, organizations, and authorities in designing and enforcing GDPR-compliant cookie banners. In this thesis, we present a novel method and an open-source tool for automatically analyzing the GDPR compliance of cookie banners. The tool uniquely leverages large language models together with static code analysis to locate and analyze any cookie banner, using only the website address as input. Informed by the Design Science Research methodology, our research process involved interviews with GDPR legal experts and a thorough review of current literature in order to understand the problem context and define the objectives for our solution. After an initial version of the tool was created, an evaluation was performed by a GDPR legal expert. The feedback revealed that even at this early development stage, the tool approaches the capabilities of a trained eye, which illustrates its potential. Furthermore, our proposed method is generalizable and can be used in many domains to solve various problems (e.g., more generalized web scraping). However, further development and testing with the help of legal experts is required to enhance the tool's accuracy and validity. cookie banners gdpr compliance consent large language models design science research Information Systems, Social aspects
64	Characterizing, classifying and transforming language model distributions Kniele, Annika January 2023 Large Language Models (LLMs) have become ever larger in recent years, typically demonstrating improved performance as the number of parameters increases. This thesis investigates how the probability distributions output by language models differ depending on the size of the model. For this purpose, three features for capturing the differences between the distributions are defined, namely the difference in entropy, the difference in probability mass in different slices of the distribution, and the difference in the number of tokens covering the top-p probability mass. The distributions are then put into different distribution classes based on how they differ from the distributions of the differently-sized model. Finally, the distributions are transformed to be more similar to the distributions of the other model. The results suggest that classifying distributions before transforming them, and adapting the transformations based on which class a distribution is in, improves the transformation results. It is also shown that letting a classifier choose the class label for each distribution yields better results than using random labels. Furthermore, the findings indicate that transforming the distributions using entropy and the number of tokens in the top-p probability mass makes the distributions more similar to the targets, while transforming them based on the probability mass of individual slices of the distributions makes the distributions more dissimilar. Large Language Models (LLMs) GPT BERT NLP deep learning machine learning computational linguistics language technology
65	Keeping tabs on GPT-SWE : Classifying toxic output from generative language models for Swedish text generation / Monitorering av GPT-SWE : Klassificering av toxisk text från svenska generativa språkmodeller Pettersson, Isak January 2022 Disclaimer: This paper contains content that can be perceived as offensive or upsetting. Considerable progress has been made in Artificial intelligence (AI) and Natural language processing (NLP) in recent years. Neural language models (LMs) like Generative pre-trained transformer 3 (GPT-3) show impressive results, generating high-quality text seemingly written by a human. Neural language models are already applied in society, for example in creating chatbots or assisting with writing documents. As generative LMs are trained on large amounts of data from all kinds of sources, they can pick up toxic traits. GPT-3 has, for instance, been shown to generate text with social biases, racism, sexism and toxic language. Therefore, filtering for toxic content is necessary to safely deploy models like GPT-3. GPT-3 is trained on and can generate English text data, but similar models for smaller languages have recently emerged. GPT-SWE is a novel model based on the same technical principles as GPT-3, able to generate Swedish text. Much like GPT-3, GPT-SWE has issues with generating toxic text. A promising approach for addressing this problem is to train a separate toxicity classification model for classifying the generated text as either toxic or safe. However, there is a substantial need for more research on toxicity classification for lower-resource languages, and previous studies for the Swedish language are sparse. This study explores the use of toxicity classifiers to filter Swedish text generated from GPT-SWE. This is investigated by creating and annotating a small Swedish toxicity dataset which is used to fine-tune a Swedish BERT model. 
The best-performing toxicity classifier created in this work cannot be considered useful in an applied scenario. Nevertheless, the results encourage continued studies on BERT models that are pre-trained and fine-tuned in Swedish to create toxicity classifiers. The results also highlight the importance of high-quality datasets for fine-tuning and demonstrate the difficulties of toxicity annotation. Furthermore, expert annotators, distinct and well-defined guidelines for annotation, and fine-grained labels are recommended. The study also provides insights into the potential for active learning methods in creating datasets in languages with lower resources. Implications and potential solutions regarding toxicity in generative LMs are also discussed. / Varning: Denna studie omfattar innehåll som kan uppfattas som stötande eller upprörande. Betydande framsteg har gjorts inom Artificiell intelligens (AI) och Språkteknologi (NLP) de senaste åren. Utvecklingen av Neurala språkmodeller har fört med sig framgångsrika modeller likt Generative pre-trained transformer 3 (GPT-3) som visat på imponerande resultat i att generera högkvalitativ text, till synes skriven av en människa. Språkmodeller tillämpas redan på flera platser i samhället till exempel för att hjälpa till med att skriva dokument eller för att skapa chatbots. Eftersom språkmodeller tränas på stora mängder data från många typer av källor kan de fånga upp toxiska egenskaper. GPT-3 har till exempel visat sig generera text med sociala fördomar, rasism, sexism och toxiskt språk. En nödvändighet för att säkert distribuera modeller som GPT-3 inkluderar således filtrering av toxiskt innehåll. GPT-3 är tränad på och kan generera engelsk textdata men liknande modeller för mindre språk har nyligen börjat dyka upp. GPT-SWE är en ny modell som bygger på samma tekniska principer som GPT-3 men kan generera svensk text. Likt GPT-3 så har GPT-SWE problem med genererad toxisk text. 
För att lösa problemen med toxicitet är ett lovande tillvägagångssätt att träna en separat toxicitetsklassificeringsmodell för att klassificera genererad text som toxisk eller säker. Det finns dock en brist på tidigare studier om detta för det svenska språket och det finns ett stort behov av mer forskning kring toxicitetsklassificering för språk med lägre resurser. Följaktligen undersöker detta projekt möjligheterna att använda toxicitetsklassificerare för att filtrera genererad text från svenska språkmodeller. Detta undersöks genom att skapa och annotera ett litet svenskt toxicitets-dataset som används för att finjustera en svensk BERT-modell. Den bäst presterande toxicitetsklassificeraren som skapades inom detta arbete kan inte anses användbar i ett tillämpat scenario. Resultaten uppmuntrar dock fortsatta studier på BERT-modeller förtränade och finjusterade på svenska för att skapa toxicitetsklassificerare. Resultatet skiftar också ytterligare fokus mot vikten av ett kvalitativt dataset för finjustering och påvisar svårigheterna med toxicitets-annotering. Vidare rekommenderas expert-annoterare, distinkta väldefinierade riktlinjer för annotering samt användandet av fler och mer specificerade kategorier för toxicitet. Arbetet ger dessutom insikter om potentialen för metoder som aktiv inlärning för att skapa dataset inom språk med lägre resurser. Fortsättningsvis diskuteras också implikationer och potentiella lösningar angående toxicitet i språkmodeller. Active learning Classification Language models Natural Language Processing Swedish Transformers Toxic text Aktiv inlärning Klassificering Språkmodeller Språkteknologi Svenska Transformer nätverk Toxisk text Computer Sciences Datavetenskap (datalogi)
66	Monolingual and Cross-Lingual Survey Response Annotation Zhao, Yahui January 2023 Multilingual natural language processing (NLP) is increasingly recognized for its potential in processing diverse types of text data, including text from social media, reviews, and technical reports. Multilingual language models like mBERT and XLM-RoBERTa (XLM-R) play a pivotal role in multilingual NLP. Notwithstanding their capabilities, the performance of these models largely relies on the availability of annotated training data. This thesis employs the multilingual pre-trained model XLM-R to examine its efficacy in sequence labelling of responses to open-ended questions on democracy across multilingual surveys. Traditional annotation practices have been labour-intensive and time-consuming, with limited automation attempts. Previous studies often translated multilingual data into English, bypassing the challenges and nuances of native languages. Our study explores automatic multilingual annotation at the token level for democracy survey responses in five languages: Hungarian, Italian, Polish, Russian, and Spanish. The results reveal promising F1 scores, indicating the feasibility of using multilingual models for such tasks. However, the performance of these models is closely tied to the quality and nature of the training set. This research paves the way for future experiments and model adjustments, underscoring the importance of refining training data and optimizing model techniques for enhanced classification accuracy. transfer learning zero-shot cross-lingual transfer model-based transfer multilingual pre-trained language models sequence labeling open-ended questions democracy
67	Self-Reflection on Chain-of-Thought Reasoning in Large Language Models / Självreflektion över Chain-of-Thought-resonerande i stora språkmodeller Praas, Robert January 2023 A strong capability of large language models is Chain-of-Thought reasoning. Prompting a model to ‘think step-by-step’ has led to great performance improvements in solving problems such as planning and question answering, and the extended output provides some evidence about the rationale behind an answer or decision. In search of better, more robust, and interpretable language model behavior, this work investigates self-reflection in large language models. Here, self-reflection consists of feedback from large language models on medical question-answering, and the question is whether the feedback can be used to accurately distinguish between correct and incorrect answers. GPT-3.5-Turbo and GPT-4 provide zero-shot feedback scores on Chain-of-Thought reasoning on the MedQA (medical question-answering) dataset. The question-answering is evaluated on traits such as being structured, relevant and consistent. We test whether the feedback scores differ between questions that were correctly and incorrectly answered by Chain-of-Thought reasoning. The potential differences in feedback scores are statistically tested with the Mann-Whitney U test. Graphical visualization and logistic regressions are performed to preliminarily determine whether the feedback scores are indicative of whether the Chain-of-Thought reasoning leads to the right answer. The results indicate that among the reasoning objectives, the feedback models assign higher feedback scores to questions that were answered correctly than to those that were answered incorrectly. Graphical visualization shows potential for reviewing questions with low feedback scores, although logistic regressions that aimed to predict whether or not questions were answered correctly mostly defaulted to the majority class. 
Nonetheless, there seems to be a possibility for more robust output from self-reflecting language systems. / En stark förmåga hos stora språkmodeller är Chain-of-Thought-resonerande. Att prompta en modell att tänka stegvis har lett till stora prestandaförbättringar vid lösandet av problem som planering och frågebesvarande, och med den utökade outputen ger det en del bevis rörande logiken bakom ett svar eller beslut. I sökandet efter bättre, mer robust och tolkbart beteende hos språkmodeller undersöker detta arbete självreflektion i stora språkmodeller. Forskningsfrågan är: I vilken utsträckning kan feedback från stora språkmodeller, såsom GPT-3.5-Turbo och GPT-4, på ett korrekt sätt skilja mellan korrekta och inkorrekta svar i medicinska frågebesvarande uppgifter genom användningen av Chain-of-Thought-resonerande? Här ger GPT-3.5-Turbo och GPT-4 zero-shot feedback-poäng till Chain-of-Thought-resonerande på datasetet för MedQA (medicinskt frågebesvarande). Frågebesvarandet bör vara strukturerat, relevant och konsekvent. Feedbackpoängen jämförs mellan två grupper av frågor, baserat på om dessa besvarades korrekt eller felaktigt i första hand. Statistisk testning genomförs på skillnaden i feedback-poäng med Mann-Whitney U-testet. Grafisk visualisering och logistiska regressioner utförs för att preliminärt avgöra om feedbackpoängen är indikativa för huruvida Chain-of-Thought-resonerande leder till rätt svar. Resultaten indikerar att bland resonemangsmålen tilldelar feedbackmodellerna fler positiva feedbackpoäng till frågor som besvarats korrekt än de som besvarats felaktigt. Grafisk visualisering visar potential för granskandet av frågor med låga feedbackpoäng, även om logistiska regressioner som syftade till att förutsäga om frågorna besvarades korrekt eller inte för det mesta majoritetsklassen. Icke desto mindre verkar det finnas potential för robustare från självreflekterande språksystem. 
Large language models Chain-of-Thought reasoning Metareasoning Question answering Self-correction Ethical AI Stora språkmodeller Chain-of-Thought-resonemang Metareasoning Frågesvar Självkorrigering Etisk AI Computer and Information Sciences Data- och informationsvetenskap
68	An Empirical Study on Using Codex for Automated Program Repair Zhao, Pengyu January 2023 This thesis explores the potential of Codex, a pre-trained Large Language Model (LLM), for Automated Program Repair (APR) by assessing its performance on the Defects4J benchmark, which includes real-world Java bugs. The study aims to provide a comprehensive understanding of Codex’s capabilities and limitations in generating syntactically and semantically equivalent patches for defects, as well as evaluating its ability to handle defects with different levels of importance and complexity. Additionally, we aim to compare the performance of Codex with other LLMs in the APR domain. To achieve these objectives, we employ a systematic methodology that includes prompt engineering, Codex parameter adjustment, code extraction, patch verification, and Abstract Syntax Tree (AST) comparison. We successfully verified 528 bugs in Defects4J, the highest number among the studies compared, and achieved 53.98% plausible and 26.52% correct patches. Furthermore, we introduce the elle-elle-aime framework, which extends RepairThemAll for Codex-based APR and is adaptable for evaluating other LLMs, such as ChatGPT and GPT-4. The findings of this empirical study provide valuable insights into the factors that impact Codex’s performance on APR, helping to create new prompt strategies and techniques that improve research productivity. / Denna avhandling utforskar potentialen hos Codex, en förtränad LLM, för APR genom att utvärdera dess prestanda på Defects4J-benchmarket som inkluderar verkliga Java-buggar. Studien syftar till att ge en omfattande förståelse för Codex förmågor och begränsningar när det gäller att generera syntaktiskt och semantiskt ekvivalenta patchar för defekter samt att utvärdera dess förmåga att hantera defekter med olika nivåer av betydelse och komplexitet. Dessutom är vårt mål att jämföra prestanda hos Codex med andra LLM inom APR-området. 
För att uppnå dessa mål använder vi en systematisk metodik som inkluderar prompt engineering, justering av Codex-parametrar, kodextraktion, patchverifiering och jämförelse av AST. Vi verifierade framgångsrikt 528 buggar i Defects4J, vilket representerar det högsta antalet bland andra studier, och uppnådde 53,98% plausibla och 26,52% korrekta patchar. Vidare introducerar vi elle-elle-aime ramverket, som utvidgar RepairThemAll för Codex-baserad APR och är anpassningsbart för att utvärdera andra LLM, såsom ChatGPT och GPT-4. Resultaten av denna empiriska studie ger värdefulla insikter i de faktorer som påverkar Codex prestanda på APR och hjälper till att skapa nya promptstrategier och tekniker som förbättrar forskningsproduktiviteten. Automated Program Repair Codex Large Language Models Defects4J Patch Generation Prompt Engineering Automatiserad Programreparation Codex Storskaliga Språkmodeller Defects4J Patchgenerering Promptteknik Computer and Information Sciences Data- och informationsvetenskap
69	Low-resource suicide ideation and depression detection with multitask learning and large language models Breau, Pierre-William 08 1900 Nous évaluons des méthodes de traitement automatique du langage naturel (TALN) pour la détection d’idées suicidaires, de la dépression et de l’anxiété à partir de publications sur les médias sociaux. Comme les ensembles de données relatifs à la santé mentale sont rares et généralement de petite taille, les méthodes classiques d’apprentissage automatique ont traditionnellement été utilisées dans ce domaine. Nous évaluons l’effet de l’apprentissage multi-tâche sur la détection d’idées suicidaires en utilisant comme tâches auxiliaires des ensembles de données disponibles publiquement pour la détection de la dépression et de l’anxiété, ainsi que la classification d’émotions et du stress. Nous constatons une hausse de la performance de classification pour les tâches de détection d’idées suicidaires, de la dépression et de l’anxiété lorsqu’elles sont entraînées ensemble en raison de similitudes entre les troubles de santé mentale à l’étude. Nous observons que l’utilisation d’ensembles de données publiquement accessibles pour des tâches connexes peut bénéficier à la détection de problèmes de santé mentale. Nous évaluons enfin la performance des modèles ChatGPT et GPT-4 dans des scénarios d’apprentissage zero-shot et few-shot. GPT-4 surpasse toutes les autres méthodes testées pour la détection d’idées suicidaires. De plus, nous observons que ChatGPT bénéficie davantage de l’apprentissage few-shot, car le modèle fournit un haut taux de réponses non concluantes si aucun exemple n’est présenté. Enfin, une analyse des faux négatifs produits par GPT-4 pour la détection d’idées suicidaires conclut qu’ils sont dus à des erreurs d’étiquetage plutôt qu’à des lacunes du modèle. / In this work we explore natural language processing (NLP) methods for suicide ideation, depression, and anxiety detection in social media posts. 
Since annotated mental health data is scarce and difficult to come by, classical machine learning methods have traditionally been employed on this type of task due to the small size of the datasets. We evaluate the effect of multi-task learning on suicide ideation detection using publicly available datasets for depression, anxiety, emotion and stress classification as auxiliary tasks. We find that classification performance for suicide ideation, depression, and anxiety improves when the tasks are trained together, because of the proximity between the mental disorders. We observe that publicly available datasets for closely related tasks can benefit the detection of certain mental health conditions. We then perform classification experiments with ChatGPT and GPT-4 in zero-shot and few-shot settings, and find that GPT-4 obtains the best performance of all methods tested for suicide ideation detection. We further observe that ChatGPT benefits the most from few-shot learning, as it struggles to give conclusive answers when no examples are provided. Finally, an analysis of the false negatives output by GPT-4 for suicide ideation concludes that they are due to labeling errors rather than mistakes by the model. Modèles de langage Idées suicidaires Classification de textes Apprentissage multitâche Language models Suicide ideation Text classification Multitask learning
70	Repairing Swedish Automatic Speech Recognition / Korrigering av Automatisk Taligenkänning för Svenska Rehn, Karla January 2021 The quality of automatic speech recognition has improved dramatically over the last few years, but the performance for low- and medium-resource languages such as Swedish is still far from optimal. In this project, KB-BERT, a language model trained on large written corpora, is used to improve the quality of transcriptions for Swedish. The large language model is inserted as a repairing module after the automatic speech recognition, aiming to repair the original output into a transcription more closely resembling the ground truth by using a sequence-to-sequence translation approach. Two automatic speech recognition models are used to transcribe the speech: one is developed in this project using the Kaldi framework, and the other is Microsoft’s Azure Speech to text platform. The performance of the translator is evaluated with four different datasets, three consisting of read speech and one of spontaneous speech. The spontaneous speech and one of the read datasets include both native and non-native speakers. The performance is measured by three different metrics: word error rate, a weighted word error rate, and a semantic similarity measure. The repairs improve the transcriptions of two of the read speech datasets significantly, decreasing the word error rate from 13.69% to 3.05% and from 36.23% to 21.17%. The repairs improve the word error rate from 44.38% to 44.06% on the data with spontaneous speech, and fail on the last read dataset, instead increasing the word error rate. The lower performance on the latter is likely due to lack of data. / Automatisk taligenkänning har förbättrats de senaste åren, men för små språk såsom svenska är prestandan fortfarande långt ifrån optimal. 
Det här projektet använder KB-BERT, en neural språkmodell tränad på stora mängder skriven text, för att förbättra kvalitén på transkriptioner av svenskt tal. Transkriptionerna kommer från två olika taligenkänningsmodeller, dels en utvecklad i det här projektet med hjälp av mjukvarubiblioteket Kaldi, dels Microsoft Azures plattform för tal till text. Transkriptionerna repareras med hjälp av en sequence-to-sequence översättningsmodell, och KB-BERT används för att initiera modellen. Översättningen sker från den ursprungliga transkriptionen från en av tal-till-text-modellerna till en transkription som är mer lik den korrekta, faktiska transkriptionen. Kvalitén på reparationerna evalueras med tre olika metriker, på fyra olika dataset. Tre av dataseten är läst tal och det fjärde spontant, och det spontana talet samt ett av de lästa dataseten kommer både från talare som har svenska som modersmål, och talare som har det som andraspråk. De tre metrikerna är word error rate, en viktad word error rate, samt ett mått på semantisk likhet. Reparationerna förbättrar transkriptionerna från två av de lästa dataseten markant, och sänker word error rate från 13.69% till 3.05% och från 36.23% till 21.17%. På det spontana talet sänks word error rate från 44.38% till 44.06%. Reparationerna misslyckas på det fjärde datasetet, troligen på grund av dess lilla storlek. Automatic speech recognition Dialogue systems Language models ASR Repair Automatisk taligenkänning Dialogsystem Språkmodeller Reparation av taligenkänning Computer and Information Sciences Data- och informationsvetenskap
