201

Grammatical Error Correction for Learners of Swedish as a Second Language

Nyberg, Martina January 2022 (has links)
Grammatical Error Correction refers to the task of automatically correcting errors in written text, typically in texts written by learners of a second language. The work in this thesis implements and evaluates two methods for Grammatical Error Correction (GEC) for Swedish. In addition, the proposed methods are compared to an existing, rule-based system. Previous research on GEC for Swedish is limited and has not yet exploited the potential of neural networks. The first method implemented in this work is based on a neural machine translation approach, training a Transformer model to translate erroneous text into a corrected version. A parallel dataset containing artificially generated errors is created to train the model. The second method uses a Swedish version of the pre-trained language model BERT to estimate the likelihood of potential corrections in an erroneous text. Using the SweLL gold corpus, which consists of essays written by learners of Swedish, the proposed methods are evaluated with GLEU and through a manual evaluation based on the types of errors and their corresponding corrections found in the essays. The results show that the two methods correct approximately the same number of errors but differ in which error types they handle best. Specifically, the translation approach covers a wider range of error types and is superior for syntactic and punctuation errors. In contrast, the language model approach yields consistently higher recall and outperforms the translation approach on lexical and morphological errors. To improve the results, future work could investigate the effect of increased model size and training data, as well as the potential of combining the two methods.
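A minimal sketch of the second approach described above: a pre-trained Swedish BERT is used as a masked language model to score candidate corrections at a given position. The checkpoint name and the toy candidate set are assumptions for illustration, not the configuration used in the thesis.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL = "KB/bert-base-swedish-cased"  # assumed Swedish BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL).eval()

def score_candidates(tokens, position, candidates):
    """Rank candidate corrections by the LM probability at a masked position."""
    masked = tokens.copy()
    masked[position] = tokenizer.mask_token
    inputs = tokenizer(" ".join(masked), return_tensors="pt")
    mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_index]
    probs = logits.softmax(dim=-1)
    scored = []
    for cand in candidates:
        ids = tokenizer(cand, add_special_tokens=False).input_ids
        if len(ids) == 1:  # keep the sketch to single-subword candidates
            scored.append((cand, probs[ids[0]].item()))
    return sorted(scored, key=lambda x: x[1], reverse=True)

# Toy example: choose between two verb forms at position 2.
print(score_candidates("Jag har bott i Sverige".split(), 2, ["bott", "bodde"]))
```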
202

Towards terminology-based keyword extraction

Krassow, Cornelius January 2022 (has links)
The digitization of information has produced an overflow of data in many areas of society, including the clinical sector. However, confidentiality issues concerning the privacy of both clinicians and patients have hampered research into how best to deal with this kind of "clinical" data. One example of clinical data that exists in abundance is the Electronic Medical Record (EMR). EMRs contain information about a patient's medical history, such as summaries of earlier visits, prescribed medications and more. EMRs can be quite extensive, and reading them in full can be time-consuming, especially given the often hectic nature of hospital work. Giving clinicians the ability to quickly see which information is most important in an extensive EMR could therefore be very useful. Keyword extraction comprises methods developed in the field of language technology that aim to automatically extract the most important terms or phrases from a text. Applying these methods successfully to EMR data could lend clinicians a helping hand when time is short. Clinical data are highly domain-specific, however, requiring different kinds of expert knowledge depending on which field of medicine is being investigated. Due to the scarcity of research not only on clinical keyword extraction but on clinical data as a whole, foundational groundwork on how best to meet the domain-specific demands of a clinical keyword extractor needs to be laid. By exploring how the two unsupervised approaches YAKE! and KeyBERT handle the domain-specific task of implant-focused keyword extraction, the limitations of clinical keyword extraction are tested. Furthermore, the performance of a general BERT model is compared to that of a model fine-tuned on domain-specific data. Finally, an attempt is made to create a domain-specific set of gold-standard keywords by combining unsupervised approaches to keyword extraction. The results show that unsupervised approaches perform poorly on domain-specific tasks that do not have a clear correlation to the main domain of the data. Fine-tuned BERT models seem to perform almost as well as a general model when tasked with implant-focused keyword extraction, although further research is needed. Finally, the use of unsupervised approaches in conjunction with manual evaluation by domain experts shows some promise.
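A hedged sketch of the two unsupervised extractors compared in the thesis, run on an invented Swedish note; the parameters, the multilingual embedding model and the example text are illustrative assumptions, not the thesis configuration.

```python
import yake
from keybert import KeyBERT

text = (
    "Patienten opererades med en cementfri höftprotes. "
    "Inga komplikationer noterades vid uppföljningen."
)

# YAKE!: statistical, model-free keyword extraction; lower scores are better.
yake_extractor = yake.KeywordExtractor(lan="sv", n=2, top=5)
yake_keywords = yake_extractor.extract_keywords(text)

# KeyBERT: ranks candidate phrases by embedding similarity to the document.
kb_model = KeyBERT(model="paraphrase-multilingual-MiniLM-L12-v2")  # assumed model choice
kb_keywords = kb_model.extract_keywords(text, keyphrase_ngram_range=(1, 2), top_n=5)

print("YAKE!:", yake_keywords)
print("KeyBERT:", kb_keywords)
```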
203

En utvärdering av tjänster för taligenkänning och textsammanfattning och möjligheter att skapa undertexter i filmer. / An evaluation of services for speech recognition and text summarization and the ability to create subtitles in movies.

Kjerrström, Linus, Pham Huy, Hoang January 2022 (has links)
Creating subtitles for movies is today a handcraft and a time-consuming process. The company Firstlight Media subtitles around 200 movies per week entirely manually, and each movie usually takes around 4–6 hours to finish. If steps in the subtitling process could be automated, there is a possibility of saving resources. The work consisted of evaluating whether it is possible to automate parts of the process of subtitling movies. To analyze this, a literature study was conducted on previous work in the areas of automatic speech recognition and text summarization. After the study, a number of services for both speech recognition and text summarization were tested on three different movies to evaluate whether the services are suitable to use when subtitling movies. The testing of the services led to an analysis of the results, which showed that text summarization was not suitable, whereas speech recognition was to some extent useful for automating the transcription of the spoken language in the movies.
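A sketch of the automation step the thesis found most promising: turning timestamped speech-recognition output into a draft subtitle file (SRT) for a human subtitler to edit. The segment data is invented for illustration and no particular ASR service is assumed.

```python
def to_srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used by SRT files."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments):
    """segments: iterable of dicts with 'start', 'end' (seconds) and 'text'."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{to_srt_timestamp(seg['start'])} --> "
            f"{to_srt_timestamp(seg['end'])}\n{seg['text']}\n"
        )
    return "\n".join(blocks)

example = [
    {"start": 0.0, "end": 2.4, "text": "Var är du?"},
    {"start": 2.6, "end": 5.1, "text": "Jag är på väg hem."},
]
print(segments_to_srt(example))
```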
204

Coreference Resolution for Swedish / Koreferenslösning för svenska

Vällfors, Lisa January 2022 (has links)
This report explores possible avenues for developing coreference resolution methods for Swedish. Coreference resolution is an important topic within natural language processing, as it is used as a preprocessing step in various information extraction tasks. The topic has been studied extensively for English, but much less so for smaller languages such as Swedish. In this report we adapt two coreference resolution algorithms, originally developed for English, for use on Swedish texts. One algorithm is entirely rule-based, while the other uses machine learning. We have also annotated parts of the Talbanken corpus with coreference relations, to be used for training and evaluation. Both algorithms showed promising results, and as neither clearly outperformed the other we conclude that both are good candidates for further development. For the rule-based algorithm, more advanced rules, especially ones that could incorporate some semantic knowledge, were identified as the most important avenue of improvement. For the machine learning algorithm, more training data would likely be the most beneficial. For both algorithms, improved detection of mention spans would also help, as this was identified as one of the most error-prone components.
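A very small sketch of the rule-based flavour of coreference resolution: mentions are linked into clusters by exact string match or head-word match, in document order. The mention representation and rules are illustrative only; an actual system would use richer sieves and Swedish-specific features.

```python
from dataclasses import dataclass, field

@dataclass
class Mention:
    text: str
    head: str
    cluster: int = field(default=-1)

def resolve(mentions):
    """Greedy left-to-right clustering with string- and head-match rules."""
    clusters = []
    for m in mentions:
        for cid, members in enumerate(clusters):
            if any(m.text.lower() == a.text.lower() or m.head.lower() == a.head.lower()
                   for a in members):
                m.cluster = cid
                members.append(m)
                break
        else:
            m.cluster = len(clusters)
            clusters.append([m])
    return clusters

doc = [Mention("Anna Larsson", "Larsson"), Mention("rapporten", "rapporten"),
       Mention("Larsson", "Larsson")]
for cid, cluster in enumerate(resolve(doc)):
    print(cid, [m.text for m in cluster])
```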
205

Text ranking based on semantic meaning of sentences / Textrankning baserad på semantisk betydelse hos meningar

Stigeborn, Olivia January 2021 (has links)
Finding a suitable candidate-to-client match is an important part of a consulting company's work. It takes a lot of time and effort for the recruiters at the company to read possibly hundreds of resumes to find a suitable candidate. Natural language processing is capable of performing a ranking task where the goal is to rank the resumes so that the most suitable candidates are ranked highest. This ensures that the recruiters only need to look at the top-ranked resumes and can quickly get candidates out in the field. Previous research has used methods that count specific keywords in resumes and can decide whether a candidate has a given experience or not. The main goal of this thesis is to use the semantic meaning of the text in the resumes to get a deeper understanding of a candidate's level of experience. It also evaluates whether the model can run on-device and whether the database can contain a mix of English and Swedish resumes. An algorithm was created that uses the word embedding model DistilRoBERTa, which is capable of capturing the semantic meaning of text. The algorithm was evaluated by generating job descriptions from the resumes by creating a summary of each resume. The run time, memory usage and the rank achieved by the target candidate were documented and used to analyze the results. When the candidate who was used to generate the job description was ranked in the top 10, the classification was considered correct. The accuracy was calculated using this method and an accuracy of 68.3% was achieved. The results show that the algorithm is capable of ranking resumes. The algorithm is able to rank both Swedish and English resumes, with an accuracy of 67.7% for Swedish resumes and 74.7% for English. The run time was fast enough at an average of 578 ms, but the memory usage was too large to make it possible to run the algorithm on-device. In conclusion, the semantic meaning of resumes can be used to rank them, and possible future work would be to combine this method with a method that counts keywords to investigate whether the accuracy would increase.
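A hedged sketch of the ranking idea: embed the job description and each resume with a DistilRoBERTa-based sentence encoder and rank by cosine similarity. The checkpoint name and toy texts are assumptions; the thesis pipeline (summary generation, on-device constraints) is not reproduced here.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-distilroberta-v1")  # assumed checkpoint

job_description = "Backend developer with several years of Java and cloud experience."
resumes = {
    "cand_a": "Five years as a Java backend engineer, AWS and Kubernetes.",
    "cand_b": "Graphic designer focused on branding and print media.",
}

# Score each resume by cosine similarity to the job description embedding.
job_emb = model.encode(job_description, convert_to_tensor=True)
scores = {
    name: util.cos_sim(job_emb, model.encode(text, convert_to_tensor=True)).item()
    for name, text in resumes.items()
}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```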
206

Conversational Engine for Transportation Systems

Sidås, Albin, Sandberg, Simon January 2021 (has links)
Today's communication between operators and professional drivers takes place through direct conversations between the parties. This thesis project explores the possibility of supporting the operators by classifying the topic of incoming communications and identifying which entities are affected, using named entity recognition and topic classification. By developing a synthetic training dataset, a NER model and a topic classification model were developed and evaluated, achieving F1-scores of 71.4 and 61.8 respectively. These results are explained by the low variance of the synthetic dataset compared to a transcribed real-world dataset, which included anomalies not represented in the synthetic data. The models were integrated into the dialogue framework Emora to seamlessly handle the back-and-forth communication and generate responses.
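An illustrative sketch of how a synthetic training set for the NER and topic classification models might be generated from templates, which also hints at why variance ends up low: every example follows the same few patterns. The templates, labels and slot values are invented for the example.

```python
import random

TEMPLATES = {
    "delay_report": ["Bus {line} is delayed near {place}.",
                     "We are running late on line {line} at {place}."],
    "vehicle_issue": ["The {vehicle} on line {line} has a technical problem.",
                      "Engine warning on the {vehicle} serving line {line}."],
}
SLOTS = {"line": ["4", "12", "55"], "place": ["Central Station", "the depot"],
         "vehicle": ["bus", "tram"]}

def generate(n=5, seed=0):
    """Fill random templates with random slot values; unused slots are ignored."""
    random.seed(seed)
    data = []
    for _ in range(n):
        topic = random.choice(list(TEMPLATES))
        template = random.choice(TEMPLATES[topic])
        filled = template.format(**{k: random.choice(v) for k, v in SLOTS.items()})
        data.append({"text": filled, "topic": topic})
    return data

for example in generate():
    print(example)
```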
207

Building high-quality datasets for abstractive text summarization : A filtering‐based method applied on Swedish news articles

Monsen, Julius January 2021 (has links)
With an increasing amount of information on the internet, automatic text summarization could potentially make content more readily available to a larger variety of people. Training and evaluating text summarization models require datasets of sufficient size and quality. Today, most such datasets are in English, and for smaller languages such as Swedish it is not easy to obtain corresponding datasets with handwritten summaries. This thesis proposes methods for compiling high-quality datasets suitable for abstractive summarization from a large amount of noisy data through characterization and filtering. The data used consist of Swedish news articles and their preambles, which are here used as summaries. Different filtering techniques are applied, yielding five different datasets. Furthermore, summarization models are implemented by warm-starting an encoder-decoder model with BERT checkpoints and fine-tuning it on the different datasets. The fine-tuned models are evaluated with ROUGE metrics and BERTScore. All models achieve significantly better results when evaluated on filtered test data than on unfiltered test data. Moreover, the model trained on the most heavily filtered, and hence smallest, dataset achieves the best results on the filtered test data. The trade-off between dataset size and quality and other methodological implications of the data characterization, the filtering and the model implementation are discussed, leading to suggestions for future research.
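A sketch of the kind of heuristic filtering described above: an article/preamble pair is kept only if it passes simple length, compression and overlap checks. The thresholds and checks are assumptions for illustration; the thesis derives its filters from a characterization of the actual news data.

```python
def token_overlap(summary: str, article: str) -> float:
    """Fraction of summary token types that also occur in the article."""
    s, a = set(summary.lower().split()), set(article.lower().split())
    return len(s & a) / max(len(s), 1)

def keep_pair(article: str, preamble: str,
              min_article_tokens=100, min_summary_tokens=10,
              min_overlap=0.3, max_compression=0.5) -> bool:
    n_art, n_sum = len(article.split()), len(preamble.split())
    if n_art < min_article_tokens or n_sum < min_summary_tokens:
        return False                       # too short to be useful
    if n_sum / n_art > max_compression:
        return False                       # preamble barely compresses the article
    if token_overlap(preamble, article) < min_overlap:
        return False                       # preamble seems unrelated to the body
    return True

pairs = [("word " * 300, "a short unrelated preamble"),
         ("word " * 300, "word word word word word word word word word word")]
filtered = [p for p in pairs if keep_pair(*p)]
print(len(filtered), "of", len(pairs), "pairs kept")
```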
208

Extractive Text Summarization of Greek News Articles Based on Sentence-Clusters

Kantzola, Evangelia January 2020 (has links)
This thesis introduces an extractive summarization system for Greek news articles based on sentence clustering. The main purpose is to evaluate the impact of three different types of text representation, Word2Vec embeddings, TF-IDF and LASER embeddings, on the summarization task. Based on these techniques, we build three different versions of the initial summarizer. Moreover, we create a new corpus of gold-standard summaries against which the system summaries are evaluated. The new collection of reference summaries is merged with part of the MultiLing Pilot 2011 corpus to constitute our main dataset. We perform both automatic and human evaluation. Our automatic ROUGE results suggest that System A, which employs averaged Word2Vec vectors to create sentence embeddings, outperforms the other two systems by yielding higher ROUGE-L F-scores. Contrary to our initial hypotheses, System C, which uses LASER embeddings, fails to surpass even the Word2Vec method, at times showing weak sentence representations. With regard to the scores obtained in the manual evaluation, we observe that System A (averaged Word2Vec vectors) and System C (LASER embeddings) tend to produce more coherent and adequate summaries than System B, which employs TF-IDF. Furthermore, the majority of system summaries are rated very high with respect to non-redundancy. Overall, System A, utilizing averaged Word2Vec embeddings, performs quite successfully according to both evaluations.
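A minimal sketch of the clustering-based extractive summarizer: sentences are embedded (here by averaging word vectors), clustered with k-means, and the sentence closest to each centroid is selected. The toy vectors and the choice of k are assumptions; the thesis compares Word2Vec, TF-IDF and LASER representations.

```python
import numpy as np
from sklearn.cluster import KMeans

def summarize(sentences, word_vectors, dim=100, k=3):
    def embed(sentence):
        vecs = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    X = np.vstack([embed(s) for s in sentences])
    k = min(k, len(sentences))
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    summary_idx = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
        summary_idx.append(members[int(np.argmin(dists))])  # sentence nearest the centroid
    return [sentences[i] for i in sorted(summary_idx)]

# Toy usage with random "word vectors" standing in for trained embeddings.
rng = np.random.default_rng(0)
toy_vectors = {w: rng.normal(size=100)
               for w in "the summarizer clusters sentences greek articles".split()}
print(summarize(["The summarizer clusters sentences.",
                 "Greek articles are the input.",
                 "The sentences closest to centroids form the summary."],
                toy_vectors))
```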
209

A Rule-Based Normalization System for Greek Noisy User-Generated Text

Toska, Marsida January 2020 (has links)
The ever-growing use of social media platforms generates vast amounts of textual data daily, which could potentially serve as a great source of information. Therefore, mining user-generated data for commercial, academic, or other purposes has already attracted the interest of the research community. However, the informal writing that often characterizes online user-generated texts poses a challenge for automatic text processing with Natural Language Processing (NLP) tools. To mitigate the effect of noise in these texts, lexical normalization has been proposed as a preprocessing method; in short, it is the task of converting non-standard word forms into a canonical form. The present work aims to contribute to this field by developing a rule-based normalization system for Greek tweets. We perform an analysis of the categories of out-of-vocabulary (OOV) word forms identified in the dataset and define hand-crafted rules, which we combine with edit distance (the Levenshtein distance approach) to tackle noise in the cases under scope. To evaluate the performance of the system we perform both an intrinsic and an extrinsic evaluation, the latter exploring the effect of normalization on part-of-speech tagging. The results of the intrinsic evaluation suggest that our system has an accuracy of approx. 95%, compared to approx. 81% for the baseline. In the extrinsic evaluation, a boost of approx. 8% in tagging performance is observed when the text has been preprocessed through lexical normalization.
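A hedged sketch of such a normalization pipeline: hand-written replacement rules are applied first, then OOV tokens fall back to the closest in-vocabulary word by Levenshtein distance. The rules, lexicon and distance threshold are toy assumptions, not the Greek resources used in the thesis.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

RULES = {"tha": "θα", "kai": "και"}          # toy Greeklish-to-Greek rewrites
LEXICON = {"καλημέρα", "και", "θα", "πάμε"}  # toy in-vocabulary word list

def normalize(token: str, max_dist: int = 2) -> str:
    if token in LEXICON:
        return token
    if token in RULES:
        return RULES[token]
    best = min(LEXICON, key=lambda w: levenshtein(token, w))
    return best if levenshtein(token, best) <= max_dist else token

print([normalize(t) for t in "tha pame καλημερα".split()])
```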
210

Computational Terminology : Exploring Bilingual and Monolingual Term Extraction

Foo, Jody January 2012 (has links)
Terminologies are becoming more important to modern society as technology and science continue to grow at an accelerating rate in a globalized environment. Agreeing upon which terms should be used to represent which concepts, and how those terms should be translated into different languages, is important if we wish to communicate with as little confusion and misunderstanding as possible. Since the 1990s, an increasing amount of terminology research has been devoted to facilitating and augmenting terminology-related tasks with computers and computational methods. One focus of this research is Automatic Term Extraction (ATE). In this compilation thesis, studies on both bilingual and monolingual ATE are presented. First, two publications report on how bilingual ATE using the align-extract approach can be used to extract patent terms. The result in this case was 181,000 manually validated English-Swedish patent terms, which were to be used in a machine translation system for patent documents. A critical component of the method is the Q-value metric, presented in the third paper, which can be used to rank extracted term candidates (TCs) in an order that correlates with TC precision. The use of Machine Learning (ML) in monolingual ATE is the topic of the two final contributions. The first ML-related publication shows that rule-induction-based ML can be used to generate linguistic term selection patterns, and in the second ML-related publication, contrastive n-gram language models are used in conjunction with SVM-based ML to improve the precision of term candidates selected using linguistic patterns.
