121 |
Neural Network Based Automatic Essay Scoring for Swedish / Neurala nätverk för automatisk bedömning av uppsatser i nationella prov i svenska
Ruan, Rex Dajun January 2020
This master's thesis presents a novel method of automatic essay scoring for Swedish national tests written by upper secondary school students, deploying neural network architectures and linguistic feature extraction in the framework of Swegram. Four kinds of linguistic features are extracted: count-based, lexical, morphological, and syntactic. One of three recurrent network variants (vanilla RNN, GRU, or LSTM), together with a specific model parameter setting, is used in the Automatic Essay Scoring (AES) model, with the extracted features, which measure linguistic complexity, serving as the text representation. The AES model is evaluated through inter-rater agreement with the human-assigned grade as the target label, in terms of quadratic weighted kappa (QWK) and exact percent agreement. Across all the models we tested, the best observed averaged QWK and averaged exact percent agreement over 10 folds are 0.50 and 52%, respectively.
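The two agreement measures named in this abstract can be sketched in a few lines. This is a minimal illustration, not the thesis's own evaluation code: it assumes grades are encoded as integers 0..num_grades-1, and the function names are hypothetical.

```python
from collections import Counter


def quadratic_weighted_kappa(human, predicted, num_grades):
    """QWK between two equal-length lists of integer grades in range(num_grades)."""
    if len(human) != len(predicted) or not human:
        raise ValueError("grade lists must be non-empty and of equal length")
    if num_grades < 2:
        raise ValueError("need at least two grade levels")
    n = len(human)
    observed = Counter(zip(human, predicted))  # joint counts of (human, predicted)
    hist_human = Counter(human)                # marginal counts per rater
    hist_pred = Counter(predicted)
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_grades):
        for j in range(num_grades):
            # Quadratic penalty: grows with the squared distance between grades.
            weight = (i - j) ** 2 / (num_grades - 1) ** 2
            weighted_observed += weight * observed[(i, j)]
            # Expected count if the two raters were independent.
            weighted_expected += weight * hist_human[i] * hist_pred[j] / n
    if weighted_expected == 0:
        return 1.0  # both raters gave one identical grade to every essay
    return 1.0 - weighted_observed / weighted_expected


def exact_agreement(human, predicted):
    """Fraction of essays where the predicted grade equals the human grade."""
    return sum(h == p for h, p in zip(human, predicted)) / len(human)
```

Unlike exact agreement, QWK penalises a prediction two grades off more heavily than one that is one grade off, which is why the two figures are usually reported side by side.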
|
122 |
Multilingual Dependency Parsing of Uralic Languages : Parsing with zero-shot transfer and cross-lingual models using geographically proximate, genealogically related, and syntactically similar transfer languages
Erenmalm, Elsa January 2020
One way to improve dependency parsing scores for low-resource languages is to make use of existing resources from other closely related or otherwise similar languages. In this paper, we look at eleven Uralic target languages (Estonian, Finnish, Hungarian, Karelian, Livvi, Komi Zyrian, Komi Permyak, Moksha, Erzya, North Sámi, and Skolt Sámi) with treebanks of varying sizes and select transfer languages based on geographical, genealogical, and syntactic distances. We focus primarily on the performance of parser models trained on various combinations of geographically proximate and genealogically related transfer languages, in target-trained, zero-shot, and cross-lingual configurations. We find that models trained on combinations of geographically proximate and genealogically related transfer languages reach the highest LAS in most zero-shot models, while our highest-performing cross-lingual models were trained on genealogically related languages. We also find that cross-lingual models outperform zero-shot transfer models. We then select syntactically similar transfer languages for three target languages, and find a slight improvement in the case of Hungarian. We discuss the results and conclude with suggestions for possible future work.
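For readers unfamiliar with LAS (labeled attachment score), the metric used to compare the parsers above, here is a minimal sketch. It assumes each token is represented as a hypothetical `(head_index, relation)` pair and that gold and predicted analyses share the same tokenization; the thesis's actual evaluation script is not described in the abstract.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) for one or more sentences.

    gold, predicted: equal-length lists of (head_index, relation) pairs,
    one pair per token, in the same token order.
    """
    if len(gold) != len(predicted) or not gold:
        raise ValueError("analyses must be non-empty and of equal length")
    total = len(gold)
    # UAS: the head is correct, regardless of the relation label.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS: both the head and the relation label are correct.
    labeled_hits = sum(g == p for g, p in zip(gold, predicted))
    return head_hits / total, labeled_hits / total


# Toy three-token sentence; the last token has the right head but the wrong label.
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod")]
uas, las = attachment_scores(gold, pred)
```

LAS can never exceed UAS, since a token only counts toward LAS if its head is already correct.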
|
123 |
Extracting Text into Meta-Data : Improving machine text-understanding of news-media articles / Extrahera Meta-Data från texter : Förbättra förståelsen för nyheter med hjälp av maskininlärning
Lindén, Johannes January 2021
Society is constantly in need of information. It is important to consume event-based information about what is happening around us as well as facts and knowledge. As society grows, the amount of information to consume grows with it. This thesis demonstrates one way to extract and represent knowledge from text in a machine-readable way for news media articles. Three objectives are considered when developing a machine learning system to retrieve categories, entities, relations and other meta-data from text paragraphs. The first is to sort the terminology by topic; this makes it easier for machine learning algorithms to understand the text and the unique words used. The second objective is to construct a service for use in production, where scalability and performance are evaluated. Features are implemented to iteratively improve the model predictions, and several versions are run at the same time to, for example, compare them in an A/B test. The third objective is to further extract the gist of what is expressed in the text. The gist is extracted in the form of triples by connecting two related entities using a combination of natural language processing algorithms. The research presents a comparison between five different auto-categorization algorithms, and an evaluation of their hyperparameters and how they would perform under the pressure of thousands of large, concurrent predictions. The aim is to build an auto-categorization system that can be used in the news media industry to help writers and journalists focus more on the story rather than filling in meta-data for each article. The best-performing algorithm is a bidirectional Long Short-Term Memory (LSTM) neural network. Three different information extraction algorithms for extracting the gist of paragraphs are also compared.
The proposed information extraction algorithm supports extracting information from texts in multiple languages with competitive accuracy compared with the state-of-the-art OpenIE and MinIE algorithms, which extract information in a single language. The multilingual models help local news media write articles in different languages, as an aid to integrating immigrants into society. / At the time of the public defence the following papers were unpublished: paper 4 submitted.
|
124 |
Smoothening of Software documentation : comparing a self-made sequence to sequence model to a pre-trained model GPT-2 / Utjämning av mjukvarudokumentation
Tao, Joakim; Thimrén, David January 2021
This thesis was done in collaboration with Ericsson AB with the goal of researching the possibility of creating a machine learning model that can transfer the style of a text into another arbitrary style depending on the data used. The purpose was to make their technical documentation appear to have been written in one cohesive style for a better reading experience. Two approaches to this task were tested: the first was to implement an encoder-decoder model from scratch, and the second was to fine-tune the pre-trained GPT-2 model, created by a team from OpenAI, on the specific task. Both models were trained on data provided by Ericsson, consisting of sentences extracted from their documentation. To evaluate the models, training loss, test sentences, and BLEU scores were used, and these were compared with each other and with other state-of-the-art models. The models did not succeed in transforming text into a general technical documentation style, but a good understanding of what would need to be adjusted to improve the results was obtained. / This thesis was presented on June 22, 2021; the presentation was held online on Microsoft Teams.
|
125 |
Automated Essay Scoring for English Using Different Neural Network Models for Text Classification
Deng, Xindi January 2021
Writing skills are an essential evaluation criterion for a student’s creativity, knowledge, and intellect. Consequently, academic writing is a common part of university and college admissions applications, standardized tests, and classroom assessments. However, essay scoring is a daunting task for teachers. Automated Essay Scoring may therefore be a helpful tool in the teacher's decision-making. There have been many successful models with supervised or unsupervised machine learning algorithms in the field of Automated Essay Scoring. This thesis makes a comparative study of various neural network models with supervised machine learning algorithms and different linguistic feature combinations. It also shows that the same linguistic features are applicable to more than one language. The models studied in this experiment include TextCNN, TextRNN_LSTM, TextRNN_GRU, and TextRCNN, trained on the essays from the Automated Student Assessment Prize (ASAP) Kaggle competition. Each essay is represented with linguistic features measuring linguistic complexity. Those features are divided into four groups: count-based, morphological, syntactic, and lexical features, and the four groups of features can form a total of 14 combinations. The models are evaluated via three measurements: Accuracy, F1 score, and Quadratic Weighted Kappa. The experimental results show that models trained only with count-based features outperform the models trained using other feature combinations. In addition, TextRNN_LSTM performs best, with an accuracy of 54.79%, an F1 score of 0.55, and a Quadratic Weighted Kappa of 0.59, which beats the statistically-based baseline models.
|
126 |
Targeted Topic Modeling for Levantine Arabic
Zahra, Shorouq January 2020
Topic models for focused analysis aim to capture topics within the limiting scope of a targeted aspect (which could be thought of as some inner topic within a certain domain). To serve their analytic purposes, topics are expected to be semantically coherent and closely aligned with human intuition. This in itself poses a major challenge for the more common topic modeling algorithms which, in a broader sense, perform a full analysis that covers all aspects and themes within a collection of texts. The paper attempts to construct a viable focused-analysis topic model which learns topics from Twitter data written in a closely related group of non-standardized varieties of Arabic widely spoken in the Levant region (i.e., Levantine Arabic). Results are compared to a baseline model as well as another targeted topic model designed precisely to serve the purpose of focused analysis. Judged overall, the model adequately captures topics containing terms which fall within the scope of the targeted aspect. Nevertheless, it fails to produce human-friendly and semantically coherent topics: several topics contained a number of intruding terms, while others contained terms that, although still relevant to the targeted aspect, were thrown together seemingly at random.
|
127 |
Constructiveness-Based Product Review Classification
Loobuyck, Ugo January 2020
Promoting constructiveness in online comment sections is an essential step to make the internet a more productive place. On online marketplaces, customers often have the opportunity to voice their opinion and relate their experience with a given product. In this thesis, we investigate the possibility of modeling constructiveness in product reviews in order to promote the most informative and argumentative customer feedback. We develop a new 4-class constructiveness taxonomy based on heuristics and specific categorical criteria. We use this taxonomy to annotate 4000 Amazon customer reviews as our training set, referred to as the Corpus for Review Constructiveness (CRC). In addition to the 4-class constructiveness tag, we include a binary tag to compare modeling performance with previous work. We train and test several computational models such as Bidirectional Encoder Representations from Transformers (BERT), a Stacked Bidirectional LSTM and a Gradient Boosting Machine. We demonstrate our annotation scheme’s reliability with a set of inter-annotator agreement experiments, and show that good levels of performance can be reached in both the multiclass setting (0.69 F1 and 57% error reduction over the baseline) and the binary setting (0.85 F1 and 71% error reduction). Different features are evaluated individually and in combination. Moreover, we compare the advantages, downsides and performance of both feature-based and neural network models. Finally, these models trained on CRC are tested on out-of-domain data (news article comments) and shown to be nearly as proficient as on in-domain data. This work allows the extension of constructiveness modeling to a new type of data and provides a new non-binary taxonomy for data labeling.
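The "error reduction over the baseline" figures quoted above are relative quantities. The abstract does not spell out the formula, so the sketch below assumes the common definition in which error is taken as 1 minus the score; the function name and the example scores are hypothetical and are not taken from the thesis.

```python
def relative_error_reduction(baseline_score, model_score):
    """Share of the baseline's error (1 - score) that the model removes."""
    baseline_error = 1.0 - baseline_score
    if baseline_error <= 0:
        raise ValueError("baseline is already perfect; reduction is undefined")
    model_error = 1.0 - model_score
    return (baseline_error - model_error) / baseline_error


# Hypothetical scores: a baseline at 0.50 F1 and a model at 0.75 F1
# means the model removes half of the baseline's error.
reduction = relative_error_reduction(0.50, 0.75)
```

Reporting the reduction alongside the raw F1 makes results comparable across settings whose baselines differ in difficulty, such as the multiclass and binary settings here.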
|
128 |
Exploring Cross-lingual Sublanguage Classification with Multi-lingual Word Embeddings
Shih, Min-Chun January 2020
Cross-lingual text classification is an important task due to globalization and the increased availability of multilingual data. This thesis explores a method for cross-lingual classification on Swedish and English medical corpora. Specifically, this thesis explores a simple convolutional neural network (CNN) with MUSE pre-trained word embeddings for binary classification of sublanguages (“lay” and “specialized”), transferring from Swedish healthcare texts to English healthcare texts. MUSE is a library that provides state-of-the-art multilingual word embeddings and large-scale high-quality bilingual dictionaries. The thesis presents experiments with imbalanced and balanced class distributions in the training and test data to examine the effect of class distribution, and also examines the influence of clean versus noisy test datasets. The results show that training data with a balanced class distribution performs significantly better than training data with an imbalanced class distribution, and that clean test data benefits the transfer of labels from one language to another. The thesis also compares the performance of the simple convolutional neural network model with a Naive Bayes baseline. Results show that on this task a simple Naive Bayes classifier based on bag-of-words translated using the MUSE English-Swedish dictionary outperforms a simple CNN model based on MUSE pre-trained word embeddings in several experimental settings.
|
129 |
Cross-lingual and Multilingual Automatic Speech Recognition for Scandinavian Languages
Černiavski, Rafal January 2022
Research into Automatic Speech Recognition (ASR), the task of transforming speech into text, remains highly relevant due to its countless applications in industry and academia. State-of-the-art ASR models are able to produce nearly perfect transcriptions, sometimes referred to as human-like; however, accurate ASR models are most often available only in high-resource languages. Furthermore, the vast majority of ASR models are monolingual, that is, only able to handle one language at a time. In this thesis, we extensively evaluate the quality of existing monolingual ASR models for Swedish, Danish, and Norwegian. In addition, we search for parallels between monolingual ASR models and the cognition of foreign languages in native speakers of these languages. Lastly, we extend the Swedish monolingual model to handle all three languages. The research conducted in this thesis project is divided into two main sections, namely monolingual and multilingual models. In the former, we analyse and compare the performance of monolingual ASR models for Scandinavian languages in monolingual and cross-lingual settings. We compare these results against the levels of mutual intelligibility of Scandinavian languages in native speakers of Swedish, Danish, and Norwegian to see whether the monolingual models favour the same languages as native speakers. We also examine the performance of the monolingual models on the regional dialects of all three languages and perform qualitative analysis of the most common errors. As for multilingual models, we expand the most accurate monolingual ASR model to handle all three languages. To do so, we explore the most suitable settings via trial models. In addition, we propose an extension to the well-established Wav2Vec 2.0-CTC architecture by incorporating a language classification component. The extension enables the usage of language models, thus boosting the overall performance of the multilingual models.
The results reported in this thesis suggest that in a cross-lingual setting, monolingual ASR models for Scandinavian languages perform better on the languages that are easier to comprehend for native speakers. Furthermore, the addition of a statistical language model boosts the performance of ASR models in monolingual, cross-lingual, and multilingual settings. ASR models appear to favour certain regional dialects, though the gap narrows in a multilingual setting. Contrary to our expectations, our multilingual model performs comparably with the monolingual Swedish ASR models and outperforms the Danish and Norwegian models. The multilingual architecture proposed in this thesis project is fairly simple yet effective. With greater computational resources at hand, further extensions offered in the conclusions might improve the models further.
|
130 |
Evaluating Transcription of Ciphers with Few-Shot Learning
Milioni, Nikolina January 2022
Ciphers are encrypted documents created to hide their content from those who were not the intended receivers of the message. Different types of symbols, such as zodiac signs, alchemical symbols, alphabet letters or digits, are used to compose the encrypted text, which needs to be decrypted to gain access to the content of the documents. The first step before decryption is the transcription of the cipher. The purpose of this thesis is to evaluate an automatic tool that transcribes cipher images into a text format. We implement a supervised few-shot deep-learning model, test it on different types of encrypted documents, and use various evaluation metrics to assess the results. We show that the few-shot model presents promising results on seen data, with Symbol Error Rates (SER) ranging from 8.21% to 47.55% and accuracy scores from 80.13% to 90.27%, whereas SER on out-of-domain datasets reaches 79.91%. While a wide range of symbols are correctly transcribed, the erroneous symbols mainly contain diacritics or are punctuation marks.
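The abstract does not define Symbol Error Rate. The sketch below assumes the usual definition: the Levenshtein (edit) distance between the reference and predicted symbol sequences, divided by the reference length. The function names and the toy symbols are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    # previous[j] holds the distance between the reference prefix seen so far
    # and the first j symbols of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_symbol in enumerate(reference, start=1):
        current = [i]
        for j, hyp_symbol in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_symbol == hyp_symbol else 1
            current.append(min(
                previous[j] + 1,                      # reference symbol missing from hypothesis
                current[j - 1] + 1,                   # extra symbol in hypothesis
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def symbol_error_rate(reference, hypothesis):
    """Edit distance normalised by the number of reference symbols."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Toy transcription in which one of four symbols was dropped.
reference = ["sun", "moon", "3", "a"]
hypothesis = ["sun", "3", "a"]
ser = symbol_error_rate(reference, hypothesis)
```

Under this definition SER can exceed 100% when the hypothesis contains many spurious symbols, so it is not simply the complement of accuracy.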
|