  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.

Question Classification in Question Answering Systems

Sundblad, Håkan January 2007 (has links)
Question answering systems can be seen as the next step in information retrieval, allowing users to pose questions in natural language and receive succinct answers. For a question answering system as a whole to be successful, research has shown that correctly classifying questions with regard to the expected answer type is imperative. Question classification has two components: a taxonomy of answer types, and machinery for making the classifications. This thesis focuses on five machine learning algorithms for the question classification task: k-nearest neighbours, naïve Bayes, decision tree learning, sparse network of winnows, and support vector machines. These algorithms have been applied to two different corpora, one of which has been used extensively in previous work and was constructed for a specific agenda. The other corpus is drawn from users' questions posed to a running online system. The results showed that the performance of the algorithms differs between the corpora, both in absolute terms and in the algorithms' relative ranking. On the novel corpus, naïve Bayes, decision tree learning, and support vector machines perform on par with each other, while on the biased corpus there is a clear difference between them, with support vector machines the best and naïve Bayes the worst. The thesis also presents an analysis of questions that are problematic for all learning algorithms. The errors can roughly be attributed to categories with few members, variations in question formulation, the actual usage of the taxonomy, keyword errors, and spelling errors. A large portion of the errors were also hard to explain. Report code: LiU-Tek-Lic-2007:29.
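As an illustrative sketch (not code from the thesis), one of the five algorithms, naïve Bayes, can be implemented for answer-type classification in a few lines. The tiny answer-type taxonomy and training questions below are invented for the example:

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """Train a multinomial naive Bayes classifier with add-one smoothing.
    `examples` is a list of (tokens, answer_type) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify_nb(tokens, model):
    """Pick the answer type with the highest log-posterior."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for label in class_counts:
        lp = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for t in tokens:
            lp += math.log((word_counts[label][t] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# Toy training set; the labels follow a tiny invented answer-type taxonomy.
train = [
    ("who wrote hamlet".split(), "PERSON"),
    ("who is the president".split(), "PERSON"),
    ("where is stockholm".split(), "LOCATION"),
    ("where was she born".split(), "LOCATION"),
    ("when did the war end".split(), "DATE"),
    ("when was the book written".split(), "DATE"),
]
model = train_nb(train)
```

With a richer taxonomy such as those in the thesis's corpora, the same training loop applies unchanged; only the label set grows.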

Investigating Gender Bias in Word Embeddings for Chinese

Jiao, Meichun January 2021 (has links)
Gender bias, a sociological issue, has attracted the attention of scholars working on natural language processing (NLP) in recent years. It has been confirmed that some NLP techniques, such as word embeddings, capture gender bias present in natural language. Here, we investigate gender bias in Chinese word embeddings. Gender bias tests originally designed for English are adapted and applied to Chinese word embeddings trained with three different embedding models. After verifying the validity of the adapted tests, changes in gender bias across several time periods are tracked and analysed. Our results validate the feasibility of bias test adaptation and confirm that word embeddings trained by a model with character-level information capture more gender bias in general. Moreover, we build a possible framework for diachronic research on gender bias.
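A common way to quantify such bias, in the spirit of the WEAT-style tests adapted in the thesis, is to compare a word's mean cosine similarity to two attribute sets. This hedged sketch uses invented 2-d vectors rather than real Chinese embeddings:

```python
def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def association(word_vec, male_attrs, female_attrs):
    """Mean cosine similarity to the male attribute set minus mean cosine
    similarity to the female set (WEAT-style s(w, A, B)).
    Positive => the word leans toward the male attributes."""
    m = sum(cosine(word_vec, a) for a in male_attrs) / len(male_attrs)
    f = sum(cosine(word_vec, b) for b in female_attrs) / len(female_attrs)
    return m - f

# Toy 2-d "embeddings", invented for illustration only.
emb = {
    "he": (1.0, 0.1), "man": (0.9, 0.2),
    "she": (0.1, 1.0), "woman": (0.2, 0.9),
    "engineer": (0.8, 0.3), "nurse": (0.3, 0.8),
}
male = [emb["he"], emb["man"]]
female = [emb["she"], emb["woman"]]
```

Tracking this score for the same word across embeddings trained on different time periods gives the kind of diachronic curve the thesis analyses.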

Unveiling the Swedish philosophical landscape : A topic model study of the articles of a Swedish philosophical journal from 1980-2020

Lindqvist, Björn January 2023 (has links)
Bibliometric research is an important tool for examining the scientific output of various fields of study. Such research makes it possible to see how the influence of different people, ideologies, and discoveries has shaped scientific discourse. One way of doing this is topic modelling, which organizes the words used within a set of text data into different topics. To the author's knowledge, no topic modelling study of Swedish philosophy had previously been conducted. This study therefore aimed to partially fill that gap by exploring the publications of one specific Swedish philosophical journal. Using Python, a topic model with 14 topics was created from the journal Filosofisk tidskrift, and the change in these topics between 1980 and 2020 was examined. Specific attention was given to possible differences between analytic and Continental philosophy. To validate the results, an interview was also held with Fredrik Stjernberg, professor of theoretical philosophy. The results displayed varied popularity and change for each topic. Too little Continental philosophy was found for a proper comparison, leading to the conclusion that Continental philosophy is not very influential in Swedish philosophical discourse. Future research should be conducted on peer-reviewed articles and draw on greater professional philosophical expertise.
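Once a topic model has assigned each article a topic distribution, tracking topics over time reduces to averaging weights per period. A minimal sketch follows; the document-topic weights and years below are invented, not drawn from Filosofisk tidskrift:

```python
from collections import defaultdict

def topic_trends(doc_topics, doc_years, period=10):
    """Average each topic's weight per `period`-year bucket.
    `doc_topics`: list of dicts {topic_id: weight}; `doc_years`: list of ints."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for topics, year in zip(doc_topics, doc_years):
        bucket = (year // period) * period
        counts[bucket] += 1
        for t, w in topics.items():
            sums[bucket][t] += w
    return {b: {t: w / counts[b] for t, w in ts.items()}
            for b, ts in sums.items()}

# Toy document-topic distributions, as an LDA-style model might output them.
docs = [{0: 0.8, 1: 0.2}, {0: 0.6, 1: 0.4}, {0: 0.1, 1: 0.9}]
years = [1981, 1985, 2015]
trends = topic_trends(docs, years)
```

Plotting each topic's bucket averages over 1980-2020 gives the popularity curves the study examines.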

An End-to-End Native Language Identification Model without the Need for Manual Annotation / En modersmålsidentifiering modell utan behov av manuell annotering

Buzaitė, Viktorija January 2022 (has links)
Native language identification (NLI) is a classification task that identifies the mother tongue of a language learner based on spoken or written material. The task gained popularity when it was featured at the BEA-12 workshop in 2017, and since then many applications have been found for NLI, ranging from language learning to authorship identification and forensic science. While a considerable amount of research has already been done in this area, we introduce a novel approach that incorporates syntactic information into a BERT-based NLI model. In addition, we train separate models to test whether erroneous input sequences perform better than corrected sequences. To answer these questions we carry out both a quantitative and a qualitative analysis. We also test the idea of implementing a BERT-based grammatical error correction (GEC) model to supply more training data to the NLI model without the need for manual annotation. Our results suggest that our models do not outperform the SVM baseline, which we attribute to the limited amount of training data in our dataset: transformer-based architectures like BERT need large amounts of data to be fine-tuned successfully, whereas simple linear models like SVMs perform well on small amounts of data. We also find that erroneous structures in the data prove useful when combined with syntactic information, but that neither boosts the performance of the NLI model on its own. Furthermore, our GEC system performs well enough to produce additional data for the NLI models, whose scores increase once the data from our second experiment is added. We believe that our proposed architecture is potentially suitable for the NLI task if extended as suggested in the conclusion.
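The strength of linear baselines on small NLI datasets typically comes from simple surface features. As a hedged sketch of the kind of features such a baseline might use (the thesis's exact baseline feature set is not specified here), character trigram counts can be extracted as follows:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-gram counts with boundary padding, a common feature
    type for linear NLI classifiers such as SVMs."""
    padded = f"^{text}$"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

feats = char_ngrams("the the", 3)
```

Feeding such sparse count vectors to a linear classifier needs no manual annotation beyond the native-language labels themselves.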

Readability Assessment with Pre-Trained Transformer Models : An Investigation with Neural Linguistic Features

Ma, Chuchu January 2022 (has links)
Readability assessment (RA) assigns a score or a grade to a given document, measuring how difficult the document is to read. RA originated in language education studies, where it was used to classify reading materials for language learners, and was later applied in many other settings, such as aiding automatic text simplification. This thesis is aimed at improving the way Transformers are used for RA. The motivation is the “pipeline” effect (Tenney et al., 2019) of pretrained Transformers: lexical, syntactic, and semantic features are best encoded by different layers of a Transformer model. After a preliminary test of a basic RA model resembling previous work, we propose several methods to enhance performance: using a Transformer layer other than the last, concatenating or mixing the outputs of all layers, and using syntax-augmented Transformer layers. We examine these methods on three datasets: WeeBit, OneStopEnglish, and CommonLit. The improvements show a clear correlation with dataset characteristics. On the OneStopEnglish and CommonLit datasets, we achieve absolute improvements of 1.2% in F1 score and 0.6% in Pearson's correlation coefficient, respectively. We also show that an n-gram frequency-based baseline, which is simple but was not reported in previous work, has superior performance on the classification datasets (WeeBit and OneStopEnglish), prompting further research on vocabulary-based lexical features for RA.
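One of the proposed enhancements, mixing the outputs of all layers, can be illustrated with an ELMo-style scalar mix over per-layer vectors. The sketch below uses invented 2-d "hidden states" and is not the thesis's implementation:

```python
import math

def scalar_mix(layer_outputs, weights, gamma=1.0):
    """ELMo-style scalar mix: softmax-normalise one learnable scalar per
    layer, then return the weighted sum of the layer vectors."""
    exps = [math.exp(w) for w in weights]
    z = sum(exps)
    norm = [e / z for e in exps]
    dim = len(layer_outputs[0])
    return [gamma * sum(norm[l] * layer_outputs[l][d]
                        for l in range(len(layer_outputs)))
            for d in range(dim)]

# Toy "hidden states" from a 3-layer encoder (2-d for illustration).
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = scalar_mix(layers, weights=[0.0, 0.0, 0.0])
```

During fine-tuning, the weights would be trained jointly with the RA head, letting the model emphasise whichever layers encode readability-relevant features.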

Document Expansion for Swedish Information Retrieval Systems / Dokumentexpansion för svenska informationssökningssystem

Hagström, Tobias January 2023 (has links)
Information retrieval systems have come to change how users interact with computerized systems and locate information. A major challenge when designing these systems is handling the vocabulary mismatch problem, i.e. that users, when formulating queries, pick different words than those present in the relevant documents that should be retrieved. With recent advances in artificial intelligence and the emergence of transformer-based language models, new methods have been proposed to alleviate this problem. One such method is the use of document expansion models, which append to each document words that are likely to appear in users' queries. As previous research on document expansion models has focused on English-language applications, this thesis investigates the effectiveness of one such model for Swedish. Although no improvement was found when using the method, the result is likely a consequence of dataset quality and domain rather than of the method itself.
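The vocabulary mismatch problem, and how appended expansion terms alleviate it, can be shown with a toy inverted index. The Swedish example document and the hand-picked expansion terms below merely stand in for a doc2query-style model's predictions:

```python
def build_index(docs):
    """Map each term to the set of doc ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index

def search(index, query):
    """Return ids of docs matching any query term."""
    hits = set()
    for term in query.lower().split():
        hits |= index.get(term, set())
    return hits

# A toy collection where the query wording differs from the document wording.
docs = {1: "tandvård för barn"}  # "dental care for children"
plain = build_index(docs)

# Document expansion appends likely query terms; here they are picked by
# hand to stand in for a generative expansion model's output.
expanded = build_index({1: docs[1] + " tandläkare barnen"})
```

A query for "tandläkare" ("dentist") misses the plain document but matches the expanded one, which is exactly the gap document expansion targets.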

Evaluating and Fine-Tuning a Few-Shot Model for Transcription of Historical Ciphers

Eliasson, Ingrid January 2023 (has links)
Thousands of historical ciphers, encrypted manuscripts, are stored in archives across Europe. Historical cryptology is the research field that studies these manuscripts, combining the interests of the humanities with methods from cryptography and computational linguistics. Before a cipher can be decrypted by automatic means, it must first be transcribed into machine-readable digital text. Image processing techniques and deep learning have made it possible to transcribe handwritten text automatically, but the task is challenging when ciphers constitute the target data. The main reason is a lack of labeled data, caused by the heterogeneity of handwriting and the tendency of ciphers to employ unique symbol sets. Few-shot learning is a machine learning framework that reduces the need for labeled data by using pretrained models in combination with support sets containing a few labeled examples from the target data set. This project evaluates a few-shot model on the task of transcribing historical ciphers. The model is tested on pages from three in-domain ciphers that vary in handwriting style and symbol set. The project also investigates further fine-tuning the model on a limited number of labeled symbol examples from the respective target ciphers. We find that the performance of the model depends on the handwriting style of the target document, and that certain model parameters should be explored individually for each data set. We further show that fine-tuning the model is indeed effective, lowering the Symbol Error Rate (SER) by up to 27.6 percentage points.
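The reported metric, Symbol Error Rate, is edit distance normalised by the reference length. A minimal sketch using the standard dynamic-programming Levenshtein distance (not the project's own evaluation code):

```python
def levenshtein(ref, hyp):
    """Edit distance between two symbol sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def symbol_error_rate(ref, hyp):
    """SER: edit distance normalised by the reference length."""
    return levenshtein(ref, hyp) / len(ref)
```

Because cipher symbols need not be single characters, the functions accept lists of arbitrary symbol labels as well as strings.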

Disambiguating Italian homographic heterophones with SoundChoice and testing ChatGPT as a data-generating tool

Nanni, Matilde January 2023 (has links)
Text-to-speech systems are challenged by the presence of homographs, words that have more than one possible pronunciation. Rule-based approaches are often still the preferred solution to this issue in industry, but there have been multiple attempts to solve the ‘homograph issue’ with statistical, neural, and hybrid techniques, mostly for English. Ploujnikov and Ravanelli (2022) proposed a neural grapheme-to-phoneme framework, SoundChoice, which comes in an RNN version and a transformer version and can be fine-tuned for homograph disambiguation thanks to a weighted homograph loss. This thesis trains and tests the framework on Italian instead of English, to see how it performs on a different language. Moreover, since the available data containing homographs was insufficient for this task, the thesis experiments with ChatGPT as a data-generating tool. SoundChoice was also evaluated out of domain by testing it on corpus data. The results showed that the RNN model reached 71% accuracy from a baseline of 59%; the transformer model performed better, going from 57% to 74%. Further analysis would be needed to draw more solid conclusions about the origin of this gap, and the models should be trained on corpus data and tested on ChatGPT data to assess whether ChatGPT-generated data is indeed a suitable replacement for corpus data.
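The weighted homograph loss that enables fine-tuning can be sketched as a cross-entropy in which homograph positions receive extra weight. The toy distributions and the weight value below are invented, not SoundChoice's actual settings:

```python
import math

def weighted_ce(probs, targets, homograph_mask, w_homograph=2.0):
    """Token-level cross-entropy where homograph positions get extra
    weight, pushing the model harder toward the right pronunciation."""
    total, norm = 0.0, 0.0
    for p, t, is_h in zip(probs, targets, homograph_mask):
        w = w_homograph if is_h else 1.0
        total += -w * math.log(p[t])
        norm += w
    return total / norm

# Toy per-token distributions over two candidate pronunciations;
# the second token is the homograph.
probs = [(0.9, 0.1), (0.5, 0.5)]
loss = weighted_ce(probs, targets=[0, 1], homograph_mask=[False, True])
```

Because the uncertain homograph token is upweighted, the weighted loss exceeds its unweighted counterpart, concentrating the training signal where disambiguation matters.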

Domain-specific knowledge graph construction from Swedish and English news articles

Krupinska, Aleksandra January 2023 (has links)
In the current age, with new textual information emerging constantly, there is a challenge in processing and structuring it. Moreover, the information is often expressed in many different languages, but the discourse tends to be dominated by English, which may lead to overlooking important, specific knowledge in less well-resourced languages. Knowledge graphs have been proposed as a way of structuring unstructured data, making it machine-readable and available for further processing. Researchers have also emphasized the potential bilateral benefits of combining knowledge in low- and well-resourced languages. In this thesis, I combine the two goals of structuring textual data with the help of knowledge graphs and including multilingual information, in an effort to achieve a more accurate knowledge representation. The purpose of the project is to investigate whether the information about three Swedish companies known worldwide - H&M, Spotify, and Ikea - is the same in Swedish and English data sources, and how combining the two sources can be beneficial. Following a natural language processing (NLP) pipeline consisting of tasks such as coreference resolution, entity linking, and relation extraction, a knowledge graph is constructed from Swedish and English news articles about the companies, and refinement techniques are applied to improve the graph. The constructed knowledge graph is analyzed with respect to the overlap of extracted entities and the complementarity of information. Different variants of the graph are further evaluated by human raters, and a number of queries illustrate the capabilities of the constructed knowledge graph. The evaluation shows that the topics covered in the two information sources differ substantially: only a small number of entities occur in both languages. Combining the two sources can therefore contribute to a richer and more connected knowledge graph. The adopted refinement techniques increase the connectedness of the graph. Human evaluators consistently chose the Swedish side of the data as more relevant for the questions considered, which underscores the importance of not limiting research to the more easily available and processed English data.
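The entity-overlap analysis reduces to set operations once entities from both sources are linked to a shared inventory. A hedged sketch with invented triples (the entity names are illustrative, not extracted from the actual articles):

```python
def entity_overlap(triples_sv, triples_en):
    """Compare the entity sets of two knowledge graphs built from
    Swedish and English sources; triples are (head, relation, tail)."""
    ents_sv = {e for h, _, t in triples_sv for e in (h, t)}
    ents_en = {e for h, _, t in triples_en for e in (h, t)}
    shared = ents_sv & ents_en
    return {
        "shared": shared,
        "only_sv": ents_sv - ents_en,
        "only_en": ents_en - ents_sv,
        "jaccard": len(shared) / len(ents_sv | ents_en),
    }

# Toy triples; entity names are assumed already linked to a common
# inventory (e.g. Wikidata ids) so that they are comparable.
sv = [("Ikea", "founded_in", "Älmhult"), ("Ikea", "founder", "Ingvar_Kamprad")]
en = [("Ikea", "industry", "Retail"), ("Spotify", "hq", "Stockholm")]
report = entity_overlap(sv, en)
```

A low Jaccard score with large language-specific remainders is exactly the pattern the thesis reports: the two sources complement rather than duplicate each other.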

Genre classification using syntactic features

Brigadoi, Ivan January 2021 (has links)
This thesis work addresses text classification in relation to genre identification using different feature sets, with a focus on syntax-based features. We built our models with traditional machine learning algorithms, i.e. Naive Bayes, k-nearest neighbours, support vector machines, and random forests, in order to predict the literary genre of books. We trained our models on bag-of-words (BOW) features, bigrams, syntax-based bigrams, and emotional features, as well as combinations of these. On the test set, the best feature set, BOW combined with bigrams based on syntactic relations between words, improved F1-score by 2% over the BOW baseline, indicating a positive impact of syntactic information in the task of text classification.
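The difference between linear bigrams and bigrams based on syntactic relations can be seen on a single parsed sentence. The toy dependency parse below is hand-assigned for illustration:

```python
def linear_bigrams(tokens):
    """Adjacent token pairs in surface order."""
    return [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]

def syntactic_bigrams(tokens, heads):
    """Head-dependent pairs from a dependency parse; `heads[i]` is the
    1-based index of token i+1's head (0 = root), as in CoNLL format."""
    return [(tokens[h - 1], tokens[i]) for i, h in enumerate(heads) if h != 0]

# "old men like dogs": 'old' modifies 'men'; 'men' and 'dogs' attach to 'like'.
tokens = ["old", "men", "like", "dogs"]
heads = [2, 3, 0, 3]
```

Note that the surface pair ("men", "like") becomes the head-dependent pair ("like", "men") syntactically, so the two feature sets capture genuinely different word relations.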
