  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.

Question Classification in Question Answering Systems

Sundblad, Håkan January 2007 (has links)
Question answering systems can be seen as the next step in information retrieval, allowing users to pose questions in natural language and receive succinct answers. For a question answering system as a whole to be successful, research has shown that correctly classifying questions with regard to the expected answer type is imperative. Question classification has two components: a taxonomy of answer types, and machinery for making the classifications. This thesis focuses on five machine learning algorithms for the question classification task: k-nearest neighbours, naïve Bayes, decision tree learning, sparse network of winnows, and support vector machines. These algorithms have been applied to two different corpora, one of which has been used extensively in previous work and was constructed for a specific agenda. The other corpus is drawn from users' questions posed to a running online system. The results showed that the performance of the algorithms differs between the corpora, both in absolute terms and in the algorithms' relative ranking. On the novel corpus, naïve Bayes, decision tree learning, and support vector machines perform on par with each other, while on the biased corpus there is a clear difference between them, with support vector machines the best and naïve Bayes the worst. The thesis also presents an analysis of questions that are problematic for all learning algorithms. The errors can roughly be attributed to categories with few members, variations in question formulation, the actual usage of the taxonomy, keyword errors, and spelling errors. A large portion of the errors were also hard to explain. Report code: LiU-Tek-Lic-2007:29.
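As an illustrative sketch (not code from the thesis), one of the five algorithms, naïve Bayes, can be implemented for answer-type classification in a few lines. The tiny answer-type taxonomy and training questions below are invented for the example:

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """Train a multinomial naive Bayes classifier with add-one smoothing.
    `examples` is a list of (tokens, answer_type) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify_nb(tokens, model):
    """Pick the answer type with the highest log-posterior."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for label in class_counts:
        lp = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for t in tokens:
            lp += math.log((word_counts[label][t] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# Toy training set; the labels follow a tiny invented answer-type taxonomy.
train = [
    ("who wrote hamlet".split(), "PERSON"),
    ("who is the president".split(), "PERSON"),
    ("where is stockholm".split(), "LOCATION"),
    ("where was she born".split(), "LOCATION"),
    ("when did the war end".split(), "DATE"),
    ("when was the book written".split(), "DATE"),
]
model = train_nb(train)
```

With a richer taxonomy such as those in the thesis's corpora, the same training loop applies unchanged; only the label set grows.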

Investigating Gender Bias in Word Embeddings for Chinese

Jiao, Meichun January 2021 (has links)
Gender bias, a sociological issue, has attracted the attention of scholars working on natural language processing (NLP) in recent years. It has been confirmed that some NLP techniques, such as word embeddings, capture gender bias present in natural language. Here, we investigate gender bias in Chinese word embeddings. Gender bias tests originally designed for English are adapted and applied to Chinese word embeddings trained with three different embedding models. After verifying the validity of the adapted tests, changes in gender bias across several time periods are tracked and analysed. Our results validate the feasibility of bias test adaptation and confirm that word embeddings trained by a model with character-level information capture more gender bias in general. Moreover, we build a possible framework for diachronic research on gender bias.
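A common way to quantify such bias, in the spirit of the WEAT-style tests adapted in the thesis, is to compare a word's mean cosine similarity to two attribute sets. This hedged sketch uses invented 2-d vectors rather than real Chinese embeddings:

```python
def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def association(word_vec, male_attrs, female_attrs):
    """Mean cosine similarity to the male attribute set minus mean cosine
    similarity to the female set (WEAT-style s(w, A, B)).
    Positive => the word leans toward the male attributes."""
    m = sum(cosine(word_vec, a) for a in male_attrs) / len(male_attrs)
    f = sum(cosine(word_vec, b) for b in female_attrs) / len(female_attrs)
    return m - f

# Toy 2-d "embeddings", invented for illustration only.
emb = {
    "he": (1.0, 0.1), "man": (0.9, 0.2),
    "she": (0.1, 1.0), "woman": (0.2, 0.9),
    "engineer": (0.8, 0.3), "nurse": (0.3, 0.8),
}
male = [emb["he"], emb["man"]]
female = [emb["she"], emb["woman"]]
```

Tracking this score for the same word across embeddings trained on different time periods gives the kind of diachronic curve the thesis analyses.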

Unveiling the Swedish philosophical landscape : A topic model study of the articles of a Swedish philosophical journal from 1980-2020

Lindqvist, Björn January 2023 (has links)
Bibliometric research is an important tool for examining the scientific output of various fields of study. Such research makes it possible to see how the influence of different people, ideologies, and discoveries has shaped scientific discourse. One way of doing this is topic modelling, which organizes the words used within a set of text data into different topics. To the author's knowledge, no topic modelling study of Swedish philosophy had previously been conducted. This study therefore aimed to partially fill that gap by exploring the publications of one specific Swedish philosophical journal. Using Python, a topic model with 14 topics was created from the journal Filosofisk tidskrift, and the change in these topics between 1980 and 2020 was examined. Specific attention was given to possible differences between analytic and Continental philosophy. To validate the results, an interview was also held with Fredrik Stjernberg, professor of theoretical philosophy. The results displayed varied popularity and change for each topic. Too little Continental philosophy was found for a proper comparison, leading to the conclusion that Continental philosophy is not very influential in Swedish philosophical discourse. Future research should be conducted on peer-reviewed articles and draw on greater professional philosophical expertise.
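Once a topic model has assigned each article a topic distribution, tracking topics over time reduces to averaging weights per period. A minimal sketch follows; the document-topic weights and years below are invented, not drawn from Filosofisk tidskrift:

```python
from collections import defaultdict

def topic_trends(doc_topics, doc_years, period=10):
    """Average each topic's weight per `period`-year bucket.
    `doc_topics`: list of dicts {topic_id: weight}; `doc_years`: list of ints."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for topics, year in zip(doc_topics, doc_years):
        bucket = (year // period) * period
        counts[bucket] += 1
        for t, w in topics.items():
            sums[bucket][t] += w
    return {b: {t: w / counts[b] for t, w in ts.items()}
            for b, ts in sums.items()}

# Toy document-topic distributions, as an LDA-style model might output them.
docs = [{0: 0.8, 1: 0.2}, {0: 0.6, 1: 0.4}, {0: 0.1, 1: 0.9}]
years = [1981, 1985, 2015]
trends = topic_trends(docs, years)
```

Plotting each topic's bucket averages over 1980-2020 gives the popularity curves the study examines.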

An End-to-End Native Language Identification Model without the Need for Manual Annotation / En modersmålsidentifiering modell utan behov av manuell annotering

Buzaitė, Viktorija January 2022 (has links)
Native language identification (NLI) is a classification task that identifies the mother tongue of a language learner based on spoken or written material. The task gained popularity when it was featured at the BEA-12 workshop in 2017, and since then many applications have been found for NLI, ranging from language learning to authorship identification and forensic science. While a considerable amount of research has already been done in this area, we introduce a novel approach that incorporates syntactic information into a BERT-based NLI model. In addition, we train separate models to test whether erroneous input sequences perform better than corrected sequences. To answer these questions we carry out both a quantitative and a qualitative analysis. We also test the idea of implementing a BERT-based grammatical error correction (GEC) model to supply more training data to the NLI model without the need for manual annotation. Our results suggest that our models do not outperform the SVM baseline, which we attribute to the limited amount of training data in our dataset: transformer-based architectures like BERT need large amounts of data to be fine-tuned successfully, whereas simple linear models like SVMs perform well on small amounts of data. We also find that erroneous structures in the data prove useful when combined with syntactic information, but that neither boosts the performance of the NLI model on its own. Furthermore, our GEC system performs well enough to produce additional data for the NLI models, whose scores increase once the data from our second experiment is added. We believe that our proposed architecture is potentially suitable for the NLI task if extended as suggested in the conclusion.
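The strength of linear baselines on small NLI datasets typically comes from simple surface features. As a hedged sketch of the kind of features such a baseline might use (the thesis's exact baseline feature set is not specified here), character trigram counts can be extracted as follows:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-gram counts with boundary padding, a common feature
    type for linear NLI classifiers such as SVMs."""
    padded = f"^{text}$"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

feats = char_ngrams("the the", 3)
```

Feeding such sparse count vectors to a linear classifier needs no manual annotation beyond the native-language labels themselves.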

Readability Assessment with Pre-Trained Transformer Models : An Investigation with Neural Linguistic Features

Ma, Chuchu January 2022 (has links)
Readability assessment (RA) assigns a score or a grade to a given document, measuring how difficult the document is to read. RA originated in language education studies, where it was used to classify reading materials for language learners, and was later applied in many other settings, such as aiding automatic text simplification. This thesis is aimed at improving the way Transformers are used for RA. The motivation is the “pipeline” effect (Tenney et al., 2019) of pretrained Transformers: lexical, syntactic, and semantic features are best encoded by different layers of a Transformer model. After a preliminary test of a basic RA model resembling previous work, we propose several methods to enhance performance: using a Transformer layer other than the last, concatenating or mixing the outputs of all layers, and using syntax-augmented Transformer layers. We examine these methods on three datasets: WeeBit, OneStopEnglish, and CommonLit. The improvements show a clear correlation with dataset characteristics. On the OneStopEnglish and CommonLit datasets, we achieve absolute improvements of 1.2% in F1 score and 0.6% in Pearson's correlation coefficient, respectively. We also show that an n-gram frequency-based baseline, which is simple but was not reported in previous work, has superior performance on the classification datasets (WeeBit and OneStopEnglish), prompting further research on vocabulary-based lexical features for RA.
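One of the proposed enhancements, mixing the outputs of all layers, can be illustrated with an ELMo-style scalar mix over per-layer vectors. The sketch below uses invented 2-d "hidden states" and is not the thesis's implementation:

```python
import math

def scalar_mix(layer_outputs, weights, gamma=1.0):
    """ELMo-style scalar mix: softmax-normalise one learnable scalar per
    layer, then return the weighted sum of the layer vectors."""
    exps = [math.exp(w) for w in weights]
    z = sum(exps)
    norm = [e / z for e in exps]
    dim = len(layer_outputs[0])
    return [gamma * sum(norm[l] * layer_outputs[l][d]
                        for l in range(len(layer_outputs)))
            for d in range(dim)]

# Toy "hidden states" from a 3-layer encoder (2-d for illustration).
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = scalar_mix(layers, weights=[0.0, 0.0, 0.0])
```

During fine-tuning, the weights would be trained jointly with the RA head, letting the model emphasise whichever layers encode readability-relevant features.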

Document Expansion for Swedish Information Retrieval Systems / Dokumentexpansion för svenska informationssökningssystem

Hagström, Tobias January 2023 (has links)
Information retrieval systems have come to change how users interact with computerized systems and locate information. A major challenge when designing these systems is handling the vocabulary mismatch problem, i.e. that users, when formulating queries, pick different words than those present in the relevant documents that should be retrieved. With recent advances in artificial intelligence and the emergence of transformer-based language models, new methods have been proposed to alleviate this problem. One such method is the use of document expansion models, which append to each document words that are likely to appear in users' queries. As previous research on document expansion models has focused on English-language applications, this thesis investigates the effectiveness of one such model for Swedish. Although no improvement was found when using the method, the result is likely a consequence of dataset quality and domain rather than of the method itself.
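The vocabulary mismatch problem, and how appended expansion terms alleviate it, can be shown with a toy inverted index. The Swedish example document and the hand-picked expansion terms below merely stand in for a doc2query-style model's predictions:

```python
def build_index(docs):
    """Map each term to the set of doc ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index

def search(index, query):
    """Return ids of docs matching any query term."""
    hits = set()
    for term in query.lower().split():
        hits |= index.get(term, set())
    return hits

# A toy collection where the query wording differs from the document wording.
docs = {1: "tandvård för barn"}  # "dental care for children"
plain = build_index(docs)

# Document expansion appends likely query terms; here they are picked by
# hand to stand in for a generative expansion model's output.
expanded = build_index({1: docs[1] + " tandläkare barnen"})
```

A query for "tandläkare" ("dentist") misses the plain document but matches the expanded one, which is exactly the gap document expansion targets.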

Evaluating and Fine-Tuning a Few-Shot Model for Transcription of Historical Ciphers

Eliasson, Ingrid January 2023 (has links)
Thousands of historical ciphers, encrypted manuscripts, are stored in archives across Europe. Historical cryptology is the research field that studies these manuscripts, combining the interests of the humanities with methods from cryptography and computational linguistics. Before a cipher can be decrypted by automatic means, it must first be transcribed into machine-readable digital text. Image processing techniques and deep learning have made it possible to transcribe handwritten text automatically, but the task is challenging when ciphers constitute the target data. The main reason is a lack of labeled data, caused by the heterogeneity of handwriting and the tendency of ciphers to employ unique symbol sets. Few-shot learning is a machine learning framework that reduces the need for labeled data by using pretrained models in combination with support sets containing a few labeled examples from the target data set. This project evaluates a few-shot model on the task of transcribing historical ciphers. The model is tested on pages from three in-domain ciphers that vary in handwriting style and symbol set. The project also investigates further fine-tuning the model on a limited number of labeled symbol examples from the respective target ciphers. We find that the performance of the model depends on the handwriting style of the target document, and that certain model parameters should be explored individually for each data set. We further show that fine-tuning the model is indeed effective, lowering the Symbol Error Rate (SER) by up to 27.6 percentage points.
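The reported metric, Symbol Error Rate, is edit distance normalised by the reference length. A minimal sketch using the standard dynamic-programming Levenshtein distance (not the project's own evaluation code):

```python
def levenshtein(ref, hyp):
    """Edit distance between two symbol sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def symbol_error_rate(ref, hyp):
    """SER: edit distance normalised by the reference length."""
    return levenshtein(ref, hyp) / len(ref)
```

Because cipher symbols need not be single characters, the functions accept lists of arbitrary symbol labels as well as strings.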

Disambiguating Italian homographic heterophones with SoundChoice and testing ChatGPT as a data-generating tool

Nanni, Matilde January 2023 (has links)
Text-to-speech systems are challenged by the presence of homographs, words that have more than one possible pronunciation. Rule-based approaches are often still the preferred solution to this issue in industry, but there have been multiple attempts to solve the ‘homograph issue’ with statistical, neural, and hybrid techniques, mostly for English. Ploujnikov and Ravanelli (2022) proposed a neural grapheme-to-phoneme framework, SoundChoice, which comes in an RNN version and a transformer version and can be fine-tuned for homograph disambiguation thanks to a weighted homograph loss. This thesis trains and tests the framework on Italian instead of English, to see how it performs on a different language. Moreover, since the available data containing homographs was insufficient for this task, the thesis experiments with ChatGPT as a data-generating tool. SoundChoice was also evaluated out of domain by testing it on corpus data. The results showed that the RNN model reached 71% accuracy from a baseline of 59%; the transformer model performed better, going from 57% to 74%. Further analysis would be needed to draw more solid conclusions about the origin of this gap, and the models should be trained on corpus data and tested on ChatGPT data to assess whether ChatGPT-generated data is indeed a suitable replacement for corpus data.
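The weighted homograph loss that enables fine-tuning can be sketched as a cross-entropy in which homograph positions receive extra weight. The toy distributions and the weight value below are invented, not SoundChoice's actual settings:

```python
import math

def weighted_ce(probs, targets, homograph_mask, w_homograph=2.0):
    """Token-level cross-entropy where homograph positions get extra
    weight, pushing the model harder toward the right pronunciation."""
    total, norm = 0.0, 0.0
    for p, t, is_h in zip(probs, targets, homograph_mask):
        w = w_homograph if is_h else 1.0
        total += -w * math.log(p[t])
        norm += w
    return total / norm

# Toy per-token distributions over two candidate pronunciations;
# the second token is the homograph.
probs = [(0.9, 0.1), (0.5, 0.5)]
loss = weighted_ce(probs, targets=[0, 1], homograph_mask=[False, True])
```

Because the uncertain homograph token is upweighted, the weighted loss exceeds its unweighted counterpart, concentrating the training signal where disambiguation matters.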

Domain-specific knowledge graph construction from Swedish and English news articles

Krupinska, Aleksandra January 2023 (has links)
In the current age, with new textual information emerging constantly, there is a challenge in processing and structuring it. Moreover, the information is often expressed in many different languages, but the discourse tends to be dominated by English, which may lead to overlooking important, specific knowledge in less well-resourced languages. Knowledge graphs have been proposed as a way of structuring unstructured data, making it machine-readable and available for further processing. Researchers have also emphasized the potential bilateral benefits of combining knowledge in low- and well-resourced languages. In this thesis, I combine the two goals of structuring textual data with the help of knowledge graphs and including multilingual information, in an effort to achieve a more accurate knowledge representation. The purpose of the project is to investigate whether the information about three Swedish companies known worldwide - H&M, Spotify, and Ikea - is the same in Swedish and English data sources, and how combining the two sources can be beneficial. Following a natural language processing (NLP) pipeline consisting of tasks such as coreference resolution, entity linking, and relation extraction, a knowledge graph is constructed from Swedish and English news articles about the companies, and refinement techniques are applied to improve the graph. The constructed knowledge graph is analyzed with respect to the overlap of extracted entities and the complementarity of information. Different variants of the graph are further evaluated by human raters, and a number of queries illustrate the capabilities of the constructed knowledge graph. The evaluation shows that the topics covered in the two information sources differ substantially: only a small number of entities occur in both languages. Combining the two sources can therefore contribute to a richer and more connected knowledge graph. The adopted refinement techniques increase the connectedness of the graph. Human evaluators consistently chose the Swedish side of the data as more relevant for the questions considered, which underscores the importance of not limiting research to the more easily available and processed English data.
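The entity-overlap analysis reduces to set operations once entities from both sources are linked to a shared inventory. A hedged sketch with invented triples (the entity names are illustrative, not extracted from the actual articles):

```python
def entity_overlap(triples_sv, triples_en):
    """Compare the entity sets of two knowledge graphs built from
    Swedish and English sources; triples are (head, relation, tail)."""
    ents_sv = {e for h, _, t in triples_sv for e in (h, t)}
    ents_en = {e for h, _, t in triples_en for e in (h, t)}
    shared = ents_sv & ents_en
    return {
        "shared": shared,
        "only_sv": ents_sv - ents_en,
        "only_en": ents_en - ents_sv,
        "jaccard": len(shared) / len(ents_sv | ents_en),
    }

# Toy triples; entity names are assumed already linked to a common
# inventory (e.g. Wikidata ids) so that they are comparable.
sv = [("Ikea", "founded_in", "Älmhult"), ("Ikea", "founder", "Ingvar_Kamprad")]
en = [("Ikea", "industry", "Retail"), ("Spotify", "hq", "Stockholm")]
report = entity_overlap(sv, en)
```

A low Jaccard score with large language-specific remainders is exactly the pattern the thesis reports: the two sources complement rather than duplicate each other.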

Genre classification using syntactic features

Brigadoi, Ivan January 2021 (has links)
This thesis work addresses text classification in relation to genre identification using different feature sets, with a focus on syntax-based features. We built our models with traditional machine learning algorithms, i.e. Naive Bayes, k-nearest neighbours, support vector machines, and random forests, in order to predict the literary genre of books. We trained our models on bag-of-words (BOW) features, bigrams, syntax-based bigrams, and emotional features, as well as combinations of these. On the test set, the best feature set, BOW combined with bigrams based on syntactic relations between words, improved F1-score by 2% over the BOW baseline, indicating a positive impact of syntactic information in the task of text classification.
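The difference between linear bigrams and bigrams based on syntactic relations can be seen on a single parsed sentence. The toy dependency parse below is hand-assigned for illustration:

```python
def linear_bigrams(tokens):
    """Adjacent token pairs in surface order."""
    return [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]

def syntactic_bigrams(tokens, heads):
    """Head-dependent pairs from a dependency parse; `heads[i]` is the
    1-based index of token i+1's head (0 = root), as in CoNLL format."""
    return [(tokens[h - 1], tokens[i]) for i, h in enumerate(heads) if h != 0]

# "old men like dogs": 'old' modifies 'men'; 'men' and 'dogs' attach to 'like'.
tokens = ["old", "men", "like", "dogs"]
heads = [2, 3, 0, 3]
```

Note that the surface pair ("men", "like") becomes the head-dependent pair ("like", "men") syntactically, so the two feature sets capture genuinely different word relations.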
