  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
121

Argument Mining: Claim Annotation, Identification, Verification

Karamolegkou, Antonia January 2021
Researchers writing scientific articles summarize their work in abstracts that state the final outcome of their study. Argumentation mining can be used to extract the researchers’ claim as well as the evidence that could support it. The rapid growth of scientific articles demands automated tools that could help in detecting and evaluating the veracity of scientific claims. However, there are neither many studies focusing on claim identification and verification nor many annotated corpora available to effectively train deep learning models. For this reason, we annotated two argument mining corpora and performed several experiments with state-of-the-art BERT-based models aiming to identify and verify scientific claims. We find that using SciBERT provides optimal results regardless of the dataset. Furthermore, increasing the amount of training data can improve the performance of every model we used. These findings highlight the need for large-scale argument mining corpora, as well as domain-specific pre-trained models.
122

Analyzing the Anisotropy Phenomenon in Transformer-based Masked Language Models / En analys av anisotropifenomenet i transformer-baserade maskerade språkmodeller

Luo, Ziyang January 2021
In this thesis, we examine in detail the anisotropy phenomenon in two popular masked language models, BERT and RoBERTa, and propose a possible explanation for this unreasonable phenomenon. First, we demonstrate that the contextualized word vectors derived from pretrained masked language model-based encoders share a common, perhaps undesirable pattern across layers. Namely, we find cases of persistent outlier neurons within the hidden state vectors of BERT and RoBERTa that consistently bear the smallest or largest values in said vectors. To investigate the source of this information, we introduce a neuron-level analysis method, which reveals that the outliers are closely related to information captured by positional embeddings. Second, we find that a simple normalization method, whitening, can make the vector space isotropic. Lastly, we demonstrate that "clipping" the outliers or whitening makes it possible to distinguish word senses more accurately and leads to better sentence embeddings when mean pooling.
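The effect described in this abstract can be illustrated with a toy example. The sketch below is not the thesis's method: it assumes that anisotropy is measured as the mean pairwise cosine similarity, and that "clipping" means bounding every component to a fixed range. The vectors, the helper names, and the limit of 1.0 are all made up for illustration.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all pairs; values near 1 suggest anisotropy."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def clip(vectors, limit):
    """Bound every component to [-limit, limit] (an assumed form of clipping)."""
    return [[max(-limit, min(limit, x)) for x in v] for v in vectors]

# Toy "hidden states": the last dimension is a shared outlier neuron.
vectors = [
    [1.0, -1.0, 20.0],
    [-1.0, 1.0, 20.0],
    [1.0, 1.0, 20.0],
    [-1.0, -1.0, 20.0],
]

before = mean_pairwise_cosine(vectors)
after = mean_pairwise_cosine(clip(vectors, 1.0))
print(f"mean cosine before clipping: {before:.3f}")
print(f"mean cosine after clipping:  {after:.3f}")
```

With the outlier dimension intact, every pair of vectors looks nearly identical; once it is clipped, the remaining dimensions determine the similarity and the average drops sharply.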
123

Named-entity recognition with BERT for anonymization of medical records

Bridal, Olle January 2021
Sharing data is an important part of the progress of science in many fields. In the largely deep learning dominated field of natural language processing, textual resources are in high demand. In certain domains, such as that of medical records, the sharing of data is limited by ethical and legal restrictions and therefore requires anonymization. The process of manual anonymization is tedious and expensive, so automated anonymization is of great value. Since medical records consist of unstructured text, pieces of sensitive information have to be identified in order to be masked for anonymization. Named-entity recognition (NER) is the subtask of information extraction in which named entities, such as person names or locations, are identified and categorized. Recently, models that leverage unsupervised training on large quantities of unlabeled training data have performed impressively on the NER task, which shows promise for their use in anonymization. In this study, a small set of medical records was annotated with named-entity tags. Because of the lack of any training data, a BERT model already fine-tuned for NER was then evaluated on this annotated set. The aim was to find out how well the model would perform on NER on medical records, and to explore the possibility of using the model to anonymize medical records. The most positive result was that the model was able to identify all person names in the dataset. The average accuracy for identifying all entity types was, however, relatively low. The success in identifying person names is argued to show promise for the model’s application to anonymization. However, because the overall accuracy is significantly worse than that of models fine-tuned on domain-specific data, it is suggested that there might be better methods for anonymization in the absence of relevant training data.
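The masking step mentioned in this abstract can be sketched as follows. This is a minimal illustration, assuming that an NER model has already returned character spans as `(start, end, label)` tuples; the function name, the span format, the label names, and the example sentence are hypothetical and not taken from the thesis.

```python
def mask_entities(text, spans):
    """Replace each (start, end, label) character span with a [LABEL] placeholder."""
    masked = text
    # Process spans right to left so earlier offsets stay valid after each replacement.
    for start, end, label in sorted(spans, reverse=True):
        masked = masked[:start] + f"[{label}]" + masked[end:]
    return masked

# Made-up record and made-up model output, for illustration only.
record = "Anna Berg visited the clinic in Uppsala."
spans = [(0, 9, "PER"), (32, 39, "LOC")]

anonymized = mask_entities(record, spans)
print(anonymized)
```

Any entity the model misses stays in the text, which is why recall on sensitive types such as person names matters more here than average accuracy over all types.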
124

Improving Solr search with Natural Language Processing : An NLP implementation for information retrieval in Solr / Att förbättra Solr med Natural Language Processing

Lager, Adam January 2021
The field of AI is emerging fast, and institutions and companies are pushing the limits of what was thought impossible. Natural Language Processing (NLP) is a branch of AI where the goal is to understand human speech and/or text. In this work, the technology is used to improve an inverted index, the full-text search engine Solr. Solr is open source and has integrated OpenNLP, making it a suitable choice for these kinds of operations. NLP-enabled Solr showed great results compared to the Solr that is currently running on the systems: while NLP-Solr was slightly worse in terms of precision, it excelled at recall and at returning the correct documents.
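The two ideas this abstract relies on, an inverted index and the precision/recall trade-off, can be sketched in a few lines. This is a toy illustration and not how Solr is implemented; the documents, the relevance judgment, and the helper names are assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def precision_recall(retrieved, relevant):
    """Precision: share of retrieved docs that are relevant. Recall: share of relevant docs retrieved."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Made-up documents and a made-up relevance judgment for the query "search".
docs = {
    1: "Solr search engine",
    2: "natural language processing",
    3: "search with language models",
}
index = build_inverted_index(docs)
retrieved = index["search"]
relevant = {1}

precision, recall = precision_recall(retrieved, relevant)
print(f"retrieved={sorted(retrieved)} precision={precision} recall={recall}")
```

In this toy query every relevant document is found (high recall) at the cost of one irrelevant hit (lower precision), which mirrors the trade-off reported for NLP-enabled Solr.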
125

Automated Digitization and Summarization of Analog Archives : Comparing summaries made by GPT-3 and a human

Linderholm, Maja January 2022
This thesis aimed to create a tool that could assist climate researchers in their fieldwork. Through dialog with researchers at Stockholm University, a need for and interest in automated digitization and summarization of their handwritten notes was identified. Climate research may require work conducted out in the field, and during fieldwork many researchers prefer to take handwritten notes, which can generate large physical archives. A downside of purely physical archives is that the data and knowledge stored in them become less available, which creates a threshold for researchers to use the data, since manually digitizing handwritten texts can be very time-consuming. By the end of the thesis, a software program had been created that could automatically digitize and summarize handwritten texts to save time for researchers. The tool consists of (1) the Google Cloud Vision API, used to digitize a photo of handwritten text by means of a convolutional neural network (CNN), and (2) the transformer-based algorithm GPT-3, used to summarize the digitized text. The GPT-3 algorithm provided two different engines, Davinci and Curie. The performance of the algorithms was evaluated with a data set consisting of handwritten texts provided by Stockholm University. The results indicated that the performance of the Google Cloud Vision API was highly correlated with the quality of the image and the style of handwriting. Unusual handwriting led to poor classification of letters, since the algorithm performed badly on shapes that were unfamiliar. A survey was used to evaluate the performance of GPT-3. The survey received 73 responses, in which the subjects graded five summaries of the same text produced by a human and by the GPT-3 engines Davinci and Curie, respectively. The results from the survey indicated that the performance of the Davinci engine was comparable to that of a human, while Curie was not a preferable option.
126

Fine-grained sentiment analysis of product reviews in Swedish

Westin, Emil January 2020
In this study we gather customer reviews from Prisjakt, a Swedish price comparison site, with the goal of studying the relationship between review and rating, known as sentiment analysis. The purpose of the study is to evaluate three different supervised machine learning models on a fine-grained dependent variable representing the review rating. For classification, a binary and a multinomial model are used with the one-versus-one strategy implemented in the Support Vector Machine, with a linear kernel, evaluated with F1, accuracy, precision and recall scores. We use Support Vector Regression by approximating the fine-grained variable as continuous, evaluated using MSE. Furthermore, the three models are evaluated on a balanced and an unbalanced dataset in order to investigate the effects of class imbalance. The results show that the SVR performs better on unbalanced fine-grained data, with the best fine-grained model reaching an MSE of 4.12, compared to the balanced SVR (6.84). The binary SVM model reaches an accuracy of 86.37% and a weighted F1 macro of 86.36% on the unbalanced data, while the balanced binary SVM model reaches approximately 80% for both measures. The multinomial model shows the worst performance due to its inability to handle class imbalance, despite the implementation of class weights. Furthermore, results from feature engineering show that SVR benefits marginally from certain regex conversions, and tf-idf weighting shows better performance on the balanced sets compared to the unbalanced sets.
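The two kinds of evaluation named in this abstract differ in what they reward, which the following sketch illustrates: MSE for a regressor that treats the rating as continuous, and accuracy for a classifier that predicts a discrete label. The ratings, predictions, and labels below are invented, and the helper names are not from the thesis.

```python
def mean_squared_error(true_values, predictions):
    """Average squared difference; penalizes predictions by how far off they are."""
    return sum((t - p) ** 2 for t, p in zip(true_values, predictions)) / len(true_values)

def accuracy(true_labels, predicted_labels):
    """Share of exact matches; a near miss counts the same as a large miss."""
    return sum(t == p for t, p in zip(true_labels, predicted_labels)) / len(true_labels)

# Made-up fine-grained ratings and made-up regression outputs.
true_ratings = [9, 4, 7, 10]
svr_outputs = [8.0, 6.0, 7.0, 9.0]
mse = mean_squared_error(true_ratings, svr_outputs)

# Made-up binary sentiment labels and classifier outputs.
true_labels = ["pos", "neg", "pos", "pos"]
svm_outputs = ["pos", "pos", "pos", "pos"]
acc = accuracy(true_labels, svm_outputs)

print(f"MSE={mse} accuracy={acc}")
```

The classifier in this toy example predicts the majority class every time and still scores 75% accuracy, which is one reason class imbalance needs separate attention in such studies.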
127

Automatic Recognition and Classification of Translation Errors in Human Translation / Automatisk igenkänning och klassificering av fel i mänsklig översättning

Dürlich, Luise January 2020
Grading assignments is a time-consuming part of teaching translation. Automatic tools that facilitate this task would allow teachers of professional translation to focus more on other aspects of their job. Within Natural Language Processing, error recognition has not been studied for human translation in particular. This thesis is a first attempt at both error recognition and classification with both mono- and bilingual models. BERT, a pre-trained monolingual language model, and NuQE, a model adapted from the field of Quality Estimation for Machine Translation, are trained on a relatively small hand-annotated corpus of student translations. Due to the nature of the task, errors are quite rare in relation to correctly translated tokens in the corpus. To account for this, we train the models with both under- and oversampled data. While both models detect errors with moderate success, the NuQE model adapts very poorly to the classification setting. Overall, scores are quite low, which can be attributed to class imbalance and the small amount of training data, as well as some general concerns about the corpus annotations. However, we show that powerful monolingual language models can detect formal, lexical and translational errors with some success and that, depending on the model, simple under- and oversampling approaches can already help a great deal to avoid pure majority class prediction.
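The under- and oversampling mentioned in this abstract can be sketched as plain random resampling. This is only one simple variant and may differ from the procedure used in the thesis; the token data, the label names `OK` and `ERR`, and the function names are assumptions for illustration.

```python
import random
from collections import Counter

def group_by_label(examples):
    """Collect (token, label) pairs into one list per label."""
    groups = {}
    for token, label in examples:
        groups.setdefault(label, []).append((token, label))
    return groups

def oversample(examples, seed=0):
    """Duplicate random minority examples until every class matches the largest one."""
    rng = random.Random(seed)
    groups = group_by_label(examples)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

def undersample(examples, seed=0):
    """Keep a random subset of each class, sized to match the smallest one."""
    rng = random.Random(seed)
    groups = group_by_label(examples)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    return balanced

# Made-up token-level data: errors are rare compared to correct tokens.
examples = [(f"tok{i}", "OK") for i in range(8)] + [("tokX", "ERR"), ("tokY", "ERR")]

over_counts = Counter(label for _, label in oversample(examples))
under_counts = Counter(label for _, label in undersample(examples))
print("oversampled:", dict(over_counts))
print("undersampled:", dict(under_counts))
```

Oversampling keeps all the data but repeats minority examples, while undersampling discards majority examples; with a small corpus, the second option can leave very little to train on.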
128

A comparative study of the grammatical gender systems of languages by means of analysing word embeddings

Veeman, Hartger January 2020
The creation of word embeddings is one of the key breakthroughs in natural language processing. Word embeddings allow words to be represented semantically, opening the way to many new deep learning methods. Understanding what information is in word embeddings will help in understanding the behaviour of embeddings in natural language processing tasks, and also allows for the quantitative study of linguistic features such as grammatical gender. This thesis attempts to explore how grammatical gender is encoded in word embeddings by analysing the performance of a neural network classifier on the classification of nouns by gender. This analysis is done in three experiments: an analysis of contextualized embeddings, an analysis of embeddings learned from modified corpora and an analysis of aligned embeddings in many languages. The contextualized word embedding model ELMo has multiple output layers with a gradually increasing presence of semantic information in the embedding. This differing presence of semantic information was used to test the classifier's reliance on semantic information. Swedish, German, Spanish and Russian embeddings were classified at all layers of a three-layered ELMo model. The word representation layer without any contextualization was found to produce the best accuracy, indicating that the noise introduced by the contextualization was more impactful than any potential extra semantic information. Swedish embeddings were learned from a corpus stripped of articles and from a stemmed corpus. Both sets of embeddings showed a drop of about 6% in accuracy in comparison with the embeddings from a non-augmented corpus, indicating that agreement plays a large role in the classification. Aligned multilingual embeddings were used to measure the accuracy of a grammatical gender classifier in 24 languages. The classifier models were applied to data of other languages to determine the similarity of the encoding of grammatical gender in these embeddings. Correcting the results with a random guessing baseline shows that transferred models can be highly accurate in certain language combinations and in some cases almost approach the accuracy of the model on its source data. A comparison between transfer accuracy and phylogenetic distance showed that the model transferability follows a pattern that resembles the phylogenetic distance.
129

Anchor-based Topic Modeling with Human Interpretable Results / Tolkningsbara ämnesmodeller baserade på ankarord

Andersson, Henrik January 2020
Topic models are useful tools for exploring large data sets of textual content by exposing a generative process from which the text was produced. Anchor-based topic models utilize the anchor word assumption to define a set of algorithms with provable guarantees which recover the underlying topics with a run time practically independent of corpus size. A number of extensions to the initial anchor word-based algorithms, and enhancements made to tangential models, have been proposed which improve the intrinsic characteristics of the models, making them more interpretable by humans. This thesis evaluates improvements to human interpretability due to: low-dimensional word embeddings in combination with a regularized objective function, automatic topic merging using tandem anchors, and utilizing word embeddings to synthetically increase corpus density. Results show that tandem anchors are viable vehicles for automatic topic merging, and that using word embeddings significantly improves the original anchor method across all measured metrics. Combining low-dimensional embeddings and a regularized objective results in computational downsides with small or no improvements to the metrics measured.
130

Question Classification in Question Answering Systems

Sundblad, Håkan January 2007
Question answering systems can be seen as the next step in information retrieval, allowing users to pose questions in natural language and receive succinct answers. In order for a question answering system as a whole to be successful, research has shown that the correct classification of questions with regard to the expected answer type is imperative. Question classification has two components: a taxonomy of answer types, and a machinery for making the classifications. This thesis focuses on five different machine learning algorithms for the question classification task. The algorithms are k nearest neighbours, naïve Bayes, decision tree learning, sparse network of winnows, and support vector machines. These algorithms have been applied to two different corpora, one of which has been used extensively in previous work and has been constructed for a specific agenda. The other corpus is drawn from a set of users' questions posed to a running online system. The results showed that the performance of the algorithms on the different corpora differs both in absolute terms and in their relative ranking. On the novel corpus, naïve Bayes, decision tree learning, and support vector machines perform on par with each other, while on the biased corpus there is a clear difference between them, with support vector machines being the best and naïve Bayes being the worst. The thesis also presents an analysis of questions that are problematic for all learning algorithms. The errors can roughly be divided into those due to categories with few members, variations in question formulation, the actual usage of the taxonomy, keyword errors, and spelling errors. A large portion of the errors were also hard to explain.
Report code: LiU-Tek-Lic-2007:29.
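One of the five algorithms listed in this abstract, naïve Bayes, is small enough to sketch in full. The version below is a generic multinomial naïve Bayes with add-one smoothing over whitespace tokens; it is not the thesis's implementation, and the training questions and the answer-type labels `PERSON` and `LOCATION` are a made-up two-class taxonomy.

```python
import math
from collections import Counter, defaultdict

def train(questions):
    """Count class frequencies and per-class word frequencies from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in questions:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def classify(text, model):
    """Return the label with the highest log posterior under add-one smoothing."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        denominator = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training questions with a made-up answer-type taxonomy.
training = [
    ("who wrote hamlet", "PERSON"),
    ("who discovered penicillin", "PERSON"),
    ("where is uppsala", "LOCATION"),
    ("where was mozart born", "LOCATION"),
]
model = train(training)

print(classify("who painted guernica", model))
print(classify("where is the louvre", model))
```

The classifier leans almost entirely on the question word here, which hints at why variations in question formulation and categories with few members show up among the error sources the abstract lists.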
