61

Analyzing the Anisotropy Phenomenon in Transformer-based Masked Language Models / En analys av anisotropifenomenet i transformer-baserade maskerade språkmodeller

Luo, Ziyang January 2021 (has links)
In this thesis, we examine the anisotropy phenomenon in the popular masked language models BERT and RoBERTa in detail and propose a possible explanation for it. First, we demonstrate that the contextualized word vectors derived from pretrained masked language model-based encoders share a common, perhaps undesirable pattern across layers: we find persistent outlier neurons within BERT's and RoBERTa's hidden-state vectors that consistently bear the smallest or largest values in those vectors. To investigate the source of this information, we introduce a neuron-level analysis method, which reveals that the outliers are closely related to information captured by positional embeddings. Second, we find that a simple normalization method, whitening, can make the vector space isotropic. Lastly, we demonstrate that 'clipping' the outliers or whitening the vectors leads to more accurate word-sense discrimination, as well as to better sentence embeddings under mean pooling.
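The whitening transform this abstract names can be sketched in a few lines. This is a generic illustration of the technique on invented toy data, not the thesis code: center the vectors, then rotate and rescale so the transformed set has (approximately) identity covariance, i.e. an isotropic space.

```python
import numpy as np

def whiten(X):
    """Whitening: centre the vectors, then rotate and rescale so the
    transformed set has identity covariance (an isotropic space)."""
    mu = X.mean(axis=0)
    Xc = X - mu
    # Eigendecompose the covariance via SVD: cov = U diag(s) U^T
    cov = Xc.T @ Xc / X.shape[0]
    U, s, _ = np.linalg.svd(cov)
    W = U @ np.diag(1.0 / np.sqrt(s))
    return Xc @ W

# Toy "embeddings": strongly anisotropic (one dominant direction),
# standing in for contextualized word vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) * np.array([10.0, 1.0, 0.5, 0.1])
Z = whiten(X)
print(np.round(np.cov(Z, rowvar=False), 2))  # approximately the identity
```

After whitening, every direction carries equal variance, which is exactly the isotropy property the abstract reports.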
62

Named-entity recognition with BERT for anonymization of medical records

Bridal, Olle January 2021 (has links)
Sharing data is an important part of scientific progress in many fields. In the largely deep-learning-dominated field of natural language processing, textual resources are in high demand. In certain domains, such as that of medical records, the sharing of data is limited by ethical and legal restrictions and therefore requires anonymization. Manual anonymization is tedious and expensive, so automated anonymization is of great value. Since medical records consist of unstructured text, pieces of sensitive information have to be identified in order to be masked for anonymization. Named-entity recognition (NER) is the subtask of information extraction in which named entities, such as person names or locations, are identified and categorized. Recently, models that leverage unsupervised training on large quantities of unlabeled data have performed impressively on the NER task, which shows promise for their use in anonymization. In this study, a small set of medical records was annotated with named-entity tags. Because no domain-specific training data was available, a BERT model already fine-tuned for NER was evaluated on this set. The aim was to find out how well the model would perform on NER on medical records, and to explore the possibility of using the model to anonymize them. The most positive result was that the model was able to identify all person names in the dataset. The average accuracy for identifying all entity types was, however, relatively low. The success in identifying person names shows promise for the model's application to anonymization. However, because the overall accuracy is significantly worse than that of models fine-tuned on domain-specific data, there might be better methods for anonymization in the absence of relevant training data.
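The masking step that turns NER output into an anonymized record can be sketched as follows. This is a hypothetical illustration: the record, the entity spans, and the label names are all invented, and a real pipeline would take the spans from a NER model rather than hard-coding them.

```python
def anonymize(text, entities):
    """Replace each recognised entity span with a placeholder tag.

    `entities` is a list of (start, end, label) character spans, e.g.
    the output of a NER model. Spans are applied right-to-left so that
    earlier offsets remain valid as the text changes length.
    """
    for start, end, label in sorted(entities, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text

# Invented example record and spans (character offsets, end exclusive)
record = "Patient Anna Svensson was admitted to Linkoping on 3 May."
spans = [(8, 21, "PER"), (38, 47, "LOC")]
print(anonymize(record, spans))
# Patient [PER] was admitted to [LOC] on 3 May.
```

Since the study found person names were recognized reliably, a masking step like this would at least remove the [PER] spans even when other entity types are missed.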
63

Building a Personally Identifiable Information Recognizer in a Privacy Preserved Manner Using Automated Annotation and Federated Learning

Hathurusinghe, Rajitha 16 September 2020 (has links)
This thesis explores the training of a deep neural network based named-entity recognizer in an end-to-end privacy-preserving setting, where dataset creation and model training happen with minimal manual intervention. As the accuracy of deep learning models on practical tasks improves, a rising concern is satisfying their demand for training data amid concerns about data privacy. Several data-protection regimes have been proposed in recent years in response to public concern, along with legal guidelines to enforce them. A promising development is decentralized model training on isolated datasets, which eliminates the privacy compromise of handing data to a centralized entity. However, even in this federated setting, curating the data source is still a privacy risk, especially for unstructured data such as text. We explore the feasibility of automatically annotating a dataset for a named-entity recognition (NER) task and of training a deep learning model with it in two federated learning settings. We further explore the feasibility of using a dataset created in this manner to fine-tune a state-of-the-art deep learning language model for the downstream NER task, and we examine how this novel combination of deep NLP models and federated learning deviates from the classical centralized setting. We created an automatically annotated dataset of around 80,000 sentences, a manually annotated test set, and tools to extend the dataset with more manual annotations. We observed that the noise from automated annotation can be mitigated to a degree by increasing the dataset size. We also contributed state-of-the-art NLP model developments to the federated learning framework. Overall, our NER model achieved an F1-score of around 0.80 for recognizing entities in sentences.
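The core of the federated setting described above is that clients train locally and only model weights, never raw text, reach the server. A minimal sketch of federated averaging (FedAvg-style aggregation) is given below; the client weights and dataset sizes are invented for illustration, and the thesis's actual aggregation scheme may differ.

```python
import numpy as np

def fed_avg(client_weights, client_sizes):
    """Federated averaging: each client trains locally and sends only
    its weight vector; the server combines them weighted by local
    dataset size, so sensitive text never leaves the client."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Two hypothetical clients with different amounts of local data
w_a = np.array([1.0, 2.0])   # client A's locally trained weights
w_b = np.array([3.0, 4.0])   # client B's locally trained weights
global_w = fed_avg([w_a, w_b], client_sizes=[100, 300])
print(global_w)  # weighted toward client B: [2.5 3.5]
```

In practice this aggregation runs once per communication round over the full parameter tensors of the NER model, but the weighting logic is the same.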
64

Automatic Recognition and Classification of Translation Errors in Human Translation / Automatisk igenkänning och klassificering av fel i mänsklig översättning

Dürlich, Luise January 2020 (has links)
Grading assignments is a time-consuming part of teaching translation. Automatic tools that facilitate this task would allow teachers of professional translation to focus more on other aspects of their job. Within natural language processing, error recognition has not been studied for human translation in particular. This thesis is a first attempt at both error recognition and classification with both mono- and bilingual models. BERT, a pre-trained monolingual language model, and NuQE, a model adapted from the field of quality estimation for machine translation, are trained on a relatively small hand-annotated corpus of student translations. Due to the nature of the task, errors are quite rare relative to correctly translated tokens in the corpus. To account for this, we train the models with both under- and oversampled data. While both models detect errors with moderate success, the NuQE model adapts very poorly to the classification setting. Overall, scores are quite low, which can be attributed to class imbalance and the small amount of training data, as well as some general concerns about the corpus annotations. However, we show that powerful monolingual language models can detect formal, lexical and translational errors with some success and that, depending on the model, simple under- and oversampling approaches can already help a great deal to avoid pure majority-class prediction.
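The oversampling remedy for class imbalance mentioned above can be sketched as follows. This is a generic illustration of naive random oversampling on an invented toy token/tag set, not the thesis's sampling code: minority-class examples are duplicated until the classes are balanced.

```python
import random

def oversample(examples, labels, minority_label, seed=0):
    """Naive random oversampling: duplicate minority-class examples
    until the class counts are balanced."""
    rng = random.Random(seed)
    minority = [(x, y) for x, y in zip(examples, labels) if y == minority_label]
    majority = [(x, y) for x, y in zip(examples, labels) if y != minority_label]
    # Draw extra minority examples (with replacement) to close the gap
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    combined = minority + majority + extra
    rng.shuffle(combined)
    return combined

# Invented tokens: most are correct ("OK"), a few are errors ("ERR")
tokens = ["the", "hte", "cat", "sat", "mat", "teh"]
tags   = ["OK", "ERR", "OK", "OK", "OK", "ERR"]
balanced = oversample(tokens, tags, minority_label="ERR")
print(sum(1 for _, y in balanced if y == "ERR"))  # 4, matching the 4 OK tokens
```

Undersampling is the mirror image: randomly discard majority-class examples instead of duplicating minority ones, trading data volume for balance.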
65

Community Recommendation in Social Networks with Sparse Data

Rahmaniazad, Emad 12 1900 (has links)
Indiana University-Purdue University Indianapolis (IUPUI) / Recommender systems are widely used in many domains. In this work, the importance of a recommender system in an online learning platform is discussed. After explaining the concept of adding an intelligent agent to online education systems, some features of the Course Networking (CN) website are demonstrated. Finally, the relation between CN, the intelligent agent (Rumi), and the recommender system is presented, along with a comparison of three different approaches for building a community recommendation system. The results show that Neighboring Collaborative Filtering (NCF) outperforms both the transfer learning method and the continuous bag-of-words approach. The NCF algorithm has a general format with two different implementations that can be used for other recommendations, such as course, skill, major, and book recommendations.
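A neighborhood-based collaborative filtering recommender of the general family named above can be sketched as follows. This is a textbook user-based variant on an invented interaction matrix, offered only to illustrate the idea; the thesis's NCF algorithm and its two implementations are not reproduced here.

```python
import numpy as np

def recommend(ratings, user, k=2):
    """User-based neighbourhood collaborative filtering: score unseen
    items by averaging the ratings of the k most cosine-similar users,
    then recommend the highest-scoring unseen item."""
    norms = np.linalg.norm(ratings, axis=1, keepdims=True)
    sims = (ratings @ ratings.T) / (norms @ norms.T)
    neighbours = [u for u in np.argsort(-sims[user]) if u != user][:k]
    scores = ratings[neighbours].mean(axis=0)
    scores[ratings[user] > 0] = -np.inf   # don't re-recommend joined items
    return int(np.argmax(scores))

# Rows: users, columns: communities (1 = joined, 0 = not yet seen)
R = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1]], dtype=float)
print(recommend(R, user=0))  # 2: user 0's nearest neighbour joined community 2
```

Swapping the similarity axis from users to items gives the item-based variant; the same skeleton then serves course, skill, major, or book recommendations.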
66

Using BERT to Measure Objective Quality of REST-API Specifications : Automated Approach for Quality Measurement

Eriksson, Fritz, Åkesson, Max January 2023 (has links)
Each day, the need for and the number of network-based applications grows, and with it the implementation of RESTful APIs. All of these APIs need documentation of their behavior, their benefits, how they interact with other APIs, and their expected results. To provide this, an API specification is constructed: a document containing the design philosophy of the APIs that can act as a guideline for how they should be built. When designing API specifications it is often difficult to judge the objective quality of the document, and doing so first requires deciding what counts as good objective quality in this regard. We used static code tests (linter rules) mapped to three quality attributes that fit the industry's consensus on the most important attributes a good-quality API must comply with. We then implemented an automatic process that splits API specifications into positive and negative training data based on the linter results, and used the resulting data to train our BERT model. The model can then assign an objective score to unseen API specifications. We used a saliency map (textual heatmap) to understand BERT's decisions, which added the potential to generate new linter rules from the results. After testing unseen API specifications on our BERT model, we saw that it was able to generate a reasonable quality score. However, when inserting smaller features to generate a textual heatmap, the predictions of our model were not correct, so it was not possible to understand BERT's decisions through our implementation, and new linter rules could not be derived by reviewing BERT's results.
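The automatic labeling step described above (linter results splitting specifications into positive and negative training data) can be sketched as follows. Everything here is hypothetical: the toy linter, its single rule, and the spec structure are invented stand-ins for a real OpenAPI linter and real rule sets.

```python
def label_specs(specs, linter, max_violations=0):
    """Split API specifications into positive/negative training examples
    by linter verdict: specs with at most `max_violations` rule hits
    become positive examples, the rest negative."""
    positive, negative = [], []
    for spec in specs:
        (positive if len(linter(spec)) <= max_violations else negative).append(spec)
    return positive, negative

def toy_linter(spec):
    """Hypothetical one-rule linter: every operation needs a description."""
    return [op for op, meta in spec["operations"].items()
            if not meta.get("description")]

specs = [
    {"operations": {"GET /users": {"description": "List users"}}},  # clean
    {"operations": {"GET /items": {}}},                             # violates rule
]
pos, neg = label_specs(specs, toy_linter)
print(len(pos), len(neg))  # 1 1
```

The resulting two pools would then feed fine-tuning of a classifier (here, the BERT model) to score unseen specifications.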
67

Towards Building a Versatile Tool for Social Media Spam Detection

Abdel Halim, Jalal 15 June 2023 (has links)
No description available.
68

W2R: an ensemble Anomaly detection model inspired by language models for web application firewalls security

Wang, Zelong, AnilKumar, Athira January 2023 (has links)
Nowadays, web application attacks have increased tremendously due to the large number of users and applications. Industries are therefore paying more attention to Web Application Firewalls (WAFs) and to improving their security; a WAF acts as a shield between the application and the internet by filtering and monitoring HTTP traffic. Most works focus either on traditional feature extraction or on deep methods that require no feature extraction. We noticed that the combination of an unsupervised language model with a classic dimension-reduction method is less explored for this problem. Motivated by this gap, we propose a new unsupervised anomaly detection model with better results than the existing state-of-the-art model for anomaly detection in WAF security. This paper explores WAF security through the following structure: 1) feature extraction from HTTP traffic packets using NLP (natural language processing) methods such as word2vec and BERT; 2) dimension reduction by PCA and autoencoder; and 3) applying different anomaly detection techniques, including OCSVM, isolation forest, LOF and combinations of these algorithms, to explore how these methods affect the results. We used the CSIC 2010 and ECML/PKDD 2007 datasets, on which the model achieved better results.
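Step 2 of the pipeline above, PCA dimension reduction, can be sketched in a few lines of linear algebra. This is a generic illustration on random stand-in features, not the paper's code; in the paper's setting the rows would be word2vec or BERT features of HTTP requests.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Dimension reduction by PCA: project centred feature vectors onto
    the directions of largest variance. The principal directions are
    the rows of Vt from the SVD of the centred data matrix."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# Random stand-ins for extracted HTTP-request features
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
Z = pca_reduce(X, n_components=3)
print(Z.shape)  # (200, 3)
```

The reduced vectors `Z` would then go to step 3, an anomaly detector such as a one-class SVM or isolation forest, which flags requests lying far from the learned normal region.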
69

An End-to-End Native Language Identification Model without the Need for Manual Annotation / En modersmålsidentifiering modell utan behov av manuell annotering

Buzaitė, Viktorija January 2022 (has links)
Native language identification (NLI) is a classification task that identifies the mother tongue of a language learner based on spoken or written material. The task gained popularity when it was featured in the 2017 BEA-12 workshop, and since then many applications have been found for NLI, ranging from language learning to authorship identification and forensic science. While a considerable amount of research has already been done in this area, we introduce a novel approach that incorporates syntactic information into a BERT-based NLI model. In addition, we train separate models to test whether erroneous input sequences perform better than corrected sequences. To answer these questions we carry out both a quantitative and a qualitative analysis. We also test the idea of implementing a BERT-based GEC model to supply more training data to our NLI model without the need for manual annotation. Our results suggest that our models do not outperform the SVM baseline, but we attribute this to the lack of training data in our dataset, as transformer-based architectures like BERT need large amounts of data to be fine-tuned successfully, whereas simple linear models like SVM perform well on small amounts of data. We also find that erroneous structures in the data are useful when combined with syntactic information, but that neither boosts the performance of the NLI model on its own. Furthermore, our GEC system performs well enough to produce more data for our NLI models, as their scores increase with the additional data from our second experiment. We believe that our proposed architecture is potentially suitable for the NLI task if we incorporate the extensions suggested in the conclusion.
70

Analysing the possibilities of a needs-based house configurator

Ermolaev, Roman January 2023 (has links)
A needs-based configurator is a system or tool that assists users in customizing products based on their specific needs. This thesis investigates the challenges of obtaining data for a needs-based machine learning house configurator and identifies suitable models for its implementation. The study consists of two parts: first, an analysis of how to obtain data, and second, an evaluation of three models for implementing the needs-based solution. The analysis shows that collecting house-review data for a needs-based configurator is challenging due to several factors, including how the housing market operates compared to other markets, privacy concerns, and the complexity of the buying process. To address this, future studies could consider alternative data sources, adding contextual data, and creating surveys or questionnaires. The evaluation of three models (DistilBERT, a BERT fine-tuned for Swedish, and a CNN with a Swedish word-embedding layer) shows that both BERT models perform well on the generated dataset, while the CNN model underperformed. The Swedish BERT model performed best, achieving high recall and precision for k between 2 and 5. This thesis suggests that further research on needs-based configurators should focus on alternative data sources and more extensive datasets to improve performance.
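The precision-at-k and recall-at-k metrics used in the evaluation above can be computed as follows. The ranked house list and relevance set here are invented for illustration; only the metric definitions are standard.

```python
def precision_recall_at_k(ranked, relevant, k):
    """Precision@k = relevant hits in the top k / k.
    Recall@k = relevant hits in the top k / total relevant items."""
    top_k = ranked[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k, hits / len(relevant)

# Hypothetical configurator output: houses ranked by predicted fit
ranked = ["villa_a", "villa_b", "flat_c", "villa_d"]
relevant = {"villa_a", "villa_d"}   # houses the user actually wanted
p, r = precision_recall_at_k(ranked, relevant, k=2)
print(p, r)  # 0.5 0.5
```

Reporting both metrics over a range of k (here, 2 to 5) shows how quickly the configurator surfaces the houses a user actually needs as the recommendation list grows.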
