141

Cleartext detection and language identification in ciphers

Gambardella, Maria-Elena January 2021
In historical cryptology, cleartext is text written in a known language within a cipher (a hand-written manuscript aiming at hiding the content of a message). Cleartext can give us a historical interpretation and contextualisation of the manuscript and could help researchers in cryptanalysis, but to this day there is still no research on how to automatically detect cleartext and identify its language. In this paper, we investigate to what extent we can automatically distinguish cleartext from ciphertext in transcribed historical ciphers and to what extent we are able to identify its language. We took a rule-based approach and ran 7 different models using historical language models on ciphertexts provided by the DECRYPT-Project. Our results show that using unigrams and bigrams on the word level combined with 3-grams, 4-grams and 5-grams on the character level is the best approach to tackle cleartext detection.
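
As a rough illustration of the rule-based n-gram scoring described above (a sketch only, not the authors' actual models), the following snippet scores each token of a transcription against character n-gram statistics collected from a known-language cleartext sample; the sample text, tokens, and threshold are invented for the example.

```python
from collections import Counter

def char_ngrams(text, n):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_profile(cleartext_sample, ns=(3, 4, 5)):
    """Collect character n-gram counts from a sample of known-language cleartext."""
    profile = Counter()
    for n in ns:
        profile.update(char_ngrams(cleartext_sample.lower(), n))
    return profile

def cleartext_score(token, profile, ns=(3, 4, 5)):
    """Fraction of the token's character n-grams seen in the cleartext profile."""
    grams = [g for n in ns for g in char_ngrams(token.lower(), n)]
    if not grams:
        return 0.0
    return sum(1 for g in grams if g in profile) / len(grams)

# Hypothetical usage: flag tokens that look like Latin cleartext inside a cipher line.
profile = build_profile("et in terra pax hominibus bonae voluntatis")
for token in "xqzvw nfkpt pax hominibus".split():
    label = "cleartext" if cleartext_score(token, profile) > 0.3 else "ciphertext"
    print(token, label)
```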
142

Attacking and Defending the Privacy of Clinical Language Models

Vakili, Thomas January 2023
The state-of-the-art methods in natural language processing (NLP) increasingly rely on large pre-trained transformer models. The strength of the models stems from their large number of parameters and the enormous amounts of data used to train them. The datasets are of a scale that makes it difficult, if not impossible, to audit them manually. When unwieldy amounts of potentially sensitive data are used to train large machine learning models, a difficult problem arises: the unintended memorization of the training data. All datasets, including those based on publicly available data, can contain sensitive information about individuals. When models unintentionally memorize these sensitive data, they become vulnerable to different types of privacy attacks. Very few datasets for NLP can be guaranteed to be free from sensitive data. Thus, to varying degrees, most NLP models are susceptible to privacy leakage. This susceptibility is especially concerning in clinical NLP, where the data typically consist of electronic health records. Unintentionally leaking publicly available data can be problematic, but leaking data from electronic health records is never acceptable from a privacy perspective. At the same time, clinical NLP has great potential to improve the quality and efficiency of healthcare. This licentiate thesis investigates how these privacy risks can be mitigated using automatic de-identification. This is done by exploring the privacy risks of pre-training using clinical data and then evaluating the impact on the model accuracy of decreasing these risks. A BERT model pre-trained using clinical data is subjected to a training data extraction attack. The same model is also used to evaluate a membership inference attack that has been proposed to quantify the privacy risks associated with masked language models. Then, the impact of automatic de-identification on the performance of BERT models is evaluated for both pre-training and fine-tuning data. The results show that extracting training data from BERT models is non-trivial and suggest that the risks can be further decreased by automatically de-identifying the training data. Automatic de-identification is found to preserve the utility of the data used for pre-training and fine-tuning BERT models, resulting in no reduction in performance compared to models trained using unaltered data. However, we also find that the current state-of-the-art membership inference attacks are unable to quantify the privacy benefits of automatic de-identification. The results show that automatic de-identification reduces the privacy risks of using sensitive data for NLP without harming the utility of the data, but that these privacy benefits may be difficult to quantify. / Research in natural language processing is becoming increasingly dependent on large pre-trained transformer models. These powerful language models consist of a large number of parameters trained by processing enormous amounts of data. The training data are typically of such a scale that it is difficult, if not impossible, to audit them manually. When unwieldy amounts of potentially sensitive data are used to train large language models, a hard-to-handle phenomenon arises: unintended memorization. Very few data sources are entirely free from sensitive personal information. Since large language models have been shown to memorize details about their training data, this makes them vulnerable to privacy attacks. This vulnerability is especially concerning in clinical NLP, where the data typically consist of electronic health records.
Disclosing personal information is problematic even when it is public, but leaking information from an individual's health records is an unacceptable violation of privacy. At the same time, clinical NLP has great potential to improve both the quality and the efficiency of healthcare. This licentiate thesis investigates how the aforementioned privacy risks can be reduced with the help of automatic de-identification. This is examined by first exploring the risks of pre-training language models on clinical data and then comparing how the reliability and performance of the models are affected when these risks are reduced. A BERT model pre-trained on clinical data is subjected to an attack aimed at extracting training data. The same model is also used to evaluate a proposed method for quantifying the privacy risks of masked language models, based on the models' susceptibility to membership inference attacks. Next, the usefulness of automatically de-identified data is evaluated both for pre-training BERT models and for training them to solve specific NLP tasks. The results show that extracting training data from language models is non-trivial. At the same time, the risks that do remain can be reduced by automatically de-identifying the models' training data. Furthermore, the results show that language models trained on automatically de-identified data perform as well as those trained on sensitive data. This holds both for pre-training and for training on specific tasks. However, the membership inference experiments show that current methods do not capture the privacy benefits of automatically de-identifying training data. In summary, this thesis shows that automatic de-identification can be used to reduce the privacy risks that come with using sensitive data while preserving the data's utility. However, established methods for quantifying these privacy gains are still lacking.
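
For context, a membership-style score for a masked language model can be sketched as a pseudo-log-likelihood: mask each token in turn and sum the log-probability the model assigns to it. The snippet below is a minimal illustration under that assumption, using a generic public BERT checkpoint as a stand-in; it is not the specific attack or clinical model evaluated in the thesis.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Generic BERT checkpoint as a stand-in; the thesis works with a clinical BERT model.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

def pseudo_log_likelihood(text):
    """Sum of log-probabilities of each token when it alone is masked.

    Membership-style scores of this kind are compared against a reference
    (another model or a tuned threshold): unusually high likelihood suggests
    the text may have been seen during training.
    """
    ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):          # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

print(pseudo_log_likelihood("The patient was discharged without complications."))
```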
143

Multilingual Transformer Models for Maltese Named Entity Recognition

Farrugia, Kris January 2022
The recently developed state-of-the-art models for Named Entity Recognition are heavily dependent on huge amounts of annotated data. Consequently, it is extremely challenging for data-scarce languages to obtain significant results. Several approaches have been proposed to circumvent this issue, including cross-lingual transfer learning, which leverages knowledge obtained from available resources in a source language and transfers it to a low-resource target language. Maltese is one of many severely under-resourced languages. The main purpose of this project is to research how recently developed multilingual transformer models (Multilingual BERT and XLM-RoBERTa) perform and to ultimately set up an evaluation benchmark in zero-shot cross-lingual transfer learning for Maltese Named Entity Recognition. The models are fine-tuned on Arabic, English, Italian, Spanish and Dutch. The experiments evaluated the efficacy of the source languages and the use of multilingual data in both the training and validation stages. The experiments demonstrated that feeding multilingual data to both the training and the validation phases was mostly beneficial to performance, whereas adding it to the validation phase only was generally detrimental. Furthermore, XLM-R achieved better scores overall; however, employing mBERT with English as the source language yielded the best performance.
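
A minimal sketch of the zero-shot inference step in such a cross-lingual setup, assuming a Hugging Face token-classification model; the checkpoint here is the base (not yet fine-tuned) XLM-R model, and the label set and Maltese sentence are placeholders rather than the thesis's actual data.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "xlm-roberta-base"  # in the real setup, a checkpoint fine-tuned for NER on a source language
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]  # assumed tag set

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(labels))

sentence = "Joseph Muscat twieled f'Malta."  # Maltese example sentence (placeholder)
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# With an untrained head the tags printed here are random; fine-tuning on
# source-language NER data (e.g. English) would precede this step.
pred_ids = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, pred_ids):
    print(token, labels[pred])
```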
144

Towards Latent Space Disentanglement of Variational AutoEncoders for Language

García de Herreros García, Paloma January 2022
Variational autoencoders (VAEs) are a neural network architecture broadly used in image generation; they encode data from some domain and project it into a latent space (Doersch 2016). In doing so, the encoding space goes from being a discrete distribution of vectors to a series of continuous manifolds. The latent space is subject to a Gaussian prior, giving it some convenient properties for the distribution of these manifolds. Several strategies have been proposed to disentangle the latent space so that each of its dimensions has an interpretable meaning, for example 𝛽-VAE, Factor-VAE, and 𝛽-TCVAE. In this thesis, previous VAE models for Natural Language Processing are combined with these disentangling techniques: Park and Lee (2021), who fine-tune pretrained transformer models so they behave as VAEs, and Bowman et al. (2015), who use a recurrent neural network language model to create a VAE that generates sentences in the continuous latent space. The aim is to show whether we can find any understandable meaning in the associated dimensions. The obtained results indicate that the techniques cannot be applied to text-based data without causing the model to suffer from posterior collapse.
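
To make the disentanglement objective concrete, the sketch below shows the 𝛽-VAE loss mentioned above: a reconstruction term plus a 𝛽-weighted KL divergence toward the Gaussian prior. The architecture is left out and the 𝛽 value is illustrative, not the one used in the thesis.

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    """beta-VAE objective: reconstruction term + beta * KL(q(z|x) || N(0, I)).

    With beta = 1 this is the standard VAE ELBO; beta > 1 pushes the posterior
    toward the factorized Gaussian prior, encouraging disentangled dimensions.
    """
    recon = F.mse_loss(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl

# Hypothetical usage with an encoder/decoder pair (not shown):
# mu, logvar = encoder(x)
# z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
# loss = beta_vae_loss(x, decoder(z), mu, logvar, beta=4.0)
```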
145

Deep Text Mining of Instagram Data Without Strong Supervision / Textutvinning från Instagram utan Precis Övervakning

Hammar, Kim January 2018
With the advent of social media, our online feeds increasingly consist of short, informal, and unstructured text. This data can be analyzed for the purpose of improving user recommendations and detecting trends. The sheer volume of unstructured text that is available makes the intersection of text processing and machine learning a promising avenue of research. Current methods that use machine learning for text processing are in many cases dependent on annotated training data. However, considering the heterogeneity and variability of social media, obtaining strong supervision for social media data is in practice both difficult and expensive. In light of this limitation, a belief that has shaped this thesis is that text mining methods which can be applied without strong supervision are of greater practical interest. This thesis investigates unsupervised methods for scalable processing of text from social media. In particular, the thesis targets a classification and extraction task in the fashion domain on the image-sharing platform Instagram. Instagram is one of the largest social media platforms, containing both text and images. Still, research on text processing in social media is to a large extent limited to Twitter data, and little attention has been paid to text mining of Instagram data. The aim of this thesis is to broaden the scope of state-of-the-art methods for information extraction and text classification to the unsupervised setting, working with informal text on Instagram. Its main contributions are (1) an empirical study of text from Instagram; (2) an evaluation of word embeddings for Instagram text; (3) a distributed implementation of the FastText algorithm; (4) a system for fashion attribute extraction in Instagram using word embeddings; and (5) a multi-label clothing classifier for Instagram text, built with deep learning techniques and minimal supervision. The empirical study demonstrates that the text distribution on Instagram exhibits the long-tail phenomenon, that the text is just as noisy as has been reported in studies on Twitter text, and that comment sections are multi-lingual. In experiments with word embeddings for Instagram, the importance of hyperparameter tuning is manifested and a mismatch between pre-trained embeddings and social media is observed. Furthermore, word embeddings are confirmed to be a useful asset for information extraction. Experimental results show that word embeddings beat a baseline that uses Levenshtein distance on the task of extracting fashion attributes from Instagram. The results also show that the distributed implementation of FastText reduces the time it takes to train word embeddings by a factor that scales with the number of machines used for training. Finally, our research demonstrates that weak supervision can be used to train a deep classifier, achieving an F1 score of 0.61 on the task of classifying clothes in Instagram posts based only on the associated text, which is on par with human performance. / With the advent of social media, our online feeds largely consist of short and informal text messages; these data can be analyzed with the aim of detecting trends and providing user recommendations. Given the large volume of unstructured text available, the combination of language technology and machine learning is a research area with great potential. Current machine learning technologies for text processing are in many cases dependent on annotated data for training.
In practice, however, obtaining high-quality annotated data is both complicated and expensive, not least for social media data, given how variable and heterogeneous social media are as a data source. A conviction that permeates this thesis is that text mining methods which do not require strong supervision have greater practical potential. This thesis investigates unsupervised methods for scalable processing of text from social media. Specifically, the thesis covers a complex classification and extraction problem in the fashion domain on the image-sharing platform Instagram. Instagram is one of the most popular social platforms and contains both images and text. Despite this, research on text mining from social media is largely limited to Twitter data, and little attention has been paid to the great opportunities of text mining from Instagram. The purpose of the thesis is to improve current methods used in text classification and information extraction and make them applicable to unsupervised machine learning on informal text from Instagram. The primary research contributions of this thesis are (1) an empirical study of text from Instagram; (2) an evaluation of word embeddings for use with Instagram text; (3) a distributed implementation of the FastText algorithm; (4) a system for extracting clothing attributes from Instagram using word embeddings; and (5) a multi-label clothing classifier for Instagram text, developed with deep learning and minimal supervision. The empirical study shows that the text distribution on Instagram has a long tail, that the text is as informal as previously reported in studies of Twitter, and that the comment sections are multilingual. Experiments with word embeddings for Instagram underline the importance of tuning parameters before training rather than using preset values. In addition, word embeddings trained on formal text are shown to be ill-suited for applications that process informal text. Furthermore, word embeddings are shown to be effective for information extraction in social media, outperforming a baseline based on syntactic word similarity. The results also show that the distributed implementation of FastText can reduce the time it takes to train word embeddings by a factor that depends on the number of machines used for training. Finally, our research indicates that weak supervision can be used to train a deep learning classifier. The trained classifier achieves an F1 score of 0.61 on the task of classifying clothing attributes in Instagram images, based only on the caption and associated user comments, which is on par with human performance.
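
As a pointer to the kind of embedding training listed among the contributions, here is a minimal single-machine sketch using gensim's FastText (the thesis itself implements a distributed version); the toy corpus stands in for tokenized Instagram captions and comments.

```python
from gensim.models import FastText

# Toy corpus standing in for tokenized Instagram captions and comments.
sentences = [
    ["love", "this", "red", "dress", "#ootd"],
    ["new", "sneakers", "and", "denim", "jacket"],
    ["silk", "scarf", "with", "floral", "print"],
]

# Subword n-grams (min_n/max_n) let FastText handle hashtags, misspellings,
# and other noisy out-of-vocabulary tokens common in social media text.
model = FastText(sentences, vector_size=50, window=3, min_count=1,
                 min_n=3, max_n=5, epochs=20)

print(model.wv.most_similar("dress", topn=3))
print(model.wv["dresss"][:5])  # vector for an unseen, misspelled token
```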
146

A Survey of Non-Projective Dependencies and a Novel Approach to Projectivization for Parsing

Decatur, James January 2022
Non-projective dependencies remain a largely unresolved issue in the field of dependency parsing. Regardless of which parsing algorithm is used, researchers run into the issues of computational speed and lower parsing performance on non-projective dependencies than on projective dependencies. Through a better understanding of non-projectivity, we may be able to address both issues. This thesis aims to discover which types of non-projective dependencies are prevalent in three languages: English, German, and Czech. Moreover, it aims to define and create a linguistically informed projectivization scheme and to find out the extent to which the scheme improves upon the performance of the baseline parser. In order to achieve these aims, the eight most frequently occurring non-projective dependencies in English, German, and Czech were surveyed. This means that the causes of their non-projectivity were discovered, the structures of the non-projective dependencies were analyzed, and generalizations and comparisons between non-projective dependencies were made. After the survey, an attempt to define and create a linguistically informed projectivization scheme was made. The goals were not only to projectivize the non-projective relations but to do so by assigning the closest possible new parent in the sentence to the non-projective child and by minimizing the number of projectivization transformations that needed to be made. Although the survey of the non-projective dependencies yielded good results, showing that the more frequently occurring non-projective dependencies in German and Czech share both their causes and their structures, we reached no solid conclusion on how a linguistically informed projectivization scheme should be defined, and further research is needed. However, the novel projectivization scheme we did come up with managed to marginally outperform the baseline parser in English and German, and to moderately outperform it in Czech, the language with the most non-projective dependencies of the group.
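
For readers unfamiliar with the property at issue, the following self-contained sketch detects non-projectivity by finding crossing arcs in a dependency tree given as a head index per token; the example sentence and arcs are invented for illustration.

```python
def crossing_arcs(heads):
    """Return pairs of crossing arcs in a dependency tree.

    `heads[i]` is the 1-based index of the head of token i+1 (0 = artificial root).
    Two arcs cross, making the tree non-projective, when exactly one endpoint of
    one arc lies strictly between the endpoints of the other.
    """
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    crossings = []
    for i, (a1, b1) in enumerate(arcs):
        for a2, b2 in arcs[i + 1:]:
            if a1 < a2 < b1 < b2 or a2 < a1 < b2 < b1:
                crossings.append(((a1, b1), (a2, b2)))
    return crossings

# Invented example: "A hearing is scheduled on the issue today",
# where "issue" attaches to "hearing" while "today" attaches to "scheduled",
# producing crossing arcs.
heads = [2, 3, 0, 3, 7, 7, 2, 4]
print(crossing_arcs(heads))  # non-empty list => the tree is non-projective
```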
147

In-Domain and Cross-Domain Classification of Patronizing and Condescending Language in Social Media and News Texts : A Study in Implicitly Aggressive Language Detection and Methods

Ortiz, Flor January 2022
The field of aggressive language detection is developing quickly in Natural Language Processing. However, most of the work in this field centers on explicitly aggressive language, whereas work exploring forms of implicitly aggressive language is so far much less prolific. Further, many subcategories are encompassed within the greater category of implicitly aggressive language, for example condescending and patronizing language. This thesis focuses on the relatively new field of patronizing and condescending language (PCL) detection, specifically on expanding away from in-domain tasks that focus on either news or social media texts. Cross-domain patronizing and condescending language detection is as of today not a widely explored sub-field of Natural Language Processing. In this project, we aim to answer three main research questions. The first is to what extent models trained to detect patronizing and condescending language in one domain (in this case social media texts and news publications) generalize to other domains. Secondly, we aim to make advances toward a baseline for balanced PCL datasets and compare performance across label distribution ratios. Thirdly, we aim to address the impact of a common feature of patronizing and condescending language datasets: the significant imbalance between negative and positive labels in the binary classification task. To this end, we ask to what extent the proportion between labels has an impact on the in-domain PCL classification task. We find that the best-performing model for the in-domain classification task is the Gradient Boosting classifier trained on an imbalanced dataset harvested from Reddit, which included both the post and the reply, with a ratio of 1:2 between positive and negative labels. In the cross-domain task, we find that the best-performing model is an SVM trained on the balanced news dataset and evaluated on the balanced Reddit post-and-reply dataset. In the latter study, we show that it is possible to achieve competitive results using classical machine learning models on a nuanced, context-dependent binary classification task.
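
A minimal sketch of the kind of classical pipeline reported above, pairing TF-IDF features with a Gradient Boosting classifier; the example texts and labels are invented placeholders, not the Reddit or news data used in the thesis.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier

# Placeholder examples; 1 = patronizing/condescending language (PCL), 0 = not PCL.
texts = [
    "These poor souls just need someone to explain the basics to them.",
    "The council approved the new budget on Tuesday.",
    "Bless them, they try so hard even though they never quite get it.",
    "The match ended in a 2-2 draw.",
]
labels = [1, 0, 1, 0]

pcl_clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=1)),
    ("gb", GradientBoostingClassifier(n_estimators=100, random_state=0)),
])
pcl_clf.fit(texts, labels)
print(pcl_clf.predict(["They simply cannot be expected to understand such things."]))
```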
148

Low-Resource Domain Adaptation for Jihadi Discourse : Tackling Low-Resource Domain Adaptation for Neural Machine Translation Using Real and Synthetic Data

Tollersrud, Thea January 2023
In this thesis, I explore the problem of low-resource domain adaptation for jihadi discourse. Due to the limited availability of annotated parallel data, developing accurate and effective models in this domain poses a challenging task. To address this issue, I propose a method that leverages a small in-domain manually created corpus and a synthetic corpus created from monolingual data using back-translation. I evaluate the approach by fine-tuning a pre-trained language model on different proportions of real and synthetic data and measuring its performance on a held-out test set. My experiments show that fine-tuning a model on one-fifth real parallel data and synthetic parallel data effectively reduces occurrences of over-translation and bolsters the model's ability to translate in-domain terminology. My findings suggest that synthetic data can be a valuable resource for low-resource domain adaptation, especially when real parallel data is difficult to obtain. The proposed method can be extended to other low-resource domains where annotated data is scarce, potentially leading to more accurate models and better translation of these domains.
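
To illustrate the back-translation step used to build the synthetic corpus, the sketch below translates monolingual target-side sentences back into the source language to form synthetic parallel pairs. The language direction (Arabic-English) and the reverse-model checkpoint are assumptions made for the example, not necessarily those used in the thesis.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Assumption for the sketch: the task is Arabic -> English, so monolingual English
# in-domain sentences are back-translated into Arabic with a reverse (en -> ar) model.
reverse_model_name = "Helsinki-NLP/opus-mt-en-ar"
tokenizer = AutoTokenizer.from_pretrained(reverse_model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(reverse_model_name)

monolingual_english = [
    "The group released a new statement on Monday.",
    "The message was shared across several channels.",
]

synthetic_pairs = []
for sentence in monolingual_english:
    inputs = tokenizer(sentence, return_tensors="pt")
    generated = model.generate(**inputs, max_new_tokens=64)
    synthetic_source = tokenizer.decode(generated[0], skip_special_tokens=True)
    # (synthetic Arabic source, real English target) becomes a training pair
    synthetic_pairs.append((synthetic_source, sentence))

for src, tgt in synthetic_pairs:
    print(src, "->", tgt)
```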
149

How do voiceprints age?

Nachesa, Maya Konstantinovna January 2023
Voiceprints, like fingerprints, are a biometric. Where fingerprints record a person's unique pattern on their finger, voiceprints record what a person's voice "sounds like", abstracting away from what the person said. They have been used in speaker recognition, including verification and identification. In other words, they have been used to ask "is this speaker who they say they are?" or "who is this speaker?", respectively. However, people age, and so do their voices. Do voiceprints age, too? That is, can a person's voice change enough that after a while, the original voiceprint can no longer be used to identify them? In this thesis, I use Swedish audio recordings of debate speeches from the Riksdag (the Swedish parliament) to test this idea. The answer influences how well we can search the database for previously unmarked speeches. I find that speaker verification performance decreases as the age gap between voiceprints increases, and that it decreases more strongly after roughly five years. Additionally, I grouped the speakers into age groups spanning five years and found that speaker verification performs best for those whose initial voiceprint was recorded at 29-33 years of age. Longer input speech also provides higher-quality voiceprints, with performance improvements stagnating when the speech segments become longer than 30 seconds. Finally, voiceprints for men age more strongly than those for women after roughly five years. I also investigated how emotions are encoded in voiceprints, since this could potentially impede speaker recognition. I found that it is possible to train a classifier to recognise emotions from voiceprints, and that this classifier does better when recognising emotions from known speakers. That is, emotions are encoded more characteristically per person than per emotion itself. As such, they are unlikely to interfere with speaker recognition.
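
For context, speaker verification with voiceprints typically reduces to comparing fixed-size speaker embeddings. The sketch below shows the cosine-similarity decision step, with random vectors standing in for real voiceprints and an arbitrary placeholder threshold.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(voiceprint_a, voiceprint_b, threshold=0.7):
    """Verification decision: accept if the embeddings are similar enough.

    The threshold is tuned on held-out trial pairs (e.g. at the equal error rate);
    0.7 here is an arbitrary placeholder.
    """
    return cosine_similarity(voiceprint_a, voiceprint_b) >= threshold

rng = np.random.default_rng(0)
enrolled = rng.normal(size=192)                      # voiceprint extracted at enrollment time
later = enrolled + rng.normal(scale=0.4, size=192)   # same speaker, years later
impostor = rng.normal(size=192)                      # a different speaker

print(same_speaker(enrolled, later))     # expected True
print(same_speaker(enrolled, impostor))  # expected False
```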
150

A comparative study of automatic text summarization using human evaluation and automatic measures / En jämförande studie av automatisk textsammanfattning med användning av mänsklig utvärdering och automatiska mått

Wennstig, Maja January 2023
Automatic text summarization has emerged as a promising solution to manage the vast amount of information available on the internet, enabling a wider audience to access it. Nevertheless, further development and experimentation with different approaches are still needed. This thesis explores the potential of combining extractive and abstractive approaches into a hybrid method, generating three types of summaries: extractive, abstractive, and hybrid. The news articles used in the study are from the Swedish newspaper Dagens Nyheter (DN). The quality of the summaries is assessed using various automatic measures, including ROUGE, BERTScore, and Coh-Metrix. Additionally, human evaluations are conducted to compare the different types of summaries in terms of perceived fluency, adequacy, and simplicity. The results of the human evaluation show a statistically significant difference between extractive, abstractive, and hybrid summaries with regard to fluency, adequacy, and simplicity. Specifically, there is a significant difference between abstractive and hybrid summaries in terms of fluency and simplicity, but not in adequacy. The automatic measures, however, do not show significant differences between the different summaries but tend to give higher scores to the hybrid and abstractive summaries.
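
As an illustration of one of the automatic measures mentioned, here is a minimal ROUGE comparison of the three summary types using the rouge-score package (BERTScore and Coh-Metrix are omitted); the reference and summaries are toy placeholders, not DN articles.

```python
from rouge_score import rouge_scorer

reference = "The city council approved the new climate plan after a long debate."
candidates = {
    "extractive": "The city council approved the new climate plan.",
    "abstractive": "After lengthy discussion, officials adopted the climate proposal.",
    "hybrid": "The council approved the climate plan after a long debate.",
}

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
for name, summary in candidates.items():
    scores = scorer.score(reference, summary)
    print(name, {k: round(v.fmeasure, 3) for k, v in scores.items()})
```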
