11

Zpracování češtiny s využitím kontextualizované reprezentace / Czech NLP with Contextualized Embeddings

Vysušilová, Petra January 2021 (has links)
With the increasing amount of digital data in the form of unstructured text, the importance of natural language processing (NLP) grows. The most successful technologies of recent years are deep neural networks. This work applies state-of-the-art methods, namely transfer learning with Bidirectional Encoder Representations from Transformers (BERT), to three Czech NLP tasks: part-of-speech tagging, lemmatization, and sentiment analysis. We applied a BERT model with a simple classification head to three Czech sentiment datasets: mall, facebook, and csfd, and achieved state-of-the-art results. We also explored several possible architectures for tagging and lemmatization and obtained new state-of-the-art results for both tasks with a fine-tuning approach on data from the Prague Dependency Treebank. Specifically, we achieved an accuracy of 98.57% for tagging, 99.00% for lemmatization, and 98.19% joint accuracy for both tasks. The best models for all tasks are publicly available.
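As a rough sketch of the pattern this abstract describes (a pre-trained BERT encoder topped with a simple classification head), the snippet below sets up sentence classification with Hugging Face Transformers. The multilingual checkpoint and three-way label scheme are stand-in assumptions; the thesis's actual models and the mall, facebook, and csfd datasets are not reproduced here.

```python
# Hedged sketch: BERT encoder + simple classification head for sentiment.
# The checkpoint and label count are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=3  # negative / neutral / positive
)

texts = ["Skvělý produkt, doporučuji!", "Nic moc, zklamání."]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits
print(logits.argmax(dim=-1))  # class per sentence (head untrained until fine-tuned)
```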
12

Predictive maintenance using NLP and clustering support messages

Yilmaz, Ugur January 2022 (has links)
Communication with customers is a major part of the customer experience as well as a rich source for data mining. More businesses are engaging with consumers via text messages. Before 2020, 39% of businesses already used some form of text messaging to communicate with their consumers, and many more were expected to adopt the technology after 2020 [1]. Email response rates are merely 8%, compared to 45% for text messaging [2]. A significant portion of this communication involves customer enquiries or support messages sent in both directions. According to estimates, more than 80% of today's data is stored in an unstructured format (such as text, image, audio, or video) [3], with a significant portion of it stated in ambiguous natural language. When analyzing such data, qualitative data analysis techniques are usually employed. To facilitate the automated examination of huge corpora of textual material, researchers have turned to natural language processing techniques [4]. In light of the statistics above, Billogram [5] decided that support messages between creditors and recipients can be mined for predictive maintenance purposes, such as early identification of an outlier like a bug, defect, or wrongly built feature. As a one-sentence goal definition, Billogram is looking for an answer to "why are people reaching out to begin with?" This thesis project discusses implementing unsupervised clustering of support messages using natural language processing methods, along with performance metrics for the results, to answer Billogram's question. The research also covers intent recognition of the clustered messages in two different ways, one automatic and one semi-manual, and the results are discussed and compared. The first approach, LDA with manual intent assignment, produced 100 topics with a coherence score of 0.293. The second approach produced 158 clusters with UMAP and HDBSCAN, with automatic intent recognition. Creating clusters helps identify issues that can become subjects of increased focus, automation, or even down-prioritizing; therefore, this research lands in the predictive maintenance area [9]. This study, which will improve over time with more iterations within the company, also contains preliminary work on "labeling" or "describing" clusters and their intents.
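For readers curious how the second pipeline fits together, here is a minimal sketch that embeds support messages, reduces dimensionality with UMAP, and clusters with HDBSCAN. The embedding model, parameters, and messages are illustrative assumptions, not Billogram's configuration.

```python
# Minimal sketch of the UMAP + HDBSCAN pipeline; all parameters are assumed.
from sentence_transformers import SentenceTransformer
import umap      # package: umap-learn
import hdbscan

messages = [
    "The invoice amount looks wrong",
    "I was charged twice this month",
    "Why was I charged twice?",
    "How do I change my payment date?",
    "Can I postpone my payment?",
    "The app crashes when I open an invoice",
]

embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(messages)
reduced = umap.UMAP(n_neighbors=3, n_components=2,
                    metric="cosine").fit_transform(embeddings)
labels = hdbscan.HDBSCAN(min_cluster_size=2).fit_predict(reduced)
print(labels)  # one cluster id per message; -1 marks noise/outliers
```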
13

The Concept of Descent in "Le Tombeau des Rois" as Developed in Kamouraska

Good, Ewan January 2009 (has links) (PDF)
No description available.
14

Multi-Perspective Semantic Information Retrieval in the Biomedical Domain

January 2020 (has links)
Information Retrieval (IR) is the task of obtaining pieces of data (such as documents or snippets of text) that are relevant to a particular query or need from a large repository of information. IR is a valuable component of several downstream Natural Language Processing (NLP) tasks, such as Question Answering. Practically, IR is at the heart of many widely used technologies like search engines. While probabilistic ranking functions, such as the Okapi BM25 function, have been utilized in IR systems since the 1970s, modern neural approaches offer certain advantages over their classical counterparts. In particular, the release of BERT (Bidirectional Encoder Representations from Transformers) has had a significant impact in the NLP community by demonstrating how a Masked Language Model (MLM) trained on a considerable corpus of data can improve a variety of downstream NLP tasks, including sentence classification and passage re-ranking. IR systems are also important in the biomedical and clinical domains. Given the continuously increasing amount of scientific literature in the biomedical domain, the ability to find answers to specific clinical queries in a repository of millions of articles is of practical value to medics, doctors, and other medical professionals. Moreover, the biomedical domain presents domain-specific challenges, including handling clinical jargon and evaluating the similarity or relatedness of medical symptoms when determining the relevance between a query and a sentence. This work presents contributions to several aspects of the Biomedical Semantic Information Retrieval domain. First, it introduces Multi-Perspective Sentence Relevance, a novel methodology for utilizing BERT-based models for contextual IR. The system is evaluated using the BioASQ Biomedical IR Challenge. Finally, practical contributions are provided in the form of a live IR system for medics and a proposed challenge on the Living Systematic Review clinical task. / Dissertation/Thesis / Masters Thesis Computer Science 2020
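To make the contrast between classical and neural ranking concrete, the sketch below pairs BM25 candidate retrieval with cross-encoder re-ranking. The corpus, query, and model checkpoint are invented for illustration; this is not the thesis's system or the BioASQ setup.

```python
# Two-stage retrieval sketch: BM25 ranking, then neural re-ranking.
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

corpus = [
    "Aspirin reduces fever and relieves mild to moderate pain.",
    "BERT is a masked language model pre-trained on large corpora.",
    "Ibuprofen is a nonsteroidal anti-inflammatory drug that lowers fever.",
]
query = "Which drugs treat fever?"

# Stage 1: probabilistic ranking with Okapi BM25 over tokenized documents.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
scores = bm25.get_scores(query.lower().split())
candidates = sorted(range(len(corpus)), key=lambda i: -scores[i])[:2]

# Stage 2: re-rank the candidates with a BERT-style cross-encoder.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pair_scores = reranker.predict([(query, corpus[i]) for i in candidates])
for i, s in sorted(zip(candidates, pair_scores), key=lambda t: -t[1]):
    print(f"{s:.3f}  {corpus[i]}")
```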
15

Neural Dependency Parsing of Low-resource Languages: A Case Study on Marathi

Zhang, Wenwen January 2022 (has links)
Cross-lingual transfer has been shown to be effective for dependency parsing of some low-resource languages, but it typically requires closely related high-resource languages. Pre-trained deep language models significantly improve model performance in cross-lingual tasks. We evaluate cross-lingual model transfer for parsing Marathi, a low-resource language that does not have a closely related high-resource language, and investigate monolingual modeling for comparison. We experiment with two state-of-the-art language models: mBERT and XLM-R. Our experimental results show that the cross-lingual model transfer approach still holds with distantly related source languages, and that models benefit most from XLM-R. We also evaluate the impact of multi-task learning by training all UD tasks simultaneously, and find that it yields mixed results for dependency parsing and degrades the transfer performance of the best-performing source language, Ancient Greek.
16

Unsupervised multilingual distractor generation for fill-in-the-blank questions

Han, Zhe January 2022 (has links)
Fill-in-the-blank multiple choice questions (MCQs) play an important role in education, but generating them manually is quite resource-consuming, so the task has gradually become an attractive NLP problem. Within it, question creation has become a mainstream NLP research topic, while distractor (wrong alternative) generation (DG) still remains out of the spotlight. Although several studies on distractor generation have been conducted in recent years, there is little previous work on languages other than English. The goal of this thesis is to generate multilingual distractors in Chinese, Arabic, German, and English across domains. The initial step is to construct small multilingual scientific datasets (En, Zh, Ar, and De) and general datasets (Zh and Ar) from scratch. Given the limited availability of multilingual labelled datasets, unsupervised experiments based on WordNet, word embeddings, transformer-based models, translation methods, and domain adaptation are conducted to generate candidate distractors. The performance of the methods is then evaluated against our newly created datasets using three metrics. The results show that monolingual transformer-based methods together with translation-based methods outperform the other approaches across the multilingual datasets, except for German, which reaches its highest score only through the translation-based approach. Distractor generation proved simplest to implement for the English datasets and most difficult for the Arabic ones.
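As one concrete instance of the WordNet-based strategy mentioned above, the sketch below collects co-hyponyms (siblings under a shared hypernym) of the correct answer as distractor candidates. The example word and cutoff are assumptions; the multilingual pipelines for Zh, Ar, and De are not shown.

```python
# Sketch of WordNet-based distractor candidates via co-hyponyms.
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

def wordnet_distractors(answer, max_candidates=5):
    candidates = []
    for synset in wn.synsets(answer):
        for hypernym in synset.hypernyms():
            for sibling in hypernym.hyponyms():  # co-hyponyms of the answer
                for name in sibling.lemma_names():
                    name = name.replace("_", " ")
                    if name.lower() != answer.lower() and name not in candidates:
                        candidates.append(name)
    return candidates[:max_candidates]

print(wordnet_distractors("oxygen"))  # e.g. other chemical elements
```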
17

A Language-Model-Based Approach for Detecting Incompleteness in Natural-Language Requirements

Luitel, Dipeeka 24 May 2023 (has links)
[Context and motivation]: Incompleteness in natural-language requirements is a challenging problem. [Question/Problem]: A common technique for detecting incompleteness in requirements is checking the requirements against external sources. With the emergence of language models such as BERT, an interesting question is whether language models are useful external sources for finding potential incompleteness in requirements. [Principal ideas/results]: We mask words in requirements and have BERT's masked language model (MLM) generate contextualized predictions for filling the masked slots. We simulate incompleteness by withholding content from requirements and measure BERT's ability to predict terminology that is present in the withheld content but absent from the content disclosed to BERT. [Contributions]: BERT can be configured to generate multiple predictions per mask. Our first contribution is determining the number of predictions per mask that strikes an optimal trade-off between effectively discovering omissions in requirements and the level of noise in the predictions. Our second contribution is a machine-learning-based filter that post-processes BERT's predictions to further reduce noise. We empirically evaluate our solution over 40 requirements specifications drawn from the PURE dataset [30]. Our results indicate that: (1) predictions made by BERT are highly effective at pinpointing terminology that is missing from requirements, and (2) our filter can substantially reduce noise in the predictions, making BERT a more compelling aid for improving completeness in requirements.
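A minimal sketch of the core mechanism, assuming bert-base-uncased and an invented requirement sentence (not one drawn from the PURE dataset): mask a word and let BERT's MLM propose fillers, with top_k mirroring the predictions-per-mask trade-off studied above.

```python
# Hedged sketch: masked-slot prediction with BERT's MLM head.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
requirement = "The system shall encrypt all [MASK] data before transmission."
for pred in fill_mask(requirement, top_k=5):  # top_k = predictions per mask
    print(f"{pred['token_str']:>12}  score={pred['score']:.3f}")
```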
18

DEMOCRATISING DEEP LEARNING IN MICROBIAL METABOLITES RESEARCH / DEMOCRATISING DEEP LEARNING IN NATURAL PRODUCTS RESEARCH

Dial, Keshav January 2023 (has links)
Deep learning models are dominating performance across a wide variety of tasks. From protein folding to computer vision to voice recognition, deep learning is changing the way we interact with data. The field of natural products, and more specifically genomic mining, has been slow to adapt to these new technological innovations. As we are in the midst of a data explosion, this is not for lack of training data. Instead, it is due to the lack of a blueprint demonstrating how to correctly integrate these models to maximise performance and inference. During my PhD, I showcase the use of large language models across a variety of data domains to improve common workflows in the field of natural product drug discovery. I improved natural product scaffold comparison by representing molecules as sentences. I developed a series of deep learning models to replace archaic technologies and create a more scalable genomic mining pipeline, decreasing running times by 8x. I integrated deep learning-based genomic and enzymatic inference into legacy tooling to improve the quality of short-read assemblies. I also demonstrate how intelligent querying of multi-omic datasets can facilitate gene signature prediction for encoded microbial metabolites. The models and workflows I developed are wide in scope, in the hope of providing a blueprint for how these industry-standard tools can be applied across the entirety of natural product drug discovery. / Thesis / Doctor of Philosophy (PhD)
19

Aspektbaserad Sentimentanalys för Business Intelligence inom E-handeln / Aspect-Based Sentiment Analysis for Business Intelligence in E-commerce

Eriksson, Albin, Mauritzon, Anton January 2022 (has links)
Many companies strive to make data-driven decisions. To achieve this, they need to explore new tools for Business Intelligence. The aim of this study was to examine the performance and usability of aspect-based sentiment analysis as a tool for Business Intelligence in e-commerce. The study was conducted in collaboration with Ellos Group AB, which supplied anonymous customer feedback data. The implementation consists of two parts: aspect extraction and sentiment classification. The first part, aspect extraction, was implemented using dependency parsing and various aspect grouping techniques. The second part, sentiment classification, was implemented using the language model KB-BERT, a Swedish version of the BERT model. The method for aspect extraction achieved a satisfactory precision of 79.5% but a recall of only 27.2%. Moreover, the result for sentiment classification was unsatisfactory, with an accuracy of 68.2%. Although the results fell short of expectations, we conclude that aspect-based sentiment analysis in general is a valuable tool for Business Intelligence, both as a means of generating customer insights from previously unused data and as a way to increase productivity. However, it should only be used as a supportive tool and not replace existing processes for decision-making.
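To illustrate the dependency-parsing side of such a pipeline, here is a small sketch that pairs nouns with their adjectival or copular modifiers as (aspect, opinion) candidates. It uses spaCy's small English model as a stand-in; the study parsed Swedish feedback and classified sentiment with KB-BERT, neither of which is reproduced here.

```python
# Sketch of dependency-parsing-based aspect extraction (English stand-in).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick delivery impressed me but the fabric feels cheap.")

for token in doc:
    # Adjectival modifier attached to a noun: "quick delivery".
    if token.dep_ == "amod" and token.head.pos_ == "NOUN":
        print(f"aspect={token.head.text!r}  opinion={token.text!r}")
    # Copular/complement pattern: "the fabric feels cheap" -> nsubj + acomp.
    if token.dep_ == "acomp":
        for child in token.head.children:
            if child.dep_ == "nsubj":
                print(f"aspect={child.text!r}  opinion={token.text!r}")
```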
20

Optimizing the Performance of Text Classification Models by Improving the Isotropy of the Embeddings using a Joint Loss Function

Attieh, Joseph January 2022 (has links)
Recent studies show that the spatial distribution of sentence representations generated by pre-trained language models is highly anisotropic, meaning that the representations are not uniformly distributed across the directions of the embedding space. The expressiveness of the embedding space is thus limited, as the embeddings are less distinguishable and less diverse, which degrades model performance on downstream tasks. Most state-of-the-art methods in this area improve the isotropy of sentence embeddings by refining the corresponding contextual word representations and then deriving the sentence embeddings from these refined representations. In this thesis, we propose to improve the quality and distribution of the sentence embeddings extracted from the [CLS] token of pre-trained language models by improving the isotropy of the embeddings. We add one feed-forward layer, referred to as the Isotropy Layer, between the model and the downstream task layers, and train it with a novel joint loss function that optimizes an isotropy quality measure alongside the downstream task loss. This joint loss pushes the embeddings output by the Isotropy Layer to be more isotropic while retaining the semantics needed to perform the downstream task. The proposed approach yields transformed embeddings with better isotropy that generalize better on the downstream task, and it requires training only one feed-forward layer instead of retraining the whole network. We quantify and evaluate isotropy through multiple metrics, mainly the Explained Variance and the IsoScore. Experimental results on three GLUE datasets with classification as the downstream task show that our proposed method is on par with the state of the art, achieving performance gains of around 2-3% on the downstream tasks compared to the baseline. We also present a small case study on a language abuse detection dataset and interpret some of the findings in light of the results.
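A hedged PyTorch sketch of the training setup described above, assuming a frozen encoder whose [CLS] embeddings feed a single feed-forward Isotropy Layer. The isotropy penalty used here (pushing the batch covariance toward the identity) is one simple proxy chosen for illustration; the thesis's exact quality measure and hyperparameters may differ.

```python
# Joint loss sketch: downstream task loss + isotropy penalty on one added layer.
import torch
import torch.nn as nn

class IsotropyLayer(nn.Module):
    def __init__(self, hidden=768):
        super().__init__()
        self.ff = nn.Linear(hidden, hidden)  # the single trainable layer

    def forward(self, cls_embeddings):
        return self.ff(cls_embeddings)

def isotropy_penalty(z):
    z = z - z.mean(dim=0, keepdim=True)          # center the batch
    cov = (z.T @ z) / max(z.size(0) - 1, 1)      # batch covariance
    eye = torch.eye(z.size(1), device=z.device)
    return ((cov - eye) ** 2).mean()             # distance from identity

iso_layer = IsotropyLayer()
task_head = nn.Linear(768, 2)                    # downstream classifier
lam = 0.1                                        # assumed weighting hyperparameter

cls = torch.randn(16, 768)                       # stand-in for frozen [CLS] embeddings
labels = torch.randint(0, 2, (16,))

z = iso_layer(cls)
loss = nn.functional.cross_entropy(task_head(z), labels) + lam * isotropy_penalty(z)
loss.backward()  # gradients flow only into iso_layer and task_head
```

Because only the added layer and the task head receive gradients, training stays cheap relative to fine-tuning the full encoder, which matches the efficiency argument made in the abstract.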
