941

Decentralizing Large-Scale Natural Language Processing with Federated Learning / Decentralisering av storskalig naturlig språkbearbetning med förenat lärande

Garcia Bernal, Daniel January 2020 (has links)
Natural Language Processing (NLP) is one of the most popular and visible forms of Artificial Intelligence in recent years, partly because it deals with a defining characteristic of human beings: language. NLP applications make it possible to create new services in the industrial sector, offering new solutions and significant productivity gains. All of this has been driven by the rapid progress of Deep Learning models. Large-scale contextual representation models, such as Word2Vec, ELMo and BERT, have significantly advanced NLP in recent years. With these latest NLP models, it is possible to understand the semantics of text to a degree never seen before. However, they require large amounts of text data to achieve high-quality results. This data can be gathered from different sources, but one of the main collection points is devices such as smartphones, smart appliances and smart sensors. Unfortunately, joining and accessing all this data from multiple sources is extremely challenging for privacy and regulatory reasons. New protocols and techniques have been developed to overcome this limitation by training models in a massively distributed manner, taking advantage of the computational power of the devices that generate the data. In particular, this research tests the viability of training NLP models, specifically Word2Vec, with a massively distributed protocol such as Federated Learning. The results show that Federated Word2Vec works as well as Word2Vec in most scenarios, even surpassing it in some semantic benchmark tasks. It is a novel area of research, where few studies have been conducted, with a large knowledge gap for future research to fill.
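The federated protocol the abstract describes can be illustrated with a minimal sketch of the server-side aggregation step. This is not code from the thesis; the FedAvg-style weighting, matrix shapes, and all names are assumptions.

```python
import numpy as np

def federated_average(client_embeddings, client_sizes):
    """FedAvg-style aggregation of per-client Word2Vec embedding matrices.

    client_embeddings: list of (vocab_size, dim) arrays, one per client
    client_sizes: number of training tokens each client trained on
    """
    total = sum(client_sizes)
    agg = np.zeros_like(client_embeddings[0])
    for emb, n in zip(client_embeddings, client_sizes):
        agg += (n / total) * emb  # clients with more data weigh more
    return agg

# One hypothetical communication round: clients train Word2Vec locally on
# their private text (local training step omitted), then the server averages.
clients = [np.random.randn(1000, 100) for _ in range(5)]  # stand-in local models
sizes = [120_000, 80_000, 200_000, 50_000, 95_000]        # tokens per client
global_embeddings = federated_average(clients, sizes)
print(global_embeddings.shape)  # (1000, 100)
```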
942

Log Classification using a Shallow-and-Wide Convolutional Neural Network and Log Keys / Logklassificering med ett grunt-och-brett faltningsnätverk och loggnycklar

Annergren, Björn January 2018 (has links)
A dataset consisting of logs describing test results from a single Build and Test process, used in a Continuous Integration setting, is utilized to automate categorization of the logs according to failure type. Two different features are evaluated, words and log keys, using unordered document matrices as document representations to determine the viability of log keys. The experiment uses Multinomial Naive Bayes (MNB) classifiers and multi-class Support Vector Machines (SVMs) to establish the performance of the different features. The experiment indicates that log keys are equivalent to words while achieving a great reduction in dictionary size. Three different multi-layer perceptrons are evaluated on the log key document matrices, achieving slightly higher cross-validation accuracies than the SVM. A shallow-and-wide Convolutional Neural Network (CNN) is then designed using temporal sequences of log keys as document representations. The top-performing model of each architecture is evaluated on a test set, except for the MNB classifiers, which had subpar performance during cross-validation. The test set evaluation indicates that the CNN is superior to the other models.
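A minimal sketch of the unordered document-matrix setup the abstract describes, using scikit-learn; the log keys, labels, and failure types below are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Hypothetical data: each document is a space-separated sequence of log keys
# (template IDs extracted from raw log lines), labelled with a failure type.
docs = ["k12 k7 k7 k93", "k12 k55 k8", "k3 k3 k41 k12", "k55 k8 k8"]
labels = ["timeout", "compile_error", "timeout", "compile_error"]

# Unordered document matrix: log-key counts, with ordering discarded
X = CountVectorizer().fit_transform(docs)

# Compare an MNB classifier with a linear multi-class SVM, as in the experiment
for clf in (MultinomialNB(), LinearSVC()):
    scores = cross_val_score(clf, X, labels, cv=2)
    print(type(clf).__name__, scores.mean())
```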
943

Plot Extraction and the Visualization of Narrative Flow

DeBuse, Michael A. 20 July 2021 (has links)
To facilitate the automated extraction of complex features and structures within narrative, namely plot in this study, two proof-of-concept methods of narrative visualization are presented with the goal of representing the plot of the narrative. Plot is defined to give a basis for quality assessment and comparison. The first visualization is a scatter plot of entities within the story, but because it fails to uphold the definition of plot, in-depth analysis is not performed on it. The second visualization is a graph structure that better represents a mapping of the plot of the story. Narrative structures commonly found within the plot maps are shown and discussed, and comparisons with ground-truth plot maps show that this method of visualization represents the plot and narrative flow of the stories.
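The thesis's actual plot-map format is not reproduced here, but the graph-structure idea can be sketched as follows; the events, links, and library choice are all illustrative assumptions.

```python
import networkx as nx

# Hypothetical plot map: nodes are plot events, edges are narrative links
events = [
    (0, "Hero leaves home"),
    (1, "Mentor gives warning"),
    (2, "Hero ignores warning"),
    (3, "Disaster strikes"),
]
links = [(0, 1), (1, 2), (2, 3), (1, 3)]  # (1, 3): the warning foreshadows the disaster

G = nx.DiGraph()
for idx, label in events:
    G.add_node(idx, label=label)
G.add_edges_from(links)

# Narrative flow read off the map: every path from the opening event to the climax
for path in nx.all_simple_paths(G, source=0, target=3):
    print(" -> ".join(G.nodes[i]["label"] for i in path))
```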
944

SPOONS: Netflix Outage Detection Using Microtext Classification

Augustine, Eriq A 01 March 2013 (has links) (PDF)
Every week there are over a billion new posts to Twitter services, and many of those messages contain feedback to companies about their services. One company that recognizes this underused source of information is Netflix. That is why Netflix initiated the development of a system that lets it respond to the millions of Twitter and Netflix users who act as sensors and report all types of user-visible outages. This system enhances the feedback loop between Netflix and its customers by increasing the amount of customer feedback Netflix receives and reducing the time it takes for Netflix to receive the reports and respond to them. The goal of the SPOONS (Swift Perceptions of Online Negative Situations) system is to use Twitter posts to determine when Netflix users are reporting a problem with any of the Netflix services. This work covers the architecture of the SPOONS system and framework, as well as outage detection using tweet classification.
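As a rough sketch of the tweet-classification step: the actual SPOONS classifiers and features are described in the thesis itself; the tweets, labels, and pipeline below are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labelled tweets: 1 = reports a Netflix outage, 0 = does not
tweets = [
    "netflix is down again, can't stream anything",
    "netflix keeps buffering, is the service broken?",
    "just finished a great show on netflix",
    "netflix and chill tonight",
]
labels = [1, 1, 0, 0]

# One plausible microtext classifier: TF-IDF over word unigrams/bigrams
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(tweets, labels)
print(model.predict(["why is netflix not loading"]))
```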
945

Synthetic data generation for domain adaptation of a retriever-reader Question Answering system for the Telecom domain : Comparing dense embeddings with BM25 for Open Domain Question Answering / Syntetisk data genering för domänadaptering av ett retriever-readerbaserat frågebesvaringssystem för telekomdomänen : En jämförelse av dense embeddings med BM25 för Öpen Domän frågebesvaring

Döringer Kana, Filip January 2023 (has links)
Having computer systems capable of answering questions has been a goal of Natural Language Processing research for many years. Machine Learning systems have recently become increasingly proficient at this task, with large language models obtaining state-of-the-art performance. Retriever-reader architectures have become a powerful approach for building systems that enable users to enter questions and get factual answers from a corpus of documents. This architecture uses a retriever component that fetches the most relevant documents and a reader that in turn extracts the answer from those documents. Such systems commonly use transformer-based models for both components, fine-tuned on a general domain of documents such as Wikipedia. However, the performance of such systems on new domains with different vocabularies can be lacking. Furthermore, new domains of, for instance, company-specific documents often lack annotated data, which makes training new models cumbersome. This thesis investigated how a retriever-reader architecture can be adapted to a corpus of Telecom documents by generating question-answer data using a large generative language model, GPT3.5. It also compared a dense retriever using BERT with a BM25-based retriever on the domain. Findings suggest that generating training data can be an effective approach for fine-tuning a dense retriever, increasing the Top-K retrieval accuracy by 20 points for k = 10 compared to a dense retriever fine-tuned on Wikipedia. Additionally, the sparse retriever outperforms the best dense retriever, although there is reason to believe that the structure of the test dataset could influence this. Finally, the results also indicate that the performance of the reader is not improved by the generated data, although future work is needed to draw firmer conclusions.
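The sparse-versus-dense comparison can be sketched as follows. The thesis fine-tunes its own BERT-based retriever on generated Telecom data; this illustration instead uses the off-the-shelf rank_bm25 and sentence-transformers libraries, and the corpus and question are invented.

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = [
    "The eNodeB handles radio resource management in LTE.",
    "5G NR uses flexible numerology for different subcarrier spacings.",
    "Handover is triggered when a neighbour cell's signal exceeds a threshold.",
]
question = "What triggers a handover between cells?"

# Sparse retrieval: BM25 over whitespace tokens
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
print(bm25.get_scores(question.lower().split()))

# Dense retrieval: cosine similarity of sentence embeddings
# (an off-the-shelf model, not the thesis's fine-tuned retriever)
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = model.encode(corpus, convert_to_tensor=True)
q_emb = model.encode(question, convert_to_tensor=True)
print(util.cos_sim(q_emb, doc_emb))  # rank documents by similarity to the question
```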
946

Embodied Virtual Reality: The Impacts of Human-Nature Connection During Engineering Design

Trump, Joshua Jordan 19 March 2024 (has links)
The engineering design process can underutilize nature-based solutions during infrastructure development. Instances of nature within the built environment are reflections of the human-nature connection, which may alter how designers ideate solutions to a given design task, especially through virtual reality (VR) as an embodied perspective-taking platform. Embodied VR helps designers "see" as an end-user sees, inclusive of the natural environment, through the uptake of an avatar such as a bird or fish. Embodied VR elicits empathy toward the avatar; e.g., seeing as a bird in VR, one tends to feel and think as a bird. Furthermore, embodied VR also impacts altruistic behavior toward the environment, specifically through proenvironmental behaviors. However, little research has examined the impact of embodied VR on the human-nature connection, or whether embodied VR has any impact on how designers ideate, specifically surrounding nature-based solutions as a form of proenvironmental behavior during the design process. This research first presents a formal measurement of embodied VR's impact on the human-nature connection and maps this impact onto design-related proenvironmental behaviors through design ideas, i.e., tracking changes in nature-based design choices. The design study consisted of three groups of engineering undergraduate students who were given a case study and plan review: a VR group embodying a bird (n=35), a self-lens VR group (n=34), and a control group (n=33). The case study concerned a federal mandate to minimize combined sewer overflow in a neighborhood within Cincinnati, OH. Following the plan review, the VR groups were given a VR walkthrough or flythrough of the case study area of interest as a selected avatar (embodied: bird; self-lens: oneself). Participants were tested for their connectedness to nature, and a mock design charrette was held to measure engineering design ideas. Verbal protocol analysis was followed, instructing participants to think aloud. Design ideation sessions were recorded and manually transcribed. The results indicated that embodiment impacts the human-nature connection based on participants' perceived connection to nature. Only the bird group showed an increase in connectedness to nature, whereas the self-lens and control groups did not report any change. This change in connectedness to nature was also confirmed by engineering design ideas. The bird group was more likely to ideate green-thinking designs to solve the stormwater issue and benefit both nature and socioeconomic conditions, whereas the control group mostly discussed gray designs as the catalyst for minimizing combined sewer overflows. The self-lens group also mentioned green design ideas as well as socioeconomic change, but mostly placed the beneficiary of the design on people rather than on nature as the bird group did. The analysis of these findings was driven by thematic content analysis, an exploration of design space as a function of semantic distance, and large language models (LLMs) used to synthesize design ideas and themes. The LLM summarized design ideas with accuracy comparable to thematic content analysis, but struggled to cross-compare groups to provide generalizable findings. This research is intended to benefit the engineering design process with a) the benefit of perspective-taking on design ideas based on lenses of embodied VR and b) various methods to supplement thematic content analysis for coding design ideas.
/ Doctor of Philosophy / The use of nature in the constructed world, such as rain gardens and natural streams for moving stormwater, is underused during the design process. Virtual reality (VR) techniques, like embodiment, have the potential to increase the incorporation of nature and nature-based elements during design. Embodiment is the process of taking on the vantage point of another being or avatar, such as a bird, fish, insect, or other being, in order to see and move as the avatar does. Embodied VR increases the likelihood that the VR participant will act favorably toward the subject, specifically when the natural environment is involved. For example, embodying an individual cutting down trees in a virtual forest increased the likelihood that individuals would act favorably toward the environment, such as by recycling or conserving energy (Ahn and Bailenson, 2012). Ultimately, this research measures the level of connection participants feel with the environment after an embodied VR experience and seeks to discover whether this change in connection to nature affects how participants design a solution to a problem. The design experiment is based on a case study, which all participants were provided alongside supplemental plan documents. The case study concerns stormwater issues and overflows from infrastructure in a neighborhood in Cincinnati, OH, where key decision-makers were mandated by the federal government to minimize the overflows. The bird group (a bird avatar) performed a fly-through of the area of interest in VR, whereas the self-lens group (first-person, embodying oneself) walked through the same area. The control group received no VR intervention. Following the intervention, participants were asked to re-design the neighborhood and narrate their recorded solution. Participants then completed a questionnaire measuring their connectedness to nature. The results show that when people experienced the space as a bird in virtual reality, they felt more connected to nature and also included more ideas related to nature in their designs. More specifically, ideas involving green infrastructure (nature-based elements such as rain gardens and streams) and socioeconomic benefits were brought up by the bird group. This research presents embodiment as a tool that can change how engineers design. As stormwater policy has called for more use of green infrastructure (notably through the Environmental Protection Agency), embodiment may be used during the design process to meet this call from governmental programs. Furthermore, this research shows how embodiment's effects on design can be interpreted, specifically through quantitative natural language processing methods and the use of large language models to analyze data and report design-related findings. This research is intended to benefit the design process with a) using different avatars in embodiment to impact design ideas and b) a comparison of thematic content analysis and large language models in summarizing design ideas and themes.
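One of the analysis methods mentioned, exploring the design space as a function of semantic distance, can be sketched with sentence embeddings. The design ideas and embedding model below are illustrative assumptions, not the study's actual pipeline.

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical transcribed design ideas from a charrette session
ideas = [
    "Add rain gardens along the street to absorb stormwater runoff.",
    "Install larger concrete pipes to carry overflow to the treatment plant.",
    "Daylight the buried creek and plant native vegetation along its banks.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(ideas, convert_to_tensor=True)

# Semantic distance = 1 - cosine similarity; larger pairwise distances
# suggest a group explored a wider region of the design space.
sim = util.cos_sim(emb, emb)
for i in range(len(ideas)):
    for j in range(i + 1, len(ideas)):
        print(i, j, round(1 - sim[i][j].item(), 3))
```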
947

Incremental Re-tokenization in BPE-trained SentencePiece Models

Hellsten, Simon January 2024 (has links)
This bachelor's thesis in Computer Science explores the efficiency of an incremental re-tokenization algorithm in the context of BPE-trained SentencePiece models used in natural language processing. The thesis begins by underscoring the critical role of tokenization in NLP, particularly highlighting the complexities introduced by modifications to tokenized text. It then presents an incremental re-tokenization algorithm, detailing its development and evaluating its performance against full re-tokenization of the text. Experimental results demonstrate that this incremental approach is more time-efficient than full re-tokenization, especially on large text datasets. This efficiency is attributed to the algorithm's localized re-tokenization strategy, which limits processing to the text around modifications. The research concludes by suggesting that incremental re-tokenization could significantly enhance the responsiveness and resource efficiency of text-based applications, such as chatbots and virtual assistants. Future work may focus on predictive models to anticipate the impact of text changes on token stability and on optimizing the algorithm for different text contexts.
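The core idea, re-tokenizing only a local window around an edit and splicing the result back, might look like the following sketch. The thesis's actual algorithm, and how it bounds the affected region, may differ; here whitespace tokenization stands in for a BPE-trained SentencePiece model, and all names are illustrative.

```python
def incremental_retokenize(text, tokens, spans, edit_start, edit_end, new_text,
                           tokenize, window=2):
    """Re-tokenize only a window of tokens around an edit, then splice.

    tokens: current token list; spans: (start, end) character span per token
    tokenize: any callable mapping a string to tokens (e.g. a BPE-trained
    SentencePiece model's encode method). Returns the new token list.
    """
    # First/last tokens touching the edit, padded by a safety window of tokens
    first = max(0, next(i for i, (s, e) in enumerate(spans) if e > edit_start) - window)
    last = min(len(tokens), next(i for i in range(len(spans) - 1, -1, -1)
                                 if spans[i][0] < edit_end) + 1 + window)
    region_start, region_end = spans[first][0], spans[last - 1][1]
    # Apply the edit inside the region only, re-tokenize that slice, and splice
    new_region = text[region_start:edit_start] + new_text + text[edit_end:region_end]
    return tokens[:first] + tokenize(new_region) + tokens[last:]

# Toy usage: replace "brown" (chars 10-15) with "red"
text = "the quick brown fox jumps"
tokens = text.split()
spans, pos = [], 0
for t in tokens:
    start = text.index(t, pos)
    spans.append((start, start + len(t)))
    pos = start + len(t)
print(incremental_retokenize(text, tokens, spans, 10, 15, "red", str.split))
```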
948

Language Modeling Using Image Representations of Natural Language

Cho, Seong Eun 07 April 2023 (has links) (PDF)
This thesis presents training of an end-to-end autoencoder model using the transformer, with an encoder that can encode sentences into fixed-length latent vectors and a decoder that can reconstruct the sentences using image representations. Encoding and decoding sentences to and from these image representations are central to the model design. This method allows new sentences to be generated by traversing the Euclidean space, which makes vector arithmetic possible using sentences. Machines excel in dealing with concrete numbers and calculations, but do not possess an innate infrastructure designed to help them understand abstract concepts like natural language. In order for a machine to process language, scaffolding must be provided wherein the abstract concept becomes concrete. The main objective of this research is to provide such scaffolding so that machines can process human language in an intuitive manner.
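The "traversing the Euclidean space" idea can be sketched as linear interpolation between two latent vectors. The vectors below are random stand-ins; the thesis's transformer encoder and image-based decoder are omitted.

```python
import numpy as np

# Hypothetical fixed-length latent vectors for two sentences, as the
# thesis's encoder would produce from image representations of the text
z_a = np.random.randn(256)  # stand-in for encode("A dog runs in the park.")
z_b = np.random.randn(256)  # stand-in for encode("A cat sleeps on the couch.")

# Traversing the Euclidean space between them: decoding each interpolated
# point should yield sentences shifting gradually from one meaning to the other
for alpha in np.linspace(0.0, 1.0, 5):
    z = (1 - alpha) * z_a + alpha * z_b
    # decode(z) -> new sentence (decoder omitted in this sketch)
    print(round(float(alpha), 2), z[:3])
```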
949

IMAGE CAPTIONING USING TRANSFORMER ARCHITECTURE

Wrucha A Nanal (14216009) 06 December 2022 (has links)
The domain of Deep Learning concerned with generating textual descriptions of images is called "Image Captioning." The central idea behind Image Captioning is to identify key features of an image and create meaningful sentences that describe the image. Popular current models include Convolutional Neural Network - Long Short-Term Memory (CNN-LSTM) based models and Attention-based models. This research work first identifies the drawbacks of existing image captioning models, namely their sequential style of execution, the vanishing gradient problem, and a lack of context during training.

This work aims at resolving the identified problems by creating a Contextually Aware Image Captioning (CATIC) model. The Transformer architecture, which solves the issues of vanishing gradients and sequential execution, forms the basis of the proposed model. To inject contextualized embeddings of the caption sentences, this work uses Bidirectional Encoder Representations from Transformers (BERT). The work uses the Remote Sensing Image Captioning Dataset. The results of the CATIC model are evaluated using BLEU, METEOR and ROUGE scores. In comparison, the proposed model outperforms the CNN-LSTM model on all metrics. Compared with the Attention-based model, the CATIC model outperforms it on the BLEU-2 and ROUGE metrics and gives competitive results on the others.
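The BERT-injection step can be illustrated with the Hugging Face transformers library; this shows only how contextualized caption embeddings are obtained, not the CATIC model itself, and the caption text is invented.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

caption = "a cargo ship docked at a busy harbor"
inputs = tokenizer(caption, return_tensors="pt")
with torch.no_grad():
    outputs = bert(**inputs)

# One contextual vector per caption token, ready to feed a transformer decoder
print(outputs.last_hidden_state.shape)  # (1, num_tokens, 768)
```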
950

NOVEL DATA MINING ALGORITHMS FOR ANALYSIS OF ELECTRONIC HEALTH RECORDS

Chanda, Ashis, 0000-0002-0118-8901 January 2022 (has links)
Healthcare providers use electronic health records (EHRs) to store information about patient treatment, to support patient care management, and to securely share health information among healthcare organizations. EHRs have also been used in healthcare research on problems such as patient phenotyping, health risk prediction, and medical entity extraction. In this thesis, we focus on several important issues: (1) how to convert natural text from medical notes to vector representations suitable for deep learning algorithms, (2) how to help healthcare researchers select a patient cohort from EHRs, and (3) how to use EHRs to identify patient diagnoses and treatments. In the first part of the thesis, we present a new method for learning vector representations of medical terms. Learning vector representations of words is an important pre-processing step in many natural language processing applications. For example, EHRs contain clinical notes that describe patient health conditions and the course of treatment in a narrative style. The notes contain specialized medical terminology and many abbreviations. Learning good vector representations of specialized medical terms can improve the quality of downstream data analysis tasks on EHR data. However, traditional approaches struggle to learn vector representations of rarely used medical terms. To overcome this problem, we developed a neural network-based approach, called definition2vec, that uses external knowledge contained in medical vocabularies. We performed quantitative and qualitative analyses to measure the usefulness of the learned representations. The results demonstrate that definition2vec is superior to state-of-the-art algorithms. In the second part of the thesis, we describe a new visual interface that helps healthcare researchers select patient cohorts from EHR data. The process of identifying patients of interest for observational studies from EHR data is known as cohort selection, a challenging research problem. We considered the problem of cohort selection from medical claim data, which requires identifying a set of medical codes for selection. However, there are tens of thousands of unique medical codes, and it is very difficult for any human to decide which codes identify patients of interest. To help users define a set of codes for cohort identification, we developed an interactive system, called the Medical Claim Visualization system (MedCV), which visualizes medical code representations. MedCV analyzes a medical claim database and allows users to reason about medical code relationships and define inclusion rules for the selection by visualizing medical codes, claims, and patient timelines. Evaluation of our system through a user study indicates that MedCV enables domain experts to define inclusion rules efficiently and with high quality. The third part of the thesis is a study of the definition of acute kidney injury (AKI), a condition in which the kidneys suddenly cannot filter waste from the blood. AKI is a major cause of patient death in intensive care units (ICUs), and it is critical to detect it early. The recently published KDIGO medical guideline proposed a clinical definition of AKI using blood serum creatinine and urine output. The KDIGO definition was developed based on expert knowledge, but very little is known about how well it matches medical practice.
In this study, we investigated publicly available EHR data from 47,499 ICU admissions to determine the concordance between the KDIGO definition and AKI determination by the medical provider. We show that it is possible to find a formula, using machine learning, with much higher concordance with the medical providers' AKI coding than KDIGO, and we discuss the medical relevance of this finding. / Computer and Information Science
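The serum-creatinine arm of the KDIGO definition discussed in the third part can be sketched as a simple rule check. This is a simplification: AKI staging and the urine-output criterion are omitted, the seven-day baseline is assumed known, and the measurements are invented.

```python
from datetime import datetime, timedelta

def meets_kdigo_creatinine(series, baseline):
    """Serum-creatinine arm of the KDIGO AKI definition: a rise of
    >= 0.3 mg/dL within 48 h, or creatinine >= 1.5x the known baseline.

    series: list of (timestamp, creatinine in mg/dL), time-ordered
    """
    for i, (t_i, c_i) in enumerate(series):
        # Rise of at least 0.3 mg/dL between any two measurements within 48 h
        for t_j, c_j in series[i + 1:]:
            if t_j - t_i <= timedelta(hours=48) and c_j - c_i >= 0.3:
                return True
        # At least 1.5x the (assumed 7-day) baseline
        if c_i >= 1.5 * baseline:
            return True
    return False

# Hypothetical ICU stay: creatinine drifts from 0.9 to 1.3 mg/dL over 40 h
t0 = datetime(2022, 1, 1, 8, 0)
measurements = [(t0, 0.9), (t0 + timedelta(hours=24), 1.1),
                (t0 + timedelta(hours=40), 1.3)]
print(meets_kdigo_creatinine(measurements, baseline=0.9))  # True: 0.4 rise in 40 h
```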
