91

Natural Language Processing, Statistical Inference, and American Foreign Policy

Lauretig, Adam M. 06 November 2019 (has links)
No description available.
92

Facilitating Corpus Annotation by Improving Annotation Aggregation

Felt, Paul L 01 December 2015 (has links) (PDF)
Annotated text corpora facilitate the linguistic investigation of language as well as the automation of natural language processing (NLP) tasks. NLP tasks include problems such as spam email detection, grammatical analysis, and identifying mentions of people, places, and events in text. However, constructing high-quality annotated corpora can be expensive. Cost can be reduced by employing low-cost internet workers in a practice known as crowdsourcing, but the resulting annotations are often inaccurate, decreasing the usefulness of a corpus. This inaccuracy is typically mitigated by collecting multiple redundant judgments and aggregating them (e.g., via majority vote) to produce high-quality consensus answers. We improve the quality of consensus labels inferred from imperfect annotations in a number of ways. We show that transfer learning can be used to derive benefit from outdated annotations that would typically be discarded. We show that, contrary to popular preference, annotation aggregation models that take a generative data modeling approach tend to outperform those that take a conditional approach. We leverage this insight to develop csLDA, a novel annotation aggregation model that improves on the state of the art for a variety of annotation tasks. When data does not permit generative data modeling, we identify a conditional data modeling approach based on vector-space text representations that achieves state-of-the-art results on several unusual semantic annotation tasks. Finally, we identify a family of models capable of aggregating annotation data containing heterogeneous annotation types such as label frequencies and labeled features. We present a multiannotator active learning algorithm for this model family that jointly selects an annotator, data items, and annotation type.
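To make the baseline concrete, here is a minimal sketch of majority-vote aggregation over redundant crowdsourced judgments, the approach the thesis improves upon; the items and labels are hypothetical.

```python
from collections import Counter

def aggregate_majority(judgments: dict[str, list[str]]) -> dict[str, str]:
    """Map each item to the label chosen by most annotators (ties broken arbitrarily)."""
    return {item: Counter(labels).most_common(1)[0][0]
            for item, labels in judgments.items()}

# Three redundant judgments per item; the consensus vote smooths out single errors.
judgments = {
    "email_1": ["spam", "spam", "not_spam"],
    "email_2": ["not_spam", "not_spam", "not_spam"],
}
print(aggregate_majority(judgments))  # {'email_1': 'spam', 'email_2': 'not_spam'}
```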
93

Sustainable Recipe Recommendation System: Evaluating the Performance of GPT Embeddings versus state-of-the-art systems

Bandaru, Jaya Shankar, Appili, Sai Keerthi January 2023 (has links)
Background: The demand for a sustainable lifestyle is increasing due to the need to tackle rapid climate change. One-third of carbon emissions come from the food industry, so reducing emissions from this industry is crucial when fighting climate change. One way to reduce carbon emissions from this industry is to help consumers adopt sustainable eating habits by consuming eco-friendly food. To help consumers find eco-friendly recipes, we developed a sustainable recipe recommendation system that can recommend relevant and eco-friendly recipes to consumers using little information about their previous food consumption. Objective: The main objective of this research is to identify (i) an appropriate recommendation algorithm suitable for a dataset with few training and testing examples, and (ii) a technique to re-order the recommendation list so that a proper balance is maintained between the relevance and the carbon rating of the recipes. Method: We conducted an experiment to test the performance of a GPT-embeddings-based recommendation system, Factorization Machines, and a version of a Graph Neural Network-based recommendation algorithm called PinSage across different training-set sizes, using the ROC AUC value as our metric. After finding the best-performing model, we experimented with different re-ordering techniques to find which technique provides the right balance between relevance and sustainability. Results: The results from the experiment show that PinSage and Factorization Machines predict on average whether an item is relevant with 75% probability, whereas the GPT-embedding-based recommendation system predicts with only 55% probability. We also found that the performance of PinSage and Factorization Machines improved as the training-set size increased. For re-ordering, we found that using a logarithmic combination of the relevance score and the carbon rating of the recipe helped to reduce the average carbon rating of recommendations with a marginal reduction in the ROC AUC score. Conclusion: The results show that the chosen state-of-the-art recommendation systems, PinSage and Factorization Machines, outperform GPT-embedding-based recommendation systems by almost 1.4 times.
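As an illustration of the re-ordering idea, the sketch below discounts each recipe's relevance score by a logarithm of its carbon rating; the exact combination, weights, and data used in the thesis may differ.

```python
import math

def rerank(recommendations, alpha=1.0):
    """Order recipes by relevance discounted by the log of their carbon rating.

    `recommendations` holds (recipe_id, relevance_score, carbon_rating) tuples;
    a lower carbon_rating means a more eco-friendly recipe.
    """
    def score(rec):
        _, relevance, carbon = rec
        # Logarithmic penalty: heavy emitters slide down the list without
        # letting the carbon rating completely dominate relevance.
        return relevance - alpha * math.log1p(carbon)
    return sorted(recommendations, key=score, reverse=True)

# Hypothetical candidates: the low-carbon recipe overtakes the slightly
# more relevant, high-carbon one.
candidates = [("lentil_stew", 0.82, 1.0), ("beef_burger", 0.90, 5.0)]
print(rerank(candidates))
```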
94

Automated Software Defect Localization

Ye, Xin 23 September 2016 (has links)
No description available.
95

Higher-order reasoning with graph data

Leonardo de Abreu Cotta (13170135) 29 July 2022 (has links)
Graphs are the natural framework of many of today's highest-impact computing applications: from online social networking, to Web search, to product recommendations, to chemistry, to bioinformatics, to knowledge bases, to mobile ad-hoc networking. To develop successful applications in these domains, we often need representation learning methods: models mapping nodes, edges, subgraphs, or entire graphs to some meaningful vector space. Such models are studied in the machine learning subfield of graph representation learning (GRL). Previous GRL research has focused on learning node or entire-graph representations through associational tasks. In this work I study higher-order (k>1-node) representations of graphs in the context of both associational and counterfactual tasks.
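One way to picture a higher-order (k-node) representation, sketched under strong simplifying assumptions: embed each node, then pool the embeddings of every k-node subset into a single permutation-invariant vector. The toy embeddings and mean pooling below are illustrative only, not the models studied in the thesis.

```python
import itertools
import numpy as np

node_embeddings = {  # hypothetical 2-d node embeddings
    "a": np.array([1.0, 0.0]),
    "b": np.array([0.0, 1.0]),
    "c": np.array([1.0, 1.0]),
}

def k_node_representations(nodes, k=2):
    """Mean-pool node embeddings over every k-node subset (order-invariant)."""
    return {subset: np.mean([node_embeddings[n] for n in subset], axis=0)
            for subset in itertools.combinations(sorted(nodes), k)}

print(k_node_representations(node_embeddings.keys()))
```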
96

News Analytics for Global Infectious Disease Surveillance

Ghosh, Saurav 29 November 2017 (has links)
Traditional disease surveillance can be augmented with a wide variety of open sources, such as online news media, Twitter, blogs, and web search records. Rapidly increasing volumes of these open sources are proving to be extremely valuable resources in helping analyze, detect, and forecast outbreaks of infectious diseases, especially new diseases or diseases spreading to new regions. However, these sources are in general unstructured (noisy), and the construction of surveillance tools ranging from real-time disease outbreak monitoring to the construction of epidemiological line lists involves considerable human supervision. Intelligent modeling of such sources using text mining methods such as topic models, deep learning, and dependency parsing can lead to automated generation of these surveillance tools. Moreover, the real-time global availability of these open sources from web-based bio-surveillance systems, such as HealthMap and WHO Disease Outbreak News (DONs), can aid in the development of generic tools applicable to a wide range of diseases (rare, endemic, and emerging) across different regions of the world. In this dissertation, we explore various methods of using internet news reports to develop generic surveillance tools which can supplement traditional surveillance systems and aid in early detection of outbreaks. We primarily investigate three major problems related to infectious disease surveillance. (i) Can trends in online news reporting monitor and possibly estimate infectious disease outbreaks? We introduce approaches that use temporal topic models over the HealthMap corpus for detecting rare and endemic disease topics as well as capturing temporal trends (seasonality, abrupt peaks) for each disease topic. The discovery of temporal topic trends is followed by time-series regression techniques to estimate future disease incidence. (ii) In the second problem, we seek to automate the creation of epidemiological line lists for emerging diseases from WHO DONs in a near real-time setting. For this purpose, we formulate Guided Epidemiological Line List (GELL), an approach that combines neural word embeddings with information extracted from dependency parse trees at the sentence level to extract line list features. (iii) Finally, for the third problem, we aim to characterize diseases automatically from the HealthMap corpus using a disease-specific word embedding model, subsequently evaluated against human-curated characterizations for accuracy.
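A minimal sketch of the forecasting step in problem (i): regress reported incidence on a news-topic's temporal trend and extrapolate. The weekly counts here are fabricated for illustration; the thesis's topic models and regression techniques are more involved.

```python
import numpy as np

topic_trend = np.array([3, 5, 9, 14, 22, 31], dtype=float)  # weekly topic intensity in news
incidence = np.array([2, 4, 8, 15, 20, 33], dtype=float)    # reported cases, same weeks

# Ordinary least squares: incidence ≈ a * topic_trend + b
a, b = np.polyfit(topic_trend, incidence, deg=1)
next_week_topic = 40.0
print(f"forecast cases: {a * next_week_topic + b:.1f}")
```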
97

Automating CPV Classification: A study of whether Large Language Models combined with word embeddings can solve CPV categorization of public procurements.

Andersson, Niklas, Andersson Sjöberg, Hanna January 2024 (has links)
This study explores the use of Large Language Models and word embeddings to automate the categorization of CPV codes in Swedish public procurements. Earlier studies have not achieved reliable categorization, but this experiment tests a new method combining the LLM models Mistral and Llama3 with FastText word embeddings. The results show that although the study's solution can correctly identify some CPV main groups, its overall performance is low: 12% of procurements were classified completely correctly and 35% partially correctly, with at least one CPV main group found. Improvements are needed in both correctness and accuracy. The study contributes to the research field by demonstrating the challenges of, and potential solutions for, automated categorization of public procurements. It also proposes future research using larger and more advanced models to address the identified challenges.
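A minimal sketch of embedding-based CPV matching in the spirit of the study: embed a procurement description and each CPV main-group description, then pick the nearest group. The FastText embedding is stubbed with a toy character-count vector so the example runs stand-alone, and the group descriptions are hypothetical.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy stand-in for a sentence embedding (FastText in the study)."""
    vec = np.zeros(26)
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    n = np.linalg.norm(vec)
    return vec / n if n else vec

cpv_groups = {  # hypothetical CPV main-group descriptions
    "45000000": "Construction work",
    "72000000": "IT services: consulting, software development",
}

def classify(description: str) -> str:
    """Return the CPV main group whose description is closest in embedding space."""
    sims = {code: float(embed(description) @ embed(text))
            for code, text in cpv_groups.items()}
    return max(sims, key=sims.get)

print(classify("Procurement of software consulting services"))  # likely 72000000
```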
98

Semantic Structuring Of Digital Documents: Knowledge Graph Generation And Evaluation

Luu, Erik E 01 June 2024 (has links) (PDF)
In the era of total digitization of documents, navigating vast and heterogeneous data landscapes presents significant challenges for effective information retrieval, both for humans and digital agents. Traditional methods of knowledge organization often struggle to keep pace with evolving user demands, resulting in suboptimal outcomes such as information overload and disorganized data. This thesis presents a case study on a pipeline that leverages principles from cognitive science, graph theory, and semantic computing to generate semantically organized knowledge graphs. By evaluating a combination of different models, methodologies, and algorithms, the pipeline aims to enhance the organization and retrieval of digital documents. The proposed approach focuses on representing documents as vector embeddings, clustering similar documents, and constructing a connected and scalable knowledge graph. This graph not only captures semantic relationships between documents but also ensures efficient traversal and exploration. The practical application of the system is demonstrated in the context of digital libraries and academic research, showcasing its potential to improve information management and discovery. The effectiveness of the pipeline is validated through extensive experiments using contemporary open-source tools.
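The shape of such a pipeline can be sketched in a few lines: embed documents, cluster the embeddings, and link documents whose vectors are similar enough. TF-IDF, KMeans, and the threshold below are stand-ins for the embedding models and algorithms the thesis actually evaluates.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "graph theory and network algorithms",
    "semantic knowledge graph construction for retrieval",
    "cooking pasta with tomato sauce",
]
X = TfidfVectorizer().fit_transform(docs)          # documents as vectors
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Add an edge between any two documents above a similarity threshold.
sim = cosine_similarity(X)
edges = [(i, j) for i in range(len(docs)) for j in range(i + 1, len(docs))
         if sim[i, j] > 0.1]
print(clusters, edges)  # the two graph-related documents end up linked
```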
99

Duplicate Detection and Text Classification on Simplified Technical English / Dublettdetektion och textklassificering på Förenklad Teknisk Engelska

Lund, Max January 2019 (has links)
This thesis investigates the most effective way of performing classification of text labels and clustering of duplicate texts in technical documentation written in Simplified Technical English. Pre-trained transformer language models (BERT) were tested against traditional methods such as tf-idf with cosine similarity (kNN) and SVMs on the classification task. For detecting duplicate texts, vector representations from pre-trained transformer and LSTM models were tested against tf-idf using the density-based clustering algorithms DBSCAN and HDBSCAN. The results show that traditional methods are comparable to pre-trained models for classification, and that using tf-idf vectors with a low distance threshold in DBSCAN is preferable for duplicate detection.
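A minimal sketch of the duplicate-detection setup the thesis found preferable: tf-idf vectors clustered by DBSCAN with a low cosine-distance threshold. The texts and the eps value are illustrative stand-ins for real technical documentation.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "Check the hydraulic pressure before takeoff.",
    "Before takeoff, check the hydraulic pressure.",
    "Replace the filter element every 500 hours.",
]
X = TfidfVectorizer().fit_transform(texts)

# eps is the cosine-distance threshold: near-identical texts share a
# cluster; singletons are labeled -1 (noise).
labels = DBSCAN(eps=0.2, min_samples=2, metric="cosine").fit_predict(X)
print(labels)  # e.g. [0, 0, -1]: the first two sentences are duplicates
```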
100

The use of linguistic resources to measure semantic similarity between short sentences through a hybrid approach

Silva, Allan de Barcelos 14 December 2017 (has links)
In Natural Language Processing, assessing textual semantic similarity is an important building block for many applications, such as information retrieval, text classification, document clustering, translation, duplicate detection, and dialogue systems. The literature describes applications and techniques aimed largely at the English language, with a priority on probabilistic resources, while linguistic aspects are used only incipiently. Work in the area highlights that linguistics plays a fundamental role in evaluating textual semantic similarity, precisely because it extends the potential of exclusively probabilistic methods and avoids some of their failures (e.g., distinguishing distantly from closely related sentences, resolving anaphora), which largely result from a lack of deeper treatment of aspects of the language. This issue is amplified for short sentences, the main field of application of textual semantic similarity techniques, because such sentences carry a reduced amount of information, limiting the effectiveness of purely probabilistic treatment. It is therefore vital to identify and apply resources from a deeper study of the language to better understand what makes two or more sentences similar. This work presents an approach for evaluating textual semantic similarity between short sentences in Brazilian Portuguese. Its main contribution is a hybrid approach in which distributed representations as well as lexical and linguistic aspects are used. To consolidate the study, a methodology was defined that allows the analysis of several combinations of resources, making it possible to evaluate the gains introduced by broader linguistic features and by combining them with the knowledge generated by other techniques. The proposed approach was evaluated against well-known datasets from the literature (the PROPOR 2016 shared task) and obtained good results.
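A minimal sketch of a hybrid score in the spirit described above: a lexical Jaccard overlap blended with a distributed (embedding) similarity. The blend weight and the toy character-level embedding are illustrative assumptions, not the thesis's actual features.

```python
import numpy as np

def embed(sentence: str) -> np.ndarray:
    """Toy stand-in for a distributed sentence representation."""
    vec = np.zeros(26)
    for ch in sentence.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    n = np.linalg.norm(vec)
    return vec / n if n else vec

def hybrid_similarity(s1: str, s2: str, w_lex: float = 0.5) -> float:
    """Blend lexical overlap with embedding cosine similarity."""
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    lexical = len(t1 & t2) / len(t1 | t2)        # Jaccard overlap
    distributed = float(embed(s1) @ embed(s2))   # cosine of unit vectors
    return w_lex * lexical + (1 - w_lex) * distributed

print(hybrid_similarity("o gato dorme no sofá", "o gato descansa no sofá"))
```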
