91. Facilitating Corpus Annotation by Improving Annotation Aggregation
Felt, Paul L, 01 December 2015
Annotated text corpora facilitate the linguistic investigation of language as well as the automation of natural language processing (NLP) tasks. NLP tasks include problems such as spam email detection, grammatical analysis, and identifying mentions of people, places, and events in text. However, constructing high-quality annotated corpora can be expensive. Cost can be reduced by employing low-cost internet workers in a practice known as crowdsourcing, but the resulting annotations are often inaccurate, decreasing the usefulness of a corpus. This inaccuracy is typically mitigated by collecting multiple redundant judgments and aggregating them (e.g., via majority vote) to produce high-quality consensus answers. We improve the quality of consensus labels inferred from imperfect annotations in a number of ways. We show that transfer learning can be used to derive benefit from outdated annotations that would typically be discarded. We show that, contrary to popular preference, annotation aggregation models that take a generative data modeling approach tend to outperform those that take a conditional approach. We leverage this insight to develop csLDA, a novel annotation aggregation model that improves on the state of the art for a variety of annotation tasks. When data does not permit generative data modeling, we identify a conditional data modeling approach based on vector-space text representations that achieves state-of-the-art results on several unusual semantic annotation tasks. Finally, we identify a family of models capable of aggregating annotation data containing heterogeneous annotation types such as label frequencies and labeled features. We present a multiannotator active learning algorithm for this model family that jointly selects an annotator, data items, and annotation type.
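The majority-vote baseline mentioned above is easy to make concrete. A minimal sketch follows; the item names, labels, and arbitrary tie-breaking are illustrative assumptions, not details from the thesis:

```python
from collections import Counter

def majority_vote(annotations):
    """Aggregate redundant crowd judgments per item by majority vote.

    annotations: dict mapping item id -> list of labels from different workers.
    Returns a dict mapping item id -> consensus label (ties broken arbitrarily).
    """
    return {item: Counter(labels).most_common(1)[0][0]
            for item, labels in annotations.items()}

# Example: three workers label two email items as spam/ham.
votes = {"email_1": ["spam", "spam", "ham"], "email_2": ["ham", "ham", "ham"]}
print(majority_vote(votes))  # {'email_1': 'spam', 'email_2': 'ham'}
```

Aggregation models such as csLDA improve on this rule by modeling the annotation data generatively rather than trusting every vote equally.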
92. Sustainable Recipe Recommendation System: Evaluating the Performance of GPT Embeddings versus state-of-the-art systems
Bandaru, Jaya Shankar; Appili, Sai Keerthi, January 2023
Background: The demand for a sustainable lifestyle is increasing due to the need to tackle rapid climate change. One-third of carbon emissions come from the food industry; reducing emissions from this industry is crucial when fighting climate change. One way to reduce carbon emissions from this industry is to help consumers adopt sustainable eating habits by consuming eco-friendly food. To help consumers find eco-friendly recipes, we developed a sustainable recipe recommendation system that can recommend relevant and eco-friendly recipes to consumers using little information about their previous food consumption.

Objective: The main objective of this research is to identify (i) an appropriate recommendation algorithm for a dataset with few training and testing examples, and (ii) a technique to re-order the recommendation list so that a proper balance is maintained between relevance and the carbon rating of the recipes.

Method: We conducted an experiment to test the performance of a GPT-embeddings-based recommendation system, Factorization Machines, and a version of a Graph Neural Network-based recommendation algorithm called PinSage across different numbers of training examples, using ROC AUC as our metric. After finding the best-performing model, we experimented with different re-ordering techniques to find which provides the right balance between relevance and sustainability.

Results: The results show that PinSage and Factorization Machines predict whether an item is relevant with 75% probability on average, whereas the GPT-embedding-based recommendation system predicts with only 55% probability. We also found that the performance of PinSage and Factorization Machines improved as the training set size increased. For re-ordering, we found that a logarithmic combination of the relevance score and the carbon rating of the recipe reduced the average carbon rating of recommendations with only a marginal reduction in ROC AUC.

Conclusion: The results show that the chosen state-of-the-art recommendation systems, PinSage and Factorization Machines, outperform the GPT-embedding-based recommendation system by a factor of almost 1.4.
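The re-ordering step in the results can be sketched as follows; the exact combination formula and the weight are assumptions, since the abstract only says a logarithmic combination of relevance score and carbon rating was used:

```python
import math

def rerank(recommendations, weight=1.0):
    """Re-order recommendations to balance relevance against carbon impact.

    recommendations: list of (recipe_id, relevance_score, carbon_rating) tuples,
    where a lower carbon_rating means a more eco-friendly recipe.
    The combined score below is one plausible "logarithmic combination";
    the thesis does not specify the exact formula.
    """
    def combined(rec):
        _, relevance, carbon = rec
        return relevance - weight * math.log1p(carbon)  # penalize high-carbon recipes
    return sorted(recommendations, key=combined, reverse=True)

# Hypothetical recipes: (id, relevance, carbon rating).
recipes = [("lentil_soup", 0.72, 1.2), ("beef_stew", 0.80, 9.5), ("veggie_bowl", 0.75, 0.8)]
for recipe_id, rel, carbon in rerank(recipes):
    print(recipe_id, rel, carbon)  # low-carbon recipes rise despite slightly lower relevance
```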
93. Automated Software Defect Localization
Ye, Xin, 23 September 2016
No description available.
94. Higher-order reasoning with graph data
Leonardo de Abreu Cotta, 29 July 2022
Graphs are the natural framework of many of today's highest-impact computing applications: from online social networking, to Web search, to product recommendations, to chemistry, to bioinformatics, to knowledge bases, to mobile ad-hoc networking. To develop successful applications in these domains, we often need representation learning methods: models mapping nodes, edges, subgraphs, or entire graphs to some meaningful vector space. Such models are studied in the machine learning subfield of graph representation learning (GRL). Previous GRL research has focused on learning node or entire-graph representations through associational tasks. In this work I study higher-order (k>1-node) representations of graphs in the context of both associational and counterfactual tasks.
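As a toy illustration of a higher-order (k=2) representation built from node embeddings, the sketch below combines two node vectors with a Hadamard product, one common operator for edge-level tasks; this is a generic GRL idiom, not the specific method studied in the thesis:

```python
import numpy as np

# Toy node embeddings (in practice learned by a GRL model such as node2vec or a GNN).
node_emb = {"alice": np.array([0.9, 0.1]), "bob": np.array([0.8, 0.2]),
            "carol": np.array([0.1, 0.9])}

def pair_representation(u, v):
    """Build a k=2 (edge-level) representation from two node embeddings.

    The element-wise (Hadamard) product is one of several standard operators
    (average, L1, L2) used to lift node embeddings to node-pair tasks.
    """
    return node_emb[u] * node_emb[v]

print(pair_representation("alice", "bob"))    # similar nodes -> large entries
print(pair_representation("alice", "carol"))  # dissimilar nodes -> small entries
```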
95. News Analytics for Global Infectious Disease Surveillance
Ghosh, Saurav, 29 November 2017
Traditional disease surveillance can be augmented with a wide variety of open sources, such as online news media, Twitter, blogs, and web search records. Rapidly increasing volumes of these open sources are proving to be extremely valuable resources for analyzing, detecting, and forecasting outbreaks of infectious diseases, especially new diseases or diseases spreading to new regions. However, these sources are generally unstructured (noisy), and constructing surveillance tools, from real-time disease outbreak monitoring to epidemiological line lists, involves considerable human supervision. Intelligent modeling of such sources using text mining methods such as topic models, deep learning, and dependency parsing can lead to automated generation of these surveillance tools. Moreover, the real-time global availability of these open sources from web-based bio-surveillance systems, such as HealthMap and WHO Disease Outbreak News (DONs), can aid in the development of generic tools applicable to a wide range of diseases (rare, endemic, and emerging) across different regions of the world.
In this dissertation, we explore various methods of using internet news reports to develop generic surveillance tools that can supplement traditional surveillance systems and aid in early detection of outbreaks. We primarily investigate three problems related to infectious disease surveillance. (i) Can trends in online news reporting monitor and possibly estimate infectious disease outbreaks? We introduce approaches that use temporal topic models over the HealthMap corpus to detect rare and endemic disease topics and to capture temporal trends (seasonality, abrupt peaks) for each disease topic. The discovery of temporal topic trends is followed by time-series regression techniques to estimate future disease incidence. (ii) In the second problem, we seek to automate the creation of epidemiological line lists for emerging diseases from WHO DONs in a near real-time setting. For this purpose, we formulate Guided Epidemiological Line List (GELL), an approach that combines neural word embeddings with information extracted from dependency parse trees at the sentence level to extract line list features. (iii) Finally, for the third problem, we aim to characterize diseases automatically from the HealthMap corpus using a disease-specific word embedding model, subsequently evaluated against human-curated characterizations for accuracy.

Ph.D.

Infectious disease outbreaks are a threat to the public health and economic stability of many countries. Traditional disease surveillance data released by organizations such as the CDC and ProMED are delayed and therefore not reliable for real-time monitoring of infectious disease outbreaks. Recently, open source indicators, such as online news sources and social media (Twitter), have been shown to be effective for monitoring infectious disease outbreaks in real time owing to their volume, ease of availability, and citizen participation. This dissertation focuses on developing data analytic tools that perform automated analysis of online disease-related news articles with the aim of characterizing infectious diseases and monitoring their spatial and temporal progression in real time. We show that temporal trends extracted from online news articles can capture the dynamics of multiple disease outbreaks, such as the whooping cough outbreak in the U.S. during the summer of 2012, periodic outbreaks of H7N9 in China during 2013-2014, and the emerging MERS outbreak in Saudi Arabia. However, online news reporting during infectious disease outbreaks is driven by interest, so news coverage for certain diseases can be inconsistent over time, leading to erroneous surveillance.
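The first thread (temporal topic trends followed by time-series regression) can be sketched in a few lines; all numbers below are hypothetical, and the dissertation's actual models are temporal topic models rather than the plain lag-1 regression shown:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical weekly prevalence of one disease topic extracted from news
# (e.g., by a temporal topic model over a HealthMap-style corpus), plus
# observed case counts for the same weeks.
topic_prevalence = np.array([0.01, 0.02, 0.05, 0.09, 0.12, 0.10, 0.06, 0.03])
case_counts      = np.array([  12,   18,   40,   85,  120,  105,   60,   25])

# Regress current incidence on the previous week's topic prevalence (lag of 1).
X = topic_prevalence[:-1].reshape(-1, 1)
y = case_counts[1:]
model = LinearRegression().fit(X, y)

# Estimate next week's incidence from this week's news signal.
print(model.predict(np.array([[0.04]])))
```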
96. Semiparametric Modeling and Analysis for Time-varying Network Data
Sun, Jiajin, January 2024
Network data, capturing the connections or interactions among subjects of interest, are widely used across numerous scientific disciplines. Recent years have seen a significant increase in time-varying network data, which record not only the number of interactions but also the precise timestamps when these events occur. These data call for novel analytical developments that specifically leverage the event time information.
In this thesis, we propose frameworks for analyzing longitudinal/panel network data and continuous time network data. For the analysis of longitudinal network data, we introduce a semiparametric latent space model. The model consists of a static latent space component and a time-varying node-specific baseline component. We develop a semiparametric efficient score equation for the latent space parameter. Estimation is accomplished through a one-step update estimator and a suitably penalized maximum likelihood estimator. We derive oracle error bounds for both estimators and address identifiability concerns from a quotient manifold perspective.
For analyzing continuous time network data, we introduce a Cox-type counting process latent space model. To accommodate the event history observations, each edge is modeled as a counting process, with intensity comprising three components: a time-dependent baseline function, an individual-level degree heterogeneity parameter, and a low-rank embedding for the interaction effects. A nuclear-norm penalized likelihood estimator is developed, and its oracle error bounds are established. Additionally, we discuss several ongoing directions for this work.
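The three-component intensity described above admits a standard Cox-type parameterization; the notation below is an assumption for illustration, not the thesis's exact formulation:

```latex
% Counting-process intensity for the edge between nodes i and j (assumed notation):
\[
  \lambda_{ij}(t) = \lambda_0(t)\,
    \exp\!\left(\alpha_i + \alpha_j + \mathbf{z}_i^{\top}\mathbf{z}_j\right)
\]
% Here $\lambda_0(t)$ is the time-dependent baseline, $\alpha_i$ and $\alpha_j$
% capture individual-level degree heterogeneity, and the inner product of the
% low-dimensional embeddings $\mathbf{z}_i, \mathbf{z}_j \in \mathbb{R}^k$ gives
% the low-rank interaction effect whose matrix $Z Z^{\top}$ a nuclear-norm
% penalty would target.
```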
97. Automation of CPV Classification: A Study of Whether Large Language Models Combined with Word Embeddings Can Solve CPV Categorization of Public Procurements
Andersson, Niklas; Andersson Sjöberg, Hanna, January 2024
This study explores the use of Large Language Models and word embeddings to automate the categorization of CPV codes in Swedish public procurements. Earlier studies have not achieved reliable categorization, but this experiment tests a new method combining the LLM models Mistral and Llama3 with FastText word embeddings. The results show that although the solution can correctly identify some CPV main groups, its overall performance is low: 12% of procurements were classified entirely correctly, and 35% were classified partially correctly, with at least one CPV main group found correctly. Improvements are needed in both correctness and accuracy. The study contributes to the research field by demonstrating the challenges of, and potential solutions for, automated categorization of public procurements. It also proposes future research using larger and more advanced models to address the identified challenges.
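A nearest-neighbor view of the embedding component can be sketched as below; the embed stub, the CPV group descriptions, and the similarity-only decision are assumptions, and the study's actual pipeline also involves the LLMs Mistral and Llama3:

```python
import numpy as np

def embed(text):
    """Placeholder for a FastText sentence embedding (e.g., averaged word vectors).

    A real implementation would load pretrained FastText vectors; this stub
    returns a text-seeded random vector just so the sketch runs end to end.
    """
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=300)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical CPV main groups and a procurement notice to classify.
cpv_groups = {"45": "Construction work", "72": "IT services", "33": "Medical equipment"}
notice = "Procurement of software development and system maintenance services"

group_scores = {code: cosine(embed(notice), embed(desc)) for code, desc in cpv_groups.items()}
print(max(group_scores, key=group_scores.get))  # nearest CPV main group by embedding similarity
```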
98. Semantic Structuring Of Digital Documents: Knowledge Graph Generation And Evaluation
Luu, Erik E, 01 June 2024
In the era of total digitization of documents, navigating vast and heterogeneous data landscapes presents significant challenges for effective information retrieval, both for humans and digital agents. Traditional methods of knowledge organization often struggle to keep pace with evolving user demands, resulting in suboptimal outcomes such as information overload and disorganized data. This thesis presents a case study on a pipeline that leverages principles from cognitive science, graph theory, and semantic computing to generate semantically organized knowledge graphs. By evaluating a combination of different models, methodologies, and algorithms, the pipeline aims to enhance the organization and retrieval of digital documents. The proposed approach focuses on representing documents as vector embeddings, clustering similar documents, and constructing a connected and scalable knowledge graph. This graph not only captures semantic relationships between documents but also ensures efficient traversal and exploration. The practical application of the system is demonstrated in the context of digital libraries and academic research, showcasing its potential to improve information management and discovery. The effectiveness of the pipeline is validated through extensive experiments using contemporary open-source tools.
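A minimal version of the embed-cluster-connect pipeline might look like this; TF-IDF stands in for the document embedding models evaluated in the thesis, and the similarity threshold is an assumed parameter:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx

# Toy corpus; the thesis evaluates richer embedding models than TF-IDF.
docs = ["graph theory and network analysis", "semantic search over digital libraries",
        "spectral methods for graphs", "information retrieval in academic archives"]

vectors = TfidfVectorizer().fit_transform(docs).toarray()
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

# Connect documents whose embeddings are similar enough; keep cluster labels as node metadata.
graph = nx.Graph()
sim = cosine_similarity(vectors)
for i in range(len(docs)):
    graph.add_node(i, cluster=int(labels[i]))
    for j in range(i + 1, len(docs)):
        if sim[i, j] > 0.15:  # assumed similarity threshold
            graph.add_edge(i, j, weight=float(sim[i, j]))

print(graph.edges(data=True))  # semantic edges available for traversal and exploration
```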
99. Methods, algorithms and impossibility results for machine learning on graphs
Sotiropoulos, Konstantinos, 03 February 2025
In recent years, there has been a remarkable increase in the use of machine learning techniques for analyzing graphs and their associated applications, such as node classification, link prediction, community detection, and generating new graph instances with desired characteristics. This motivates the desire to create innovative and effective algorithms, as well as to explore the potential and constraints of modern deep learning techniques, which have garnered considerable attention. This dissertation makes contributions in both of these areas. First, we propose innovative and scalable methods that rely solely on local node information for both unsupervised and supervised graph learning tasks. Specifically, we emphasize the significance of local triangle counts in community detection and introduce a novel triangle-aware spectral sparsification algorithm that enhances the efficiency of this task. Second, we analyze a Twitter dataset and create a supervised learning framework that leverages the multiple layers of interaction among Twitter users, resulting in more precise prediction of new links among them. The emergence of deep learning has sparked interest in unsupervised node embeddings, low-dimensional vector representations of nodes, which have become the primary tool in many graph-based machine learning tasks. A fundamental question arises: can real-world networks be accurately represented in a low-dimensional space? We contribute to the understanding of node embeddings in two significant ways. First, we prove that any graph with bounded maximum degree can be embedded in low dimensions, and we offer an algorithm that accurately embeds real-world networks in a few dimensions, typically on the order of tens. Second, we explore contemporary embedding techniques and find that their embeddings are not always precise, as different graphs can have similar low-dimensional representations. However, despite this lack of exactness, these methods successfully encode sufficient information for high performance on node classification tasks. Finally, we study graph generative models under a novel criterion: their ability to generate graphs that are simultaneously edge-diverse and rich in small dense subgraphs. We show the limitations of edge-independent graph generative models and develop a hierarchy of models that are progressively better at mimicking real-world networks. We complement our analysis with simple baseline methods relying on dense subgraph detection that perform competitively against more complex methods.
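The local triangle counts highlighted in the first contribution use only each node's immediate neighborhood, as in this sketch (the karate-club graph is a stand-in benchmark, not a dataset from the dissertation):

```python
import networkx as nx

def local_triangle_counts(graph):
    """Count triangles incident to each node using only local neighborhood info.

    A node's triangle count equals the number of edges among its neighbors;
    this is the kind of local signal the dissertation ties to community structure.
    """
    counts = {}
    for node in graph:
        neighbors = set(graph[node])
        # Each triangle edge (u, v) among the neighbors is seen twice, so halve it.
        counts[node] = sum(1 for u in neighbors for v in graph[u] if v in neighbors) // 2
    return counts

g = nx.karate_club_graph()  # classic two-community benchmark
print(local_triangle_counts(g)[0])  # triangles around node 0
print(nx.triangles(g, 0))           # sanity check against networkx's built-in count
```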
100. Determining Event Outcomes from Social Media
Murugan, Srikala, 05 1900
An event is something that happens at a time and location. Events include major life events such as graduating college or getting married, and also simple day-to-day activities such as commuting to work or eating lunch. Most work on event extraction detects events and the entities involved in events. For example, cooking events will usually involve a cook, some utensils and appliances, and a final product. In this work, we target the task of determining whether events result in their expected outcomes. Specifically, we target cooking and baking events, and characterize event outcomes into two categories. First, we distinguish whether something edible resulted from the event. Second, if something edible resulted, we distinguish between perfect, partial and alternative outcomes. The main contributions of this thesis are a corpus of 4,000 tweets annotated with event outcome information and experimental results showing that the task can be automated. The corpus includes tweets that have only text as well as tweets that have text and an image.
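The two-stage outcome decision described here could be automated along these lines; the mini-corpus, labels, and TF-IDF/logistic-regression models are illustrative assumptions rather than the thesis's actual system:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical mini-corpus; the real corpus has 4,000 annotated tweets (text, some with images).
tweets = [
    "my sourdough came out perfect!",
    "burned the cookies to a crisp, straight to the trash",
    "ran out of eggs so I made pancakes instead",
    "cake collapsed but still tasted great",
]
edible = [1, 0, 1, 1]  # stage 1: did something edible result?
outcomes = ["perfect", None, "alternative", "partial"]  # stage 2: set only when edible

# Stage 1: binary edible / not-edible classifier.
stage1 = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(tweets, edible)

# Stage 2: perfect / partial / alternative, trained only on the edible examples.
edible_tweets = [t for t, e in zip(tweets, edible) if e]
edible_outcomes = [o for o in outcomes if o is not None]
stage2 = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(edible_tweets, edible_outcomes)

new_tweet = ["substituted applesauce for butter and it still worked"]
if stage1.predict(new_tweet)[0]:
    print(stage2.predict(new_tweet)[0])
```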