11 |
Analysis of textual user reviews of a selected group of products / Analýza textových používateľských hodnotení vybranej skupiny produktov
Valovič, Roman January 2019
This work focuses on the design of a system that identifies frequently discussed product features in product reviews, summarizes them, and presents them to the user in terms of sentiment. It deals with natural language processing, with a specific focus on the Czech language. The reader is introduced to text preprocessing methods and their impact on the quality of the analysis results. The most frequently discussed product features are identified by cluster analysis using the K-Means algorithm, on the assumption that sufficiently internally homogeneous clusters represent individual product features. A new area explored in this work is the representation of documents using word embeddings, and the potential of the resulting vector space as input for machine learning algorithms.
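The clustering step described in this abstract can be illustrated with a minimal sketch. It assumes that each review sentence is represented by the average of its word vectors and that clusters are found with plain Lloyd-style K-Means; the two-dimensional embedding values, the example sentences, and the choice of seeding centroids with the first k points are all invented for illustration and are not taken from the thesis.

```python
import math

# Toy 2-D "word embeddings"; the values are invented for illustration only.
EMBEDDINGS = {
    "battery": (0.9, 0.1),
    "charge": (0.8, 0.2),
    "power": (0.85, 0.15),
    "screen": (0.1, 0.9),
    "display": (0.2, 0.8),
    "pixels": (0.15, 0.85),
}


def sentence_vector(tokens):
    """Average the embeddings of the known tokens in a sentence."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        raise ValueError("no known tokens in sentence")
    dims = len(vectors[0])
    return tuple(sum(v[d] for v in vectors) / len(vectors) for d in range(dims))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


# Hypothetical tokenized review sentences.
reviews = [
    ["battery", "charge"],
    ["screen", "display"],
    ["power", "battery"],
    ["pixels", "screen"],
]
labels = kmeans([sentence_vector(r) for r in reviews], k=2)
```

Sentences that share a label are candidates for describing the same product feature. A real system would use trained embeddings and a more careful initialization than the first-k seeding used here.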
|
12 |
Explorations in Word Embeddings: graph-based word embedding learning and cross-lingual contextual word embedding learning / Explorations de plongements lexicaux : apprentissage de plongements à base de graphes et apprentissage de plongements contextuels multilingues
Zhang, Zheng 18 October 2019
Word embeddings are a standard component of modern natural language processing architectures. Every time there is a breakthrough in word embedding learning, the vast majority of natural language processing tasks, such as POS-tagging, named entity recognition (NER), question answering, and natural language inference, can benefit from it.
This work addresses the question of how to improve the quality of monolingual word embeddings learned by prediction-based models and how to map contextual word embeddings generated by pretrained language representation models like ELMo or BERT across different languages. For monolingual word embedding learning, I take into account global, corpus-level information and generate a different noise distribution for negative sampling in word2vec. For this purpose I pre-compute word co-occurrence statistics with corpus2graph, an open-source NLP-application-oriented Python package that I developed: it efficiently generates a word co-occurrence network from a large corpus and applies network algorithms such as random walks to it. For cross-lingual contextual word embedding mapping, I link contextual word embeddings to word sense embeddings. The improved anchor generation algorithm that I propose also expands the scope of word embedding mapping algorithms from context-independent to contextual word embeddings.
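As background for the negative-sampling discussion above, here is a small sketch of the two ingredients the abstract mentions: windowed word co-occurrence counts (the edge weights of a co-occurrence network) and the baseline word2vec noise distribution, which is commonly the unigram frequency raised to the power 0.75 and normalized. The function names and the toy corpus are hypothetical; this does not use corpus2graph, and it does not attempt to reproduce the graph-based noise distribution proposed in the thesis, which the abstract does not specify.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count unordered word pairs that appear within `window` tokens of each other."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for other in tokens[i + 1 : i + 1 + window]:
                counts[tuple(sorted((word, other)))] += 1
    return counts


def unigram_noise(sentences, power=0.75):
    """Baseline noise distribution: unigram counts raised to `power`, then normalized."""
    freq = Counter(w for tokens in sentences for w in tokens)
    weights = {w: c ** power for w, c in freq.items()}
    total = sum(weights.values())
    return {w: weight / total for w, weight in weights.items()}


# Hypothetical toy corpus.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
edges = cooccurrence_counts(corpus)
noise = unigram_noise(corpus)
```

The exponent below 1 flattens the distribution, so frequent words are drawn as negatives somewhat less often than their raw frequency would suggest. A corpus-level alternative would replace `unigram_noise` with weights derived from `edges`.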
|
13 |
Biomedical concept association and clustering using word embeddings
Shah, Setu 12 1900
Indiana University-Purdue University Indianapolis (IUPUI) / Biomedical data exists in the form of journal articles, research studies, electronic health records, care guidelines, etc. While text mining and natural language processing tools have been widely employed across various domains, these are just taking off in the healthcare space.
A primary hurdle in building artificial intelligence models that use biomedical data is the limited amount of labelled data available. Since most models rely on supervised or semi-supervised methods, generating large amounts of pre-processed labelled data that can be used for training purposes becomes extremely costly. Even for datasets that are labelled, the lack of normalization of biomedical concepts further affects the quality of results produced and limits the application to a restricted dataset. This affects reproducibility of the results and techniques across datasets, making it difficult to deploy research solutions to improve healthcare services.
The research presented in this thesis focuses on reducing the need to create labels for biomedical text mining by using unsupervised recurrent neural networks. The proposed method utilizes word embeddings to generate vector representations of biomedical concepts based on semantics and context. Experiments with unsupervised clustering of these biomedical concepts show that concepts that are similar to each other are clustered together. While this clustering captures different synonyms of the same concept, it also captures the similarities between various diseases and their associated symptoms.
To test the performance of the concept vectors on corpora of documents, a document vector generation method that utilizes these concept vectors is also proposed. The document vectors thus generated are used as input to clustering algorithms, and the results show that, across multiple corpora, the proposed methods of concept and document vector generation outperform the baselines and provide more meaningful clustering. The applications of this document clustering are broad, especially in search and retrieval, where it can give clinicians, researchers and patients more holistic and comprehensive results than matching only the exact term they search for.
Finally, a framework is presented for extracting clinical information from preventive care guidelines that can be mapped to electronic health records. The extracted information can be integrated with the clinical decision support system of an electronic health record. A visualization tool to better understand and observe patient trajectories is also explored. Both methods have the potential to improve the preventive care services provided to patients.
|
14 |
Determining Event Outcomes from Social Media
Murugan, Srikala 05 1900
An event is something that happens at a time and location. Events include major life events such as graduating college or getting married, and also simple day-to-day activities such as commuting to work or eating lunch. Most work on event extraction detects events and the entities involved in events. For example, cooking events will usually involve a cook, some utensils and appliances, and a final product. In this work, we target the task of determining whether events result in their expected outcomes. Specifically, we target cooking and baking events, and characterize event outcomes into two categories. First, we distinguish whether something edible resulted from the event. Second, if something edible resulted, we distinguish between perfect, partial and alternative outcomes. The main contributions of this thesis are a corpus of 4,000 tweets annotated with event outcome information and experimental results showing that the task can be automated. The corpus includes tweets that have only text as well as tweets that have text and an image.
|
15 |
Natural Language Processing, Statistical Inference, and American Foreign Policy
Lauretig, Adam M. 06 November 2019
No description available.
|
16 |
Automated Software Defect Localization
Ye, Xin 23 September 2016
No description available.
|
17 |
News Analytics for Global Infectious Disease Surveillance
Ghosh, Saurav 29 November 2017
Traditional disease surveillance can be augmented with a wide variety of open sources, such as online news media, Twitter, blogs, and web search records. Rapidly increasing volumes of these open sources are proving to be extremely valuable resources in helping analyze, detect, and forecast outbreaks of infectious diseases, especially new diseases or diseases spreading to new regions. However, these sources are in general unstructured (noisy), and the construction of surveillance tools, ranging from real-time disease outbreak monitoring to epidemiological line lists, involves considerable human supervision. Intelligent modeling of such sources using text mining methods such as topic models, deep learning and dependency parsing can lead to automated generation of these surveillance tools. Moreover, the real-time global availability of these open sources from web-based bio-surveillance systems, such as HealthMap and WHO Disease Outbreak News (DONs), can aid in the development of generic tools applicable to a wide range of diseases (rare, endemic and emerging) across different regions of the world.
In this dissertation, we explore various methods of using internet news reports to develop generic surveillance tools which can supplement traditional surveillance systems and aid in early detection of outbreaks. We primarily investigate three major problems related to infectious disease surveillance. (i) Can trends in online news reporting monitor and possibly estimate infectious disease outbreaks? We introduce approaches that use temporal topic models over the HealthMap corpus to detect rare and endemic disease topics and to capture temporal trends (seasonality, abrupt peaks) for each disease topic. The discovery of temporal topic trends is followed by time-series regression techniques to estimate future disease incidence. (ii) In the second problem, we seek to automate the creation of epidemiological line lists for emerging diseases from WHO DONs in a near real-time setting. For this purpose, we formulate Guided Epidemiological Line List (GELL), an approach that combines neural word embeddings with information extracted from dependency parse trees at the sentence level to extract line list features. (iii) Finally, for the third problem, we aim to characterize diseases automatically from the HealthMap corpus using disease-specific word embedding models, whose characterizations were subsequently evaluated for accuracy against human-curated ones. / Ph. D. / Infectious disease outbreaks are a threat to the public health and economic stability of many countries. Traditional disease surveillance data released by organizations such as CDC and ProMED is delayed and therefore not reliable for real-time monitoring of infectious disease outbreaks. Recently, open source indicators, such as online news sources and social media sources (Twitter), have been shown to be effective in monitoring infectious disease outbreaks in real time due to their volume, ease of availability and citizen participation.
This dissertation focuses on developing multiple data analytic tools which perform automated analysis of online disease-related news articles with the aim of characterizing infectious diseases and monitoring their spatial and temporal progression in real time. We show that temporal trends extracted from online news articles can be used to capture the dynamics of multiple disease outbreaks, such as the whooping cough outbreak in the U.S. during the summer of 2012, periodic outbreaks of H7N9 in China during 2013-2014 and the emerging MERS outbreak in Saudi Arabia. However, online news reporting during infectious disease outbreaks is driven by interest, and therefore news coverage for certain diseases can be inconsistent over time, leading to erroneous surveillance.
|
18 |
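Automation of CPV classification: A study of whether Large Language Models in combination with word embeddings can solve CPV categorization of public procurements / Automatisering av CPV-klassificering : En studie om Large Language Models i kombination med word embeddings kan lösa CPV-kategorisering av offentliga upphandlingar.
Andersson, Niklas, Andersson Sjöberg, Hanna January 2024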
This study explores the use of Large Language Models and word embeddings to automate the categorization of CPV codes in Swedish public procurement. Previous studies have not achieved reliable categorization, but this experiment tests a new method that involves the LLMs Mistral and Llama3 together with FastText word embeddings. The results show that although the study's solution can correctly identify some CPV main groups, its overall performance is low: 12% of procurements are classified entirely correctly, and 35% are classified partially correctly, with at least one correct CPV main group found. Improvements are needed in both correctness and precision. The study contributes to the research field by demonstrating the challenges and potential solutions that exist for automated categorization of public procurements. It also proposes future research involving larger and more advanced models to address the identified challenges.
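The two scores reported in this abstract (fully correct classification, and partially correct classification with at least one correct CPV main group) can be computed as in the following sketch. It assumes each procurement carries a set of main-group labels, and that "partially correct" means the predicted and reference sets share at least one label, so fully correct cases also count as partially correct; that reading, the function name, and the example label strings are assumptions for illustration, not details taken from the study.

```python
def match_rates(predicted, gold):
    """Return (fully_correct_rate, partially_correct_rate) over paired label sets."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("need two non-empty lists of equal length")
    # Fully correct: the predicted set equals the reference set.
    full = sum(set(p) == set(g) for p, g in zip(predicted, gold))
    # Partially correct: the two sets share at least one label.
    partial = sum(bool(set(p) & set(g)) for p, g in zip(predicted, gold))
    return full / len(gold), partial / len(gold)


# Hypothetical main-group labels for four procurements.
gold = [{"45"}, {"30", "48"}, {"72"}, {"15"}]
predicted = [{"45"}, {"30"}, {"79"}, {"33", "15"}]
full_rate, partial_rate = match_rates(predicted, gold)
```

In this toy data only the first procurement matches exactly, while three of the four share at least one label with the reference.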
|
19 |
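The use of linguistic resources to measure semantic similarity between short sentences through a hybrid approach / O uso de recursos linguísticos para mensurar a semelhança semântica entre frases curtas através de uma abordagem híbrida
Silva, Allan de Barcelos 14 December 2017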
In Natural Language Processing (NLP), assessing Semantic Textual Similarity (STS) is an important building block for work on many fronts, such as information retrieval, text classification, document clustering, translation applications, dialogue-based interaction, and duplicate detection. The literature describes applications and techniques aimed largely at the English language. It also relies primarily on probabilistic resources, while linguistic aspects are used only incipiently. Work in the area stresses that linguistics plays a fundamental role in assessing semantic textual similarity, precisely because it extends the potential of exclusively probabilistic methods and avoids some of their failures (e.g. identifying distantly or closely related sentences, anaphora), which largely result from the lack of deeper treatment of aspects of the language. This is amplified for short sentences, the largest field of application for STS techniques, because such sentences carry a small amount of information, which reduces the capacity for efficient probabilistic treatment. It is therefore vital to identify and apply resources derived from a deeper study of the language in order to better understand what makes sentences similar. This work presents an approach for assessing semantic textual similarity between short sentences in Brazilian Portuguese. Its main distinguishing feature is a hybrid approach that uses distributed representations together with lexical and linguistic aspects.
To consolidate the study, a methodology was defined that allows various combinations of resources to be analyzed, making it possible to evaluate the gains introduced by adding linguistic aspects and by combining them with knowledge generated by other techniques. The proposed approach was evaluated on datasets known in the literature (PROPOR 2016) and obtained good results.
|
20 |
Using Word Embeddings to Explore the Language of Depression on Twitter
Gopchandani, Sandhya 01 January 2019
How do people discuss mental health on social media? Can we train a computer program to recognize differences between discussions of depression and other topics? Can an algorithm predict that someone is depressed from their tweets alone? In this project, we collect tweets referencing “depression” and “depressed” over a seven-year period, and train word embeddings to characterize linguistic structures within the corpus. We find that neural word embeddings capture the contextual differences between “depressed” and “healthy” language. We also look at how the context around words may have changed over time to gain a deeper understanding of contextual shifts in word usage. Finally, we train a deep learning network on a much smaller collection of tweets authored by individuals formally diagnosed with depression. The best-performing model for the prediction task is a Convolutional LSTM (CNN-LSTM) model with an F-score of 69% on test data. The results suggest social media could serve as a valuable screening tool for mental health.
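For readers unfamiliar with the metric reported above, the following sketch shows how an F-score is computed from binary predictions. The abstract does not say which variant was used, so the balanced F1 default (beta = 1), the function name, and the toy labels are assumptions for illustration only.

```python
def f_score(true_labels, predicted_labels, positive=1, beta=1.0):
    """F-beta score for one positive class; beta=1 gives the balanced F1."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical labels: 1 = tweet from a diagnosed user, 0 = control.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
score = f_score(y_true, y_pred)
```

Here two of the three positive cases are found and one negative is misflagged, so precision and recall are both 2/3 and the F1 score is 2/3 as well.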
|