1 |
Biomedical Semantic Embeddings: Using Hybrid Sentences to Construct Biomedical Word Embeddings and its Applications
Shaik, Arshad 12 1900
Word embeddings are a useful method that has shown enormous success in various NLP tasks, not only in the open domain but also in the biomedical domain. The biomedical domain provides various domain-specific resources and tools that can be exploited to improve the performance of these word embeddings. However, most of the research related to word embeddings in the biomedical domain focuses on analysis of model architecture, hyper-parameters and input text. In this paper, we use SemMedDB to design new sentences called 'Semantic Sentences'. We then use these sentences, in addition to biomedical text, as inputs to the word embedding model. This approach aims at introducing biomedical semantic types, defined by UMLS, into the vector space of word embeddings. The semantically rich word embeddings presented here rival state-of-the-art biomedical word embeddings in both semantic similarity and relatedness metrics, by up to 11%. We also demonstrate how these semantic types in word embeddings can be utilized.
|
2 |
Detecting Lexical Semantic Change Using Probabilistic Gaussian Word Embeddings
Moss, Adam January 2020
In this work, we test two novel methods of using word embeddings to detect lexical semantic change, attempting to overcome limitations associated with conventional approaches to this problem. Using a diachronic corpus spanning over a hundred years, we generate word embeddings for each decade with the intention of evaluating how meaning changes are represented in embeddings for the same word across time. Our approach differs from previous works in this field in that we encode words as probabilistic Gaussian distributions and bimodal probabilistic Gaussian mixtures, rather than conventional word vectors. We provide a discussion and analysis of our results, comparing the approaches we implemented with those used in previous works. We also conducted further analysis on whether additional information regarding the nature of semantic change could be discerned from particular qualities of the embeddings we generated for our experiments. In our results, we find that encoding words as probabilistic Gaussian embeddings can provide an enhanced degree of reliability with regard to detecting lexical semantic change. Furthermore, we are able to represent additional information regarding the nature of such changes through the variance of these embeddings. Encoding words as bimodal Gaussian mixtures however is generally unsuccessful for this task, proving to be not reliable enough at distinguishing between discrete senses to effectively detect and measure such changes. We provide potential explanations for the results we observe, and propose improvements that can be made to our approach to potentially improve performance.
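As a rough illustration of how a Gaussian embedding of the same word in two decades might be compared, the sketch below computes the KL divergence between two diagonal-covariance Gaussians. The abstract does not say which divergence the thesis uses, so the choice of KL, the diagonal covariance, and the function name `kl_diag_gaussians` are all assumptions made for this example.

```python
import math


def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for two Gaussians with diagonal covariance.

    mu_* are mean vectors and var_* are per-dimension variances
    (all variances must be strictly positive).
    """
    total = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        # Per-dimension term: log-variance ratio, variance ratio,
        # and squared mean shift scaled by the target variance.
        total += math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
    return 0.5 * total


# Hypothetical embeddings of one word in two decades (toy numbers).
word_1900s = ([0.0, 0.0], [1.0, 1.0])
word_2000s = ([1.0, 0.0], [1.0, 1.0])
shift = kl_diag_gaussians(*word_1900s, *word_2000s)
```

A larger value suggests that the word's distribution has moved or changed shape between the two periods. The variances matter as well as the means, which is one way the spread of an embedding can carry extra information about a change.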
|
3 |
A recurrent neural network architecture for biomedical event trigger classification
Bopaiah, Jeevith 01 January 2018
A “biomedical event” is a broad term used to describe the roles and interactions between entities (such as proteins, genes and cells) in a biological system. The task of biomedical event extraction aims at identifying and extracting these events from unstructured texts. An important component in the early stage of the task is biomedical trigger classification, which involves identifying and classifying words or phrases that indicate an event. In this thesis, we present our work on biomedical trigger classification developed using the multi-level event extraction dataset. We restrict the scope of our classification to 19 biomedical event types grouped under four broad categories: Anatomical, Molecular, General and Planned. While most of the existing approaches are based on traditional machine learning algorithms, which require extensive feature engineering, our model relies on neural networks to implicitly learn important features directly from the text. We use natural language processing techniques to transform the text into vectorized inputs that can be used in a neural network architecture. To our knowledge, this is the first time neural attention strategies have been explored in the area of biomedical trigger classification. Our best results were obtained from an ensemble of 50 models, which produced a micro F-score of 79.82%, an improvement of 1.3% over the previous best score.
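For readers unfamiliar with the metric reported above, a micro-averaged F-score pools true positives, false positives and false negatives over all classes before computing precision and recall. The sketch below shows that calculation. The class names and counts are made up for illustration and are not taken from the thesis.

```python
def micro_f1(counts):
    """Micro-averaged F1 from per-class counts.

    counts maps a class label to a dict with keys "tp", "fp", "fn".
    """
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for two trigger classes.
example_counts = {
    "Anatomical": {"tp": 8, "fp": 2, "fn": 1},
    "Molecular": {"tp": 2, "fp": 0, "fn": 3},
}
score = micro_f1(example_counts)
```

Because the counts are pooled, frequent classes dominate the score. A macro average would instead weight every class equally.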
|
4 |
TimeLink: Visualizing Diachronic Word Embeddings and Topics
Williams, Lemara Faith 11 June 2024
The task of analyzing a collection of documents generated over time is daunting. A natural way to ease the task is by summarizing documents into the topics that exist within these documents. The temporal aspect of topics can frame relevance based on when topics are introduced and when topics stop being mentioned. It creates trends and patterns that can be traced by individual key terms taken from the corpus. If trends are being established, there must be a way to visualize them through the key terms. Creating a visual system to support this analysis can help users quickly gain insights from the data, significantly easing the burden from the original analysis technique. However, creating a visual system for terms is not easy. Work has been done to develop word embeddings, allowing researchers to treat words like any number. This makes it possible to create simple charts based on word embeddings like scatter plots. However, these methods are inefficient due to loss of effectiveness with multiple time slices and point overlap. A visualization method that addresses these problems while also visualizing diachronic word embeddings in an interesting way with added semantic meaning is hard to find. These problems are managed through TimeLink. TimeLink is proposed as a dashboard system to help users gain insights from the movement of diachronic word embeddings. It comprises a Sankey diagram showing the path of a selected key term to a cluster in a time period. This local cluster is also mapped to a global topic based on an original corpus of documents from which the key terms are drawn. On the dashboard, different tools are given to users to aid in a focused analysis, such as filtering key terms and emphasizing specific clusters. TimeLink provides insightful visualizations focused on temporal word embeddings while maintaining the insights provided by global topic evolution, advancing our understanding of how topics evolve over time. 
/ Master of Science / The task of analyzing documents collected over time is daunting. Grouping documents into topics can help frame relevancy based on when topics are introduced and hampered. The creation of topics also enables the ability to visualize trends and patterns. Creating a visual system to support this analysis can help users quickly gain insights from the data, significantly easing the burden from the original analysis technique of browsing individual documents. A visualization system for this analysis typically focuses on the terms that affect established topics. Some visualization methods, like scatter plots, implement this but can be inefficient due to loss of effectiveness as more data is introduced. TimeLink is proposed as a dashboard system to aid users in drawing insights from the development of terms over time. In addition to addressing problems in other visualizations, it visualizes the movement of terms intuitively and adds semantic meaning. TimeLink provides insightful visualizations focused on the movement of terms while maintaining the insights provided by global topic evolution, advancing our understanding of how topics evolve over time.
|
5 |
Leveraging Degree of Isomorphism to Improve Cross-Lingual Embedding Space for Low-Resource Languages
Bhowmik, Kowshik January 2022
No description available.
|
6 |
[en] A FAST AND SPACE-ECONOMICAL APPROACH TO WORD MOVER'S DISTANCE / [pt] UMA ABORDAGEM RÁPIDA E ECONÔMICA PARA WORD MOVER'S DISTANCE
MATHEUS TELLES WERNER 02 April 2020
The Word Mover's Distance (WMD), proposed by Kusner et al. [ICML, 2015], is a distance between documents that takes advantage of the semantic relations among words captured by their word embeddings. This distance proved to be quite effective, obtaining state-of-the-art error rates for classification tasks, but it is also impracticable for large collections or large documents because it requires solving a transportation problem on a complete bipartite graph for each pair of documents. Using assumptions that are supported by empirical properties of the distances between word embeddings, we simplify WMD to obtain a new distance whose computation requires the solution of a max-flow problem in a sparse graph, which can be solved much faster than the transportation problem in a dense graph. Our experiments show that we can obtain a performance gain of up to three orders of magnitude over WMD while maintaining the same error rates in document classification tasks.
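To give a feel for why cheaper approximations of WMD are attractive, the sketch below implements the relaxed WMD lower bound described by Kusner et al., in which every word simply sends all of its weight to the nearest word of the other document. This is not the max-flow formulation of the thesis, whose details the abstract does not give. The toy vectors and the function names are assumptions made for this example.

```python
import math
from collections import Counter


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def relaxed_wmd(doc_a, doc_b, vectors):
    """Relaxed WMD lower bound between two tokenized documents.

    Each word moves its whole normalized weight to the closest word of
    the other document; the larger of the two directions is returned.
    """
    def one_sided(src, dst):
        counts = Counter(src)
        targets = set(dst)
        return sum(
            (n / len(src)) * min(euclidean(vectors[w], vectors[t]) for t in targets)
            for w, n in counts.items()
        )

    return max(one_sided(doc_a, doc_b), one_sided(doc_b, doc_a))


# Hypothetical 2-d embeddings, chosen so the distances are easy to check.
toy_vectors = {"cat": [0.0, 0.0], "dog": [3.0, 4.0]}
bound = relaxed_wmd(["cat", "dog"], ["cat"], toy_vectors)
```

The bound avoids solving a transportation problem altogether, at the price of possibly underestimating the true distance. In this toy case it happens to equal the exact WMD of 2.5.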
|
7 |
Word embeddings and Patient records: The identification of MRI risk patients
Kindberg, Erik January 2019
Identifying risks ahead of MRI examinations is considered a cumbersome and time-consuming process at the Linköping University Hospital radiology clinic. The hospital staff often have to search through large amounts of unstructured patient data to find information about implants. Word embeddings have been identified as a possible tool to speed up this process. The purpose of this thesis is to evaluate this method, which is done by training a Word2Vec model on patient journal data and analyzing the close neighbours of key search words by calculating cosine similarity. The 50 closest neighbours of each search word are categorized and annotated as relevant or not relevant to the task of identifying risk patients ahead of MRI examinations. Ten search words were explored, leading to a total of 500 terms being annotated. In total, 14 different categories were observed in the results, and 8 of these were considered relevant. Out of the 500 terms, 340 (68%) were considered relevant. In addition, 48 implant models could be observed, which are particularly interesting because if a patient has an implant, hospital staff need to determine its exact model and the MRI conditions of that model. Overall, these findings point towards a positive answer for the aim of the thesis, although further development is needed.
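The neighbour lookup described above can be sketched in a few lines: given a table of word vectors, rank all other terms by cosine similarity to a search word and keep the top k. The vectors and terms below are invented for illustration. In the thesis they would come from a Word2Vec model trained on patient journal data, and k would be 50.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_neighbours(query, vectors, k=50):
    """Return the k terms most similar to `query`, best first."""
    q = vectors[query]
    scored = [(term, cosine(q, vec)) for term, vec in vectors.items() if term != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical 2-d vectors standing in for a trained model.
toy_model = {
    "pacemaker": [1.0, 0.0],
    "implant": [0.9, 0.1],
    "stent": [0.7, 0.3],
    "appointment": [0.0, 1.0],
}
neighbours = nearest_neighbours("pacemaker", toy_model, k=2)
```

With these toy vectors the two closest terms to "pacemaker" are "implant" and then "stent". Each returned term would then be annotated by hand as relevant or not.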
|
8 |
Zpracování češtiny s využitím kontextualizované reprezentace / Czech NLP with Contextualized Embeddings
Vysušilová, Petra January 2021
With the increasing amount of digital data in the form of unstructured text, the importance of natural language processing (NLP) increases. The most successful technologies of recent years are deep neural networks. This work applies state-of-the-art methods, namely transfer learning of Bidirectional Encoder Representations from Transformers (BERT), to three Czech NLP tasks: part-of-speech tagging, lemmatization and sentiment analysis. We applied a BERT model with a simple classification head to three Czech sentiment datasets (mall, facebook, and csfd) and achieved state-of-the-art results. We also explored several possible architectures for tagging and lemmatization and obtained new state-of-the-art results in both tasks with a fine-tuning approach on data from the Prague Dependency Treebank. Specifically, we achieved an accuracy of 98.57% for tagging, 99.00% for lemmatization, and 98.19% for joint accuracy of both tasks. The best models for all tasks are publicly available.
|
9 |
Neural Methods for Event Extraction / Méthodes neuronales pour l'extraction d'événements
Boroş, Emanuela 27 September 2018
With the increasing amount of data and the exploding number of data sources, the extraction of information about events, whether from the perspective of acquiring knowledge or from a more directly operational perspective, becomes a more and more obvious need. This extraction nevertheless comes up against a recurring difficulty: most of the information is present in documents in textual form, and is thus unstructured and difficult for a machine to grasp. From the point of view of Natural Language Processing (NLP), the extraction of events from texts is the most complex form of Information Extraction (IE), which more generally encompasses the extraction of named entities and of the relationships that bind them in the texts.
The event extraction task can be represented as a complex combination of relations linked to a set of empirical observations from texts. Compared to relations involving only two entities, there is, therefore, a new dimension that often requires going beyond the scope of the sentence, which constitutes an additional difficulty. In practice, an event is described by a trigger and a set of participants in that event whose values are text excerpts. While IE research has benefited significantly from manually annotated datasets to learn patterns for text analysis, the availability of these resources remains a significant problem. These datasets are often obtained through the sustained efforts of research communities, potentially complemented by crowdsourcing. In addition, many machine learning-based IE approaches rely on the ability to extract large sets of manually defined features from text using sophisticated NLP tools. As a result, adaptation to a new domain is an additional challenge. This thesis presents several strategies for improving the performance of an Event Extraction (EE) system using neural-based approaches exploiting morphological, syntactic, and semantic properties of word embeddings. These have the advantage of not requiring a priori modeling domain knowledge and automatically generate a much larger set of features to learn a model. More specifically, we proposed different deep learning models for two sub-tasks related to EE: event detection and argument detection and classification. Event Detection (ED) is considered an important subtask of event extraction since the detection of arguments is very directly dependent on its outcome. ED specifically involves identifying instances of events in texts and classifying them into specific event types. Classically, the same event may appear as different expressions and these expressions may themselves represent different events in different contexts, hence the difficulty of the task. 
The detection of the arguments is based on the detection of the expression considered as triggering the event and ensures the recognition of the participants of the event. Among the difficulties to take into account, it should be noted that an argument can be common to several events and that it does not necessarily identify with an easily recognizable named entity. As a preliminary to the introduction of our proposed models, we begin by presenting in detail a state-of-the-art model which constitutes the baseline. In-depth experiments are conducted on the use of different types of word embeddings and the influence of the different hyperparameters of the model using the ACE 2005 evaluation framework, a standard evaluation for this task. We then propose two new models to improve an event detection system. One allows increasing the context taken into account when predicting an event instance by using a sentential context, while the other exploits the internal structure of words by taking advantage of seemingly less obvious but essentially important morphological knowledge. We also reconsider the detection of arguments as a high-order relation extraction and we analyze the dependence of arguments on the ED task.
|
10 |
Longitudinal Comparison of Word Associations in Shallow Word Embeddings
Geetanjali Bihani (8815607) 08 May 2020
Word embeddings are utilized in various natural language processing tasks. Although effective in helping computers learn linguistic patterns employed in natural language, word embeddings also tend to learn unwanted word associations. This affects the performance of NLP tasks, as unwanted word associations propagate and amplify biases. Current word association evaluation methods for word embeddings do not account for changes in word embedding models and training corpora when creating the rubric for word association evaluation. Current literature also lacks a consistent training and evaluation protocol for comparison of word associations across varying word embedding models and varying training corpora. In order to address this gap in prior literature, this research aims to evaluate different types of word associations, not limited to gender, racial or religious attributes, incorporating and evaluating the diachronic and variable nature of words over text data collected over a period of 200 years. This thesis introduces a framework to track changes in word associations between neutral words (proper nouns) and attributes (adjectives), across different word embedding models, over a temporal dimension, by evaluating clustering tendencies between neutral words (proper nouns) and attributive words (adjectives) over five different word embedding frameworks: Word2vec (CBOW), Word2vec (Skip-gram), GloVe, fastText (CBOW) and fastText (Skip-gram) and 20 decades of text data from 1810s to 2000s. Finally, various cluster level and corpus level measurements will be compared across aforementioned word embedding frameworks, to find how word associations evolve with changes in the embedding model and the training corpus.
|