21 |
Aide à l'identification de relations lexicales au moyen de la sémantique distributionnelle et son application à un corpus bilingue du domaine de l'environnement. Bernier-Colborne, Gabriel, 08 1900
Identifying semantic relations is one of the main tasks involved in terminology work. This task, which aims to establish links between terms whose meanings are related, can be assisted by computational methods, including those based on distributional semantics. These methods estimate the semantic similarity of words based on corpus data, which can help terminologists identify semantic relations.
The quality of the results produced by distributional methods depends on several decisions that must be made when applying them, such as choosing a model and selecting its parameters. In turn, these decisions depend on various factors related to the target application, such as the types of semantic relations one wishes to identify. These can include typical paradigmatic relations such as (near-)synonymy (e.g. preserve -> protect), but also other relations such as syntactic derivation (e.g. preserve -> preservation).
This dissertation aims to further the development of a methodological framework based on distributional semantics for the identification of semantic relations using specialized corpora. To this end, we investigate how various aspects of terminology work must be accounted for when selecting a distributional semantic model and its parameters, as well as those of the method used to query the model. These aspects include the descriptive framework, the target relations, the part of speech of the terms being described, and the language (in this case, French or English).
Our results show that two of the relations that distributional semantic models capture most accurately are (near-)synonymy and syntactic derivation. However, the models that produce the best results for these two relations are very different. Thus, the target relations are an important factor to consider when choosing a model and tuning it to obtain the most accurate results.
Another factor that should be considered is the part of speech of the terms being described. Among other things, our results suggest that distributional semantic models capture relations between verbs less accurately than relations between nouns or adjectives.
The descriptive framework used for a given project is also an important factor to consider. In this work, we compare two descriptive frameworks, one based on lexical semantics and another based on frame semantics. Our results show that terms that evoke the same semantic frame are not captured as accurately as certain semantic relations, such as synonymy. We show that this is due to (at least) two factors: a high percentage of frame-evoking terms are verbs, and the models that capture syntactic derivation most accurately are very different from those that work best for typical paradigmatic relations such as synonymy.
In summary, we evaluate two different distributional semantic models, we analyze the influence of their parameters, and we investigate how this influence varies with respect to various aspects of terminology work. We show many examples of distributional neighbourhoods, which we explore using graphs, and discuss sources of noise. This dissertation thus provides important guidelines for the use of distributional semantic models for terminology work.
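As an illustration of the kind of distributional query described above, the sketch below ranks a term's nearest distributional neighbours by cosine similarity. It is a minimal sketch rather than the dissertation's actual models: the vector file format, its path and the example terms are assumptions.

```python
import numpy as np

def load_vectors(path):
    """Load whitespace-separated word vectors: word v1 v2 ... (hypothetical format)."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split()
            vectors[parts[0]] = np.array(parts[1:], dtype=float)
    return vectors

def nearest_neighbours(word, vectors, k=5):
    """Rank the rest of the vocabulary by cosine similarity to `word`."""
    target = vectors[word] / np.linalg.norm(vectors[word])
    scores = []
    for other, vec in vectors.items():
        if other == word:
            continue
        scores.append((other, float(target @ (vec / np.linalg.norm(vec)))))
    return sorted(scores, key=lambda x: x[1], reverse=True)[:k]

# e.g. nearest_neighbours("preserve", load_vectors("vectors.txt")) might rank
# "protect" (a near-synonym) or "preservation" (a syntactic derivative) highly,
# depending on how the model is parameterized.
```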
|
22 |
Functional distributional semantics : learning linguistically informed representations from a precisely annotated corpus. Emerson, Guy Edward Toh, January 2018
The aim of distributional semantics is to design computational techniques that can automatically learn the meanings of words from a body of text. The twin challenges are: how do we represent meaning, and how do we learn these representations? The current state of the art is to represent meanings as vectors - but vectors do not correspond to any traditional notion of meaning. In particular, there is no way to talk about 'truth', a crucial concept in logic and formal semantics. In this thesis, I develop a framework for distributional semantics which answers this challenge. The meaning of a word is not represented as a vector, but as a 'function', mapping entities (objects in the world) to probabilities of truth (the probability that the word is true of the entity). Such a function can be interpreted both in the machine learning sense of a classifier, and in the formal semantic sense of a truth-conditional function. This simultaneously allows both the use of machine learning techniques to exploit large datasets, and also the use of formal semantic techniques to manipulate the learnt representations. I define a probabilistic graphical model, which incorporates a probabilistic generalisation of model theory (allowing a strong connection with formal semantics), and which generates semantic dependency graphs (allowing it to be trained on a corpus). This graphical model provides a natural way to model logical inference, semantic composition, and context-dependent meanings, where Bayesian inference plays a crucial role. I demonstrate the feasibility of this approach by training a model on WikiWoods, a parsed version of the English Wikipedia, and evaluating it on three tasks. The results indicate that the model can learn information not captured by vector space models.
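The core idea, a word's meaning as a function mapping entities to probabilities of truth, can be sketched as a probabilistic classifier over entity features. This is a minimal illustration of that reading, not the thesis's graphical model; the feature space and all numbers below are invented.

```python
import numpy as np

def make_word_classifier(weights, bias):
    """A word's meaning as a function: entity features -> probability of truth."""
    def prob_true(entity):
        return 1.0 / (1.0 + np.exp(-(weights @ entity + bias)))
    return prob_true

# Hypothetical 3-dimensional entity space (features entirely made up).
dog = np.array([1.0, 0.2, 0.0])
stone = np.array([0.0, 0.9, 0.1])

animate = make_word_classifier(np.array([4.0, -1.0, 0.0]), -1.0)
print(animate(dog))    # high probability: "animate" is likely true of this entity
print(animate(stone))  # low probability: "animate" is likely false of this entity
```

The same object supports both perspectives mentioned above: as a classifier it can be trained on data, and as a truth-conditional function it can feed into logical manipulation of the learnt representations.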
|
23 |
A corpus-based study of the causative alternation in English. Romain, Laurence, 05 October 2018
The present research takes issue with the supposed dichotomy between alternations on the one hand and surface generalisations on the other, within the framework of construction grammar. More specifically, the aim of this thesis is threefold. Through the careful analysis of a large dataset, we aim to provide a thorough description of the causative alternation in English (The fabric stretched vs. Joan stretched the fabric), suggest a method that allows for a solid measure of a verb’s alternation strength and of the amount of shared meaning between two constructions, and finally, show that in order to capture constraints at the level of the construction, one must pay attention to lower-level generalisations such as the interaction between verb and arguments within the scope of each construction. In an effort to add to the discussion on alternation vs. surface generalisations, we propose a detailed study of the two constructions that make up the causative alternation: the intransitive non-causative construction and the transitive causative construction. Our goal is to measure the amount of meaning shared by the two constructions and also show the differences between the two. In order to do so we take three elements into account: construction, verb and theme (i.e. the entity that undergoes the event denoted by the verb). We use distributional semantics to measure the semantic similarity of the various themes found with each verb and each construction in our corpus. This grouping highlights the different verb senses used with each construction and allows us to draw generalisations as to the constraints on the theme in each construction.
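The theme-grouping step described above can be sketched as hierarchical clustering over cosine distances between theme vectors. The vectors below are random stand-ins for corpus-derived ones, and the theme list, threshold and linkage choice are illustrative assumptions, not the study's actual settings.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical themes attested with "stretch" in one of the two constructions.
themes = ["fabric", "rubber", "muscle", "road", "coastline"]
X = np.random.rand(len(themes), 50)  # stand-in for corpus-derived theme vectors

# Average-linkage clustering over pairwise cosine distances between themes.
Z = linkage(pdist(X, metric="cosine"), method="average")
labels = fcluster(Z, t=0.5, criterion="distance")

for theme, label in zip(themes, labels):
    print(label, theme)  # themes in the same cluster suggest a shared verb sense
```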
|
24 |
A Markovian approach to distributional semantics. Grave, Edouard, 20 January 2014
This thesis, which is organized in two independent parts, presents work on distributional semantics and on variable selection. In the first part, we introduce a new method for learning good word representations using large quantities of unlabeled sentences. The method is based on a probabilistic model of sentences, using a hidden Markov model and a syntactic dependency tree. The latent variables, which correspond to the nodes of the dependency tree, aim at capturing the meanings of the words. We develop an efficient algorithm to perform inference and learning in those models, based on online EM and approximate message passing. We then evaluate our models on intrinsic tasks such as predicting human similarity judgements or word categorization, and on two extrinsic tasks: named entity recognition and supersense tagging. In the second part, we introduce, in the context of linear models, a new penalty function to perform variable selection in the case of highly correlated predictors. This penalty, called the trace Lasso, uses the trace norm of the selected predictors, which is a convex surrogate of their rank, as the criterion of model complexity. The trace Lasso interpolates between the $\ell_1$-norm and $\ell_2$-norm. In particular, it is equal to the $\ell_1$-norm if all predictors are orthogonal and to the $\ell_2$-norm if all predictors are equal. We propose two algorithms to compute the solution of least-squares regression regularized by the trace Lasso, and perform experiments on synthetic datasets to illustrate the behavior of the trace Lasso.
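The two limiting cases of the trace Lasso can be checked numerically. The sketch below assumes the standard definition of the penalty as the nuclear (trace) norm of $X\,\mathrm{diag}(w)$ for a design matrix $X$ with unit-norm columns; it is a small numeric illustration, not the thesis's optimization algorithms.

```python
import numpy as np

def trace_lasso(X, w):
    """Trace Lasso penalty: nuclear norm of X @ diag(w), columns of X unit-norm."""
    return np.linalg.norm(X @ np.diag(w), ord="nuc")

w = np.array([3.0, -4.0])

# Orthogonal predictors: the penalty reduces to the l1-norm of w (= 7).
X_orth = np.eye(2)
print(trace_lasso(X_orth, w), np.sum(np.abs(w)))

# Identical predictors: the penalty reduces to the l2-norm of w (= 5).
x = np.array([[1.0], [0.0]])
X_same = np.hstack([x, x])
print(trace_lasso(X_same, w), np.linalg.norm(w))
```

Between these two extremes, correlated predictor groups are penalized somewhere between $\ell_1$ and $\ell_2$, which is what makes the penalty suited to highly correlated designs.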
|
25 |
Extracting Clinical Findings from Swedish Health Record Text. Skeppstedt, Maria, January 2014
Information contained in the free text of health records is useful for the immediate care of patients as well as for medical knowledge creation. Advances in clinical language processing have made it possible to automatically extract this information, but most research has, until recently, been conducted on clinical text written in English. In this thesis, however, information extraction from Swedish clinical corpora is explored, particularly focusing on the extraction of clinical findings. Unlike in most previous studies, Clinical Finding was divided into two more granular sub-categories: Finding (symptom/result of a medical examination) and Disorder (condition with an underlying pathological process). For detecting clinical findings mentioned in Swedish health record text, a machine learning model, trained on a corpus of manually annotated text, achieved results in line with the obtained inter-annotator agreement figures. The machine learning approach clearly outperformed an approach based on vocabulary mapping, showing that Swedish medical vocabularies are not extensive enough for the purpose of high-quality information extraction from clinical text. A rule and cue vocabulary-based approach was, however, successful for negation and uncertainty classification of detected clinical findings. Methods for facilitating expansion of medical vocabulary resources are particularly important for Swedish and other languages with less extensive vocabulary resources. The possibility of using distributional semantics, in the form of Random indexing, for semi-automatic expansion of medical vocabularies was, therefore, evaluated. Distributional semantics does not require that terms or abbreviations are explicitly defined in the text, and it is, thereby, a method suitable for clinical corpora. Random indexing was shown useful for extending vocabularies with medical terms, as well as for extracting medical synonyms and abbreviation dictionaries.
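Random indexing itself is simple enough to sketch: each context word is assigned a sparse ternary random index vector, and a term's context vector accumulates the index vectors of its neighbours within a window. The snippet below is a toy illustration with invented sentences, not the thesis's setup; the dimensionality, window size and sparsity are arbitrary choices.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
DIM, NONZERO = 1000, 8  # vector dimensionality; number of +/-1 entries per index vector

def index_vector():
    """Sparse ternary random vector: the fixed signature of one context word."""
    v = np.zeros(DIM)
    pos = rng.choice(DIM, size=NONZERO, replace=False)
    v[pos] = rng.choice([-1.0, 1.0], size=NONZERO)
    return v

def random_indexing(sentences, window=2):
    """Accumulate context vectors as sums of the index vectors of nearby words."""
    index = defaultdict(index_vector)
    context = defaultdict(lambda: np.zeros(DIM))
    for sent in sentences:
        for i, word in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if i != j:
                    context[word] += index[sent[j]]
    return context

# Toy corpus standing in for clinical text (examples invented).
corpus = [["patient", "reports", "severe", "headache"],
          ["patient", "reports", "severe", "migraine"]]
vecs = random_indexing(corpus)
# "headache" and "migraine" share contexts, so their context vectors correlate,
# which is the signal used for synonym and term candidate extraction.
```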
|
26 |
Semantic Spaces of Clinical Text : Leveraging Distributional Semantics for Natural Language Processing of Electronic Health Records. Henriksson, Aron, January 2013
The large amounts of clinical data generated by electronic health record systems are an underutilized resource, which, if tapped, has enormous potential to improve health care. Since the majority of this data is in the form of unstructured text, which is challenging to analyze computationally, there is a need for sophisticated clinical language processing methods. Unsupervised methods that exploit statistical properties of the data are particularly valuable due to the limited availability of annotated corpora in the clinical domain. Information extraction and natural language processing systems need to incorporate some knowledge of semantics. One approach exploits the distributional properties of language – more specifically, term co-occurrence information – to model the relative meaning of terms in high-dimensional vector space. Such methods have been used with success in a number of general language processing tasks; however, their application in the clinical domain has previously only been explored to a limited extent. By applying models of distributional semantics to clinical text, semantic spaces can be constructed in a completely unsupervised fashion. Semantic spaces of clinical text can then be utilized in a number of medically relevant applications. The application of distributional semantics in the clinical domain is here demonstrated in three use cases: (1) synonym extraction of medical terms, (2) assignment of diagnosis codes and (3) identification of adverse drug reactions. To apply distributional semantics effectively to a wide range of both general and, in particular, clinical language processing tasks, certain limitations or challenges need to be addressed, such as how to model the meaning of multiword terms and account for the function of negation: a simple means of incorporating paraphrasing and negation in a distributional semantic framework is here proposed and evaluated. The notion of ensembles of semantic spaces is also introduced; these are shown to outperform the use of a single semantic space on the synonym extraction task. This idea allows different models of distributional semantics, with different parameter configurations and induced from different corpora, to be combined. This is not least important in the clinical domain, as it allows potentially limited amounts of clinical data to be supplemented with data from other, more readily available sources. The importance of configuring the dimensionality of semantic spaces, particularly when – as is typically the case in the clinical domain – the vocabulary grows large, is also demonstrated. / High-Performance Data Mining for Drug Effect Detection (DADEL)
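One simple way to realise an ensemble of semantic spaces for synonym extraction is to sum a candidate's similarity to the query term across several spaces. The sketch below assumes this score-summing combination and dictionary-of-vectors models; the combination strategies actually evaluated in the thesis may differ.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def ensemble_synonyms(term, spaces, k=5):
    """Combine synonym-candidate rankings from several semantic spaces.

    `spaces` is a list of {word: vector} models, e.g. induced from different
    corpora or with different parameter configurations. Candidates are scored
    by the sum of their cosine similarities to `term` across all spaces.
    """
    scores = {}
    for space in spaces:
        target = space.get(term)
        if target is None:
            continue  # the term may be missing from some spaces
        for word, vec in space.items():
            if word != term:
                scores[word] = scores.get(word, 0.0) + cosine(target, vec)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:k]
```

A design note: because each space contributes additively, a space induced from a small clinical corpus can be supplemented by spaces from larger general-domain corpora without either dominating the ranking outright.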
|
27 |
Explaining complexity in human language processing : a distributional semantic model. Chersoni, Emmanuele, 04 July 2018
The present work deals with the problem of semantic complexity in natural language, proposing a hypothesis based on some features of natural language sentences that determine their difficulty for human understanding. We aim at introducing a general framework for semantic complexity, in which the processing difficulty depends on the interaction between two components: a Memory component, which is responsible for the storage of corpus-extracted event representations, and a Unification component, which is responsible for combining the units stored in Memory into more complex structures. We propose that semantic complexity depends on the difficulty of building a semantic representation of the event or the situation conveyed by a sentence, which can be either retrieved directly from semantic memory or built dynamically by solving the constraints included in the stored representations. In order to test our intuitions, we built a Distributional Semantic Model to compute a compositional cost for the sentence unification process. Our tests on several psycholinguistic datasets showed that our model is able to account for semantic phenomena such as the context-sensitive update of argument expectations and of logical metonymies.
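Argument expectations of the kind mentioned above are often operationalised as thematic fit: a candidate argument is scored against a prototype built from typical fillers of a role. The sketch below assumes that prototype-based formulation and corpus-derived vectors; the function names and example words are illustrative only, not the thesis's implementation.

```python
import numpy as np

def thematic_fit(candidate, typical_fillers, vectors):
    """Score a candidate argument against a prototype of typical role fillers."""
    prototype = np.mean([vectors[w] for w in typical_fillers], axis=0)
    cand = vectors[candidate]
    return float(prototype @ cand /
                 (np.linalg.norm(prototype) * np.linalg.norm(cand)))

# With corpus-derived vectors, thematic_fit("beer", ["wine", "coffee", "water"], vecs)
# should exceed thematic_fit("desk", ["wine", "coffee", "water"], vecs) as the
# object of "drink". A context-sensitive update of expectations could then be
# modelled by re-weighting or replacing the fillers used to build the prototype.
```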
|