1 |
A Graph Approach to Measuring Text Distance
Tsang, Vivian 26 February 2009
Text comparison is a key step in many natural language processing (NLP)
applications in which texts can be classified on the basis of their semantic
distance (how similar or different the texts are). For example, comparing the
local context of an ambiguous word with that of a known word can help identify
the sense of the ambiguous word. Typically, a distributional measure is used
to capture the implicit semantic distance between two pieces of text. In this
thesis, we introduce an alternative method of measuring the semantic distance
between texts as a combination of distributional information and
relational/ontological knowledge. In this work, we propose a novel distance
measure within a network-flow formalism that combines these two distinct
components in a way that they are not treated as separate and orthogonal
pieces of information. First, we represent each text as a collection of
frequency-weighted concepts within a relational thesaurus. Then, we make use
of a network-flow method which provides an efficient way of measuring the
semantic distance between two texts by taking advantage of the inherently
graphical structure in an ontology. We evaluate our method in a variety of
NLP tasks.
In our task-based evaluation, we find that our method performs well on two of
three tasks. We introduce a novel measure which is intended to capture how
well our network-flow method performs on a dataset (represented as a collection
of frequency-weighted concepts). In our analysis, we find that an integrated
approach, rather than a purely distributional or graphical analysis, is more
effective in explaining the performance inconsistency.
Finally, we address a complexity issue that arises from the overhead
required to incorporate more sophisticated concept-to-concept distances
into the network-flow framework. We propose a graph transformation
method which generates a pared-down network that requires less time to
process. The new method achieves a significant speed improvement, and
does not seriously hamper performance as a result of the transformation,
as indicated in our analysis.
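To make the idea of moving frequency mass through an ontology concrete, here is a minimal Python sketch. It is not the thesis's algorithm: it assumes a tree-shaped taxonomy, unit edge lengths by default, and two concept profiles with the same total mass, in which case the minimum transport cost has a simple closed form. The function name, the toy taxonomy, and the profiles are all hypothetical.

```python
from collections import defaultdict


def tree_transport_distance(parent, profile_a, profile_b, edge_weight=None):
    """Minimum cost of moving the mass of profile_a onto profile_b along tree edges.

    parent maps every concept to its parent (the root maps to None).
    edge_weight optionally maps (child, parent) to a length; the default is 1.0.
    Assumes both profiles sum to the same total mass.
    """
    edge_weight = edge_weight or {}

    # Net surplus of mass at each concept: positive where text A has more.
    surplus = defaultdict(float)
    for concept, mass in profile_a.items():
        surplus[concept] += mass
    for concept, mass in profile_b.items():
        surplus[concept] -= mass

    def depth(concept):
        d = 0
        while parent[concept] is not None:
            concept = parent[concept]
            d += 1
        return d

    cost = 0.0
    # Visit the deepest concepts first so each surplus is pushed upward once.
    for concept in sorted(parent, key=depth, reverse=True):
        up = parent[concept]
        if up is None:
            continue  # the root has no edge above it
        flow = surplus[concept]
        cost += edge_weight.get((concept, up), 1.0) * abs(flow)
        surplus[up] += flow
    return cost


# Hypothetical toy taxonomy and frequency-weighted concept profiles.
taxonomy = {
    "entity": None,
    "animal": "entity",
    "vehicle": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
}
text_a = {"dog": 0.5, "cat": 0.5}
text_b = {"dog": 0.5, "car": 0.5}

print(tree_transport_distance(taxonomy, text_a, text_b))  # 2.0
```

In this toy case half of the mass has to travel from "cat" to "car" over four edges, hence the cost of 2.0. A general ontology graph would need a proper minimum-cost flow solver instead of this tree shortcut.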
|
3 |
利用WordNet建立證券領域的語意結構 / Building a Semantic Structure for the Securities Domain Using WordNet
游舒帆 (Yu, Shu Fan) Unknown Date
This study examines whether WordNet, the online dictionary developed at Princeton University, is suitable for expressing semantic structure. We first discuss the architecture of WordNet and then review the literature on building semantic structures with it, in order to establish the state of previous research and propose improvements for its shortcomings. Finally, we validate and revise the model, aiming for a more representative and complete WordNet semantic structure.
This study combines the semantic distance model of Jarmasz and Szpakowicz (2001) with the similarity model of Resnik (1995). The two models are used to compute the distance between terms, and that distance is used to identify the semantic relation between them. We use 117 securities exam questions to verify the correctness and completeness of the structure and revise it where it falls short, in order to obtain better results.
This research has four main limitations:
1. We cannot cover all the terms of the securities domain, and the relations among them, at once.
2. The test questions cannot fully represent every possible query.
3. The final results come from running a similarity algorithm rather than from actually building and modifying the WordNet system, so they may differ from tests on a real system.
4. We did not define all the relations in WordNet, only a few of the more representative lexical relations, which may affect the results in the details.
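As an illustration of the Resnik (1995) similarity mentioned above, the sketch below scores two concepts by the information content, -log p(c), of their most informative common ancestor. The taxonomy, the concept names, and the probabilities are invented for the example; the thesis's actual data and its combination with the Jarmasz and Szpakowicz distance are not reproduced here.

```python
import math


def ancestors(parent, concept):
    """Return the concept followed by all of its ancestors up to the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = parent[concept]
    return chain


def resnik_similarity(parent, prob, c1, c2):
    """Information content of the most informative concept subsuming c1 and c2."""
    common = set(ancestors(parent, c1)) & set(ancestors(parent, c2))
    return max(-math.log(prob[c]) for c in common)


# Hypothetical securities taxonomy; prob[c] is the chance of meeting c or any
# of its descendants in a corpus, so the root has probability 1.
parent = {
    "security": None,
    "equity": "security",
    "bond": "security",
    "common_stock": "equity",
    "preferred_stock": "equity",
}
prob = {
    "security": 1.0,
    "equity": 0.5,
    "bond": 0.5,
    "common_stock": 0.25,
    "preferred_stock": 0.25,
}

close = resnik_similarity(parent, prob, "common_stock", "preferred_stock")
far = resnik_similarity(parent, prob, "common_stock", "bond")
print(close > far)  # True
```

The two kinds of stock share the specific ancestor "equity", while a stock and a bond share only the root, whose information content is zero.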
|
4 |
Détection de termes sémantiquement proches : clustering non supervisé basé sur les relations sémantiques et le degré d'apparenté sémantique / Detection of terms semantically close : unsupervised clustering based on semantic relations and the degree of related semantic
Dupuch, Marie 19 September 2014
The use of equivalent or semantically close terms is necessary to increase the coverage and sensitivity of applications such as information retrieval and extraction or the semantic annotation of documents. In the context of identifying adverse drug reactions, sensitivity is also sought in order to detect spontaneous reports more exhaustively and to better monitor drug risk. This is what motivates our work. In this thesis, we seek to detect semantically close terms and to group them together using several methods: unsupervised clustering algorithms, terminological resources exploited through terminological reasoning, and Natural Language Processing methods such as terminology structuring, where we aim to detect hierarchical and synonymy relations. We conducted many experiments and evaluations of the generated clusters, which show that the proposed methods can contribute effectively to the task in question.
|
5 |
ViewpointS : vers une émergence de connaissances collectives par élicitation de point de vue / ViewpointS : collective knowledge emerging from viewpoints elicitation
Surroca, Guillaume 30 June 2017
Today's Web is made up of, among other things, two types of content: the structured, linked data of the Semantic Web and the subjective contributions of users of the Social Web. The ViewpointS approach was designed as an integrative formalism capable of mixing these two types of content while preserving the subjectivity of the interactions of the Social Web. ViewpointS is a subjective knowledge representation approach. Knowledge is represented by means of viewpoints, which are micro-expressions of individual semantics declaring the proximity of two Web resources. The approach also provides a second level of subjectivity: viewpoints can be interpreted differently depending on the user, through the perspective mechanism. Subjectivity is therefore present both in the knowledge that is captured and in the way it is exploited. As a complement to top-down approaches, where the collective semantics of a group is established by consensus, the collective semantics of a ViewpointS community emerges bottom-up from the exchange and confrontation of viewpoints and evolves fluidly as they are emitted. In our framework, Web resources are represented and linked by viewpoints in a Knowledge Graph. At use time, the viewpoints between two resources are aggregated into a “synapse”. From the Knowledge Graph containing viewpoints and Web resources, a Knowledge Map consisting of synapses and resources is created as the result of interpreting and aggregating the viewpoints. The evolution of ViewpointS synapses may be considered analogous to that of synapses in the brain, in the very simple sense that each viewpoint contributes to the establishment, strengthening or weakening of a synapse that connects two resources. The exchange of viewpoints is the selection process governing the evolution of the synapses, like the selectionist process within the brain.
We investigate in this study the potential impact of our subjective representation of knowledge in various fields: information search, recommendation, multilingual ontology alignment and methods for calculating semantic distances.
|
6 |
Alternative Approaches to Correction of Malapropisms in AIML Based Conversational Agents
Brock, Walter A. 26 November 2014
The use of Conversational Agents (CAs) utilizing Artificial Intelligence Markup Language (AIML) has been studied in a number of disciplines. Previous research has shown a great deal of promise. It has also documented significant limitations in the abilities of these CAs. Many of these limitations are related specifically to the method employed by AIML to resolve ambiguities in the meaning and context of words. While methods exist to detect and correct common errors in spelling and grammar of sentences and queries submitted by a user, one class of input error that is particularly difficult to detect and correct is the malapropism. In this research a malapropism is defined as a "verbal blunder in which one word is replaced by another similar in sound but different in meaning" ("malapropism," 2013).
This research explored the use of alternative methods of correcting malapropisms in sentences input to AIML CAs using measures of Semantic Distance and tri-gram probabilities. Results of these alternate methods were compared against AIML CAs using only the Symbolic Reductions built into AIML.
This research found that the use of the two methodologies studied here did indeed lead to a small, but measurable improvement in the performance of the CA in terms of the appropriateness of its responses as classified by human judges. However, it was also noted that in a large number of cases, the CA simply ignored the existence of a malapropism altogether in formulating its responses. In most of these cases, the interpretation and response to the user's input was of such a general nature that one might question the overall efficacy of the AIML engine. The answer to this question is a matter for further study.
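The role of tri-gram probabilities in choosing between a suspect word and a replacement can be sketched as follows. This is only an illustration under stated assumptions, not the procedure used in the research: the two-sentence corpus is invented, add-one smoothing is an arbitrary choice, and the candidate list is supplied by hand, whereas a real system would have to generate candidates from similar-sounding words.

```python
from collections import Counter


def train_trigrams(sentences):
    """Count trigrams and their two-word contexts in a list of sentences."""
    trigrams, contexts, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>", "<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(tokens)
        for i in range(len(tokens) - 2):
            contexts[tuple(tokens[i:i + 2])] += 1
            trigrams[tuple(tokens[i:i + 3])] += 1
    return trigrams, contexts, len(vocab)


def trigram_probability(model, w1, w2, w3):
    """Add-one smoothed estimate of P(w3 | w1, w2)."""
    trigrams, contexts, vocab_size = model
    return (trigrams[(w1, w2, w3)] + 1) / (contexts[(w1, w2)] + vocab_size)


def pick_candidate(model, left_context, candidates):
    """Return the candidate word that is most probable after left_context."""
    w1, w2 = left_context
    return max(candidates, key=lambda w: trigram_probability(model, w1, w2, w))


# Hypothetical toy corpus.
corpus = [
    "he is the pinnacle of politeness",
    "she reached the pinnacle of success",
]
model = train_trigrams(corpus)

# User input: "he is the pineapple of politeness"
choice = pick_candidate(model, ("is", "the"), ["pineapple", "pinnacle"])
print(choice)  # pinnacle
```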
|
7 |
Recomendação de conteúdo baseada em informações semânticas extraídas de bases de conhecimento / Content recommendation based on semantic information extracted from knowledge bases
Silva Junior, Salmo Marques da 10 May 2017
To support users as they consume products, Web systems have incorporated item recommendation modules. The most popular approaches are content-based filtering, which recommends items based on features that interest the user, and collaborative filtering, which recommends items that were rated well by users with profiles similar to the target user's, or items similar to those the target user rated well. While the first approach has limitations such as overspecialization and limited content analysis, the second suffers from the new-user and new-item problems, also known as cold start. Despite the variety of techniques available, a problem common to most approaches is the lack of semantic information to represent the items in the collection. Recent work in the field of recommender systems has studied the possibility of using knowledge bases from the Web as a source of semantic information. However, it is still necessary to investigate how to use such information and integrate it efficiently into recommender systems. This work therefore investigates how semantic information gathered from knowledge bases can benefit recommender systems through the semantic description of items, and how computing semantic similarity can mitigate the challenge faced in the cold-start scenario. As a result, we obtain a technique that can produce recommendations suited to users' profiles, including relevant new items in the collection. An improvement of up to 10% in RMSE is observed in the cold-start scenario when the proposed system is compared with a system whose rating prediction is based on the correlation of ratings.
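For readers unfamiliar with the metric, the sketch below shows how RMSE and a relative improvement between two systems are computed. The ratings are made up for illustration and are not taken from the thesis; the function names are hypothetical.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty rating lists of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def relative_improvement(baseline_rmse, new_rmse):
    """Fraction by which new_rmse is lower than baseline_rmse."""
    return (baseline_rmse - new_rmse) / baseline_rmse


# Made-up ratings on a 1 to 5 scale.
actual = [4, 2, 5, 3]
baseline_predictions = [3, 3, 4, 4]
semantic_predictions = [4, 3, 4, 3]

baseline = rmse(baseline_predictions, actual)
proposed = rmse(semantic_predictions, actual)
print(relative_improvement(baseline, proposed) > 0)  # True
```

Lower RMSE is better, so a positive relative improvement means the second system predicts ratings more accurately on this data.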
|
9 |
The flexibility of the language production system
Rose, Sebastian 17 November 2016
The selection of an appropriate word from among meaning-related competitors is a main function of language production. Recent inconclusive findings have cast doubt on traditional accounts of lexical selection. The swinging lexical network (SLN) account presents a competitive framework that formulates specific conditions under which semantic facilitation or interference effects can be observed in picture naming paradigms. These conditions concern a) the manipulation of the trade-off between conceptual facilitation and lexical interference, b) the extent of lexical cohort activation and c) the flexible nature of the language production system. The trade-off assumption was assessed by investigating the impact of associations on naming latencies in the continuous naming paradigm, in which semantically related items are named within a seemingly random sequence (Study 1). Information on the influence of lexical cohort activation on word production was obtained by manipulating semantic distance in the continuous naming paradigm combined with event-related potentials (ERPs; Study 2). To test the flexibility assumption, effects of unrelated meaning alternatives of homophones were investigated in a picture-word interference (PWI) paradigm after participants had repeatedly processed linguistic ambiguities (Study 3). Results show semantic interference for associates and for closely related category coordinates in the continuous naming paradigm (Studies 1 and 2), and facilitation effects for homophone names in the PWI paradigm after the cognitive system had adapted to the processing of linguistic ambiguities (Study 3). Closely related stimuli modulated ERPs in the P1, between 250 and 400 ms, and in the N400 time window, which are known to be associated with single-word naming processes. These results support the SLN model and enhance the understanding of the semantic and cognitive factors that shape the microstructure of language production.
|
10 |
Measuring Semantic Distance using Distributional Profiles of Concepts
Mohammad, Saif 01 August 2008
Semantic distance is a measure of how close or distant in meaning two units of language are. A large number of important natural language problems, including machine translation and word sense disambiguation, can be viewed as semantic distance problems.
The two dominant approaches to estimating semantic distance are the WordNet-based semantic measures and the corpus-based distributional measures. In this thesis, I compare them, both qualitatively and quantitatively, and identify the limitations of each.
This thesis argues that semantic distance is essentially a property of concepts (rather than words) and that two concepts are semantically close if they occur in similar contexts.
Instead of identifying the co-occurrence (distributional) profiles of words (distributional hypothesis), I argue that distributional profiles of concepts (DPCs) can be used to infer the semantic properties of concepts and indeed to estimate semantic distance more accurately. I propose a new hybrid approach to calculating semantic distance that combines corpus statistics and a published thesaurus (Macquarie Thesaurus).
The algorithm determines estimates of the DPCs using the categories in the thesaurus as very coarse concepts and, notably, without requiring any sense-annotated data. Even though the use of only about 1000 concepts to represent the vocabulary of a language seems drastic, I show that the method achieves results better than the state-of-the-art in a number of natural language tasks.
I show how cross-lingual DPCs can be created by combining text in one language with a thesaurus from another. Using these cross-lingual DPCs, we can solve problems in one, possibly resource-poor, language using a knowledge source from another, possibly resource-rich, language. I show that the approach is also useful in tasks that inherently involve two or more languages, such as machine translation and multilingual text summarization.
The proposed approach is computationally inexpensive, it can estimate both semantic relatedness and semantic similarity, and it can be applied to all parts of speech. Extensive experiments on ranking word pairs as per semantic distance, real-word spelling correction, solving Reader's Digest word choice problems, determining word sense dominance, word sense disambiguation, and word translation show that the new approach is markedly superior to previous ones.
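A rough sketch of what a distributional profile of a concept might look like in code is given below. It is a deliberately naive approximation and not the thesis's algorithm: every thesaurus category of a word is credited with that word's sentence-level co-occurrences, and profiles are compared with cosine similarity. The category names, the word-to-category mapping, and the sentences are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_profiles(sentences, categories_of):
    """Credit each category of a word with the words co-occurring in its sentence."""
    profiles = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i, target in enumerate(words):
            for category in categories_of.get(target, ()):
                for j, other in enumerate(words):
                    if j != i:
                        profiles[category][other] += 1
    return profiles


def cosine(p, q):
    """Cosine similarity between two co-occurrence profiles."""
    dot = sum(p[w] * q[w] for w in p if w in q)
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(
        sum(v * v for v in q.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical coarse thesaurus categories; "bank" is ambiguous between two.
categories_of = {
    "bank": ["FINANCE", "RIVERSIDE"],
    "lender": ["FINANCE"],
    "shore": ["RIVERSIDE"],
}
sentences = [
    "bank approved loan",
    "lender approved loan",
    "shore flooded overnight",
]

profiles = build_profiles(sentences, categories_of)
print(profiles["FINANCE"]["loan"])  # 2
```

Because no sense-annotated data is used, the ambiguous word "bank" contributes its contexts to both categories; the counts from unambiguous words such as "lender" are what pull the FINANCE profile toward financial contexts.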
|