11 |
A Feature Structure Approach for Disambiguating Preposition SensesBaglodi, Venkatesh 01 January 2009 (has links)
Word Sense Disambiguation (WSD) continues to be an open research problem in spite of recent advances in the NLP field, especially in machine learning. WSD for open-class words is well understood. However, WSD for closed class structural words (such as prepositions) is not so well resolved, and their role in frame semantics seems to be a relatively unknown area. This research uses a new method to disambiguate preposition senses by using a combined lookup from FrameNet and TPP databases. Motivated by recent work by Popescu, Tonelli, & Pianta (2007), it extends the concept to provide a deterministic WSD of prepositions using the lexical information drawn from the sentences in a local context. While the primary goal of the research is to disambiguate preposition sense, the approach also assigns frames and roles to different sentence elements. The use of prepositions for frame and role assignment seems to be a largely unexplored area which could provide a new dimension to research in lexical semantics.
|
12 |
Desambiguação lexical de sentidos para o português por meio de uma abordagem multilíngue mono e multidocumento / Word Sense Disambiguation for portuguese through multilingual mono and multi-documentFernando Antônio Asevêdo Nóbrega 28 May 2013 (has links)
A ambiguidade lexical é considerada uma das principais barreiras para melhoria de aplicações do Processamento de Língua Natural (PLN). Neste contexto, tem-se a área de Desambiguação Lexical de Sentido (DLS), cujo objetivo é desenvolver e avaliar métodos que determinem o sentido correto de uma palavra em um determinado contexto por meio de um conjunto finito de possíveis significados. A DLS é empregada, principalmente, no intuito de prover recursos e ferramentas para diminuir problemas de ambiguidade e, consequentemente, contribuir para melhorias de resultados em outras áreas do PLN. Para o Português do Brasil, pouco se tem pesquisado nesta área, havendo alguns trabalhos bem específicos de domínio. Outro fator importante é que diversas áreas do PLN engajam-se no cenário multidocumento, onde a computação é efetuada sobre uma coleção de textos, todavia, não há relato de trabalhos de DLS direcionados a este cenário, tampouco experimentos de desambiguação neste domínio. Portanto, neste trabalho de mestrado, objetivou-se o desenvolvimento de métodos de DLS de domínio geral voltado à língua Portuguesa do Brasil e o desenvolvimento de algoritmos de desambiguação que façam uso de informações multidocumento, bem como a experimentação e avaliação destes no cenário multidocumento. Para tanto, a fim de subsidiar experimentos, desenvolvimento e avaliação deste projeto, anotou-se manualmente o córpus CSTNews, caracterizado como um córpus multidocumento, utilizando a WordNet de Princeton como repositório de sentidos, que organiza os significados por meio de conjuntos de sinônimos ( synsets) e relações linguísticas entre estes. Foram desenvolvidos quatro métodos de DLS e algumas variações, sendo: um método heurístico (para aferir valores de baseline); variações do algoritmo de Lesk (1986); adaptação do algoritmo de Mihalcea and Moldovan (1999); e uma variação do método de Lesk para o cenário multidocumento. Foram realizados três experimentos para avaliação dos métodos, cujos objetivos foram: determinar o desempenho geral dos algoritmos em todo o córpus; avaliar a qualidade de desambiguação de palavras mais ambíguas no córpus; e verificar o ganho de qualidade da desambiguação ao empregar informação multidocumento. Após estes experimentos, pôde-se observar que o método heurístico apresenta um melhor resultado geral. Contudo, é importante ressaltar que a maioria das palavras anotadas no córpus tiveram apenas um synset, que, normalmente, era o mais frequente, o que, consequentemente, apresenta um cenário mais propício ao método heurístico. Outro fato importante foi que, neste cenário, a diferença de desempenho entre o método de DLS multidocumento e o heurístico é estatisticamente irrelevante. Já para a desambiguação de palavras mais ambíguas, o método heurístico foi inferior, evidenciando que, para a desambiguação de palavras mais ambíguas, são necessários métodos mais sofisticados de DLS. Por fim, verificou-se que a utilização de informação multidocumento auxilia o processo de desambiguação. As contribuições deste trabalho podem ser agrupadas entre teóricas e técnicas. Nas teóricas, tem-se a investigação e análises da DLS no cenário multidocumento. Entre as contribuições técnicas, foram desenvolvidos métodos de DLS, um córpus anotado e uma ferramenta de anotação direcionados à língua Portuguesa do Brasil, que podem avançar as pesquisas em DLS para o idioma / The lexical ambiguity is considered one of the main barries to improving applications of Natural Language Processing (NLP). In this context, it has benn the area of Word Sense Disambiguation (WSD), whose goal is to develop and evaluate methods to determine the correct sense of a word in a give context by a nite set of possible meanings. The DLS is used mainly in order to provide resources and tools to reduce problems of ambiguity and thus contribute to improved results in other areas of NLP. In the Portuguese of Brazil, little has been researched in this area, with some work and specic domain. Another important factor is that many areas of NLP commit themselves in multidocument scenario, where the computation is performed on a collection of texts, however, there is no report of WSD work directed to this scenario, either disambiguation experiments in this eld. Therefore, this master thesis aimed to develop methods of WSD general domain facing the Portuguese language in Brazil and the development of algorithms that make use of disambiguation multidocument informations, as well as experimentation and evaluation of the multidocument scenario. Therefore, in order to support experiments, development and evaluation of this project, the corpus CSTNews with 50 document collections, was manually annotated by means of synsets of the WordNet Princeton. Four methods were developed: A heuristic method (to measure values fo baseline); variations of the Lesk (1986) algorithm; a adaptation of the Mihalcea and Moldovan (1999) algorithm; and a variation of the Lesk method for multidocument scenario. Three experiments were conducted to evaluate the methods, whose objectives were to determine the general performance algorithms across the corpus; evaluate the quality of disambiguation of most ambiguous words in the corpus, and check the gain quality of disambiguation by employing information multidocumento. After these experiments, it was observed that the heuristic method presents a better overall result. However, it is important to note that most of the words in the annotated corpus had only one synset, which usually was the most frequent, which, in turn, presents a scenario more conducive to the heuristic method. Another important fact was that in this scenario, the performance dierence between the heuristic method and multidocument algorithm was statistically irrelevant. As for the disambiguation of most ambiguous words, the heuristic method was lower, indicating that, for the disambiguation of ambiguous words, more sophisticated WSD methods are needed. Finally, it has been found that the use of multidocument information assists the disambiguation process. The contributions of this work can be divided between theoretical and technical. In theory, there is the research and analysis of WSD in multidocument scenario. Among the techniques contributions, WSD methods have been developed an annotated corpus and annotation tool targeted to the Portuguese language in Brazil that can advance research in WSD for the language
|
13 |
On repairing sentences : an experimental and computational analysis of recovery from unexpected syntactic disambiguation in sentence parsingGreen, Matthew James January 2013 (has links)
This thesis contends that the human parser has a repair mechanism. It is further contended that the human parser uses this mechanism to alter previously built structure in the case of unexpected disambiguation of temporary syntactic ambiguity. This position stands in opposition to the claim that unexpected disambiguation of temporary syntactic ambiguity is accomplished by the usual first pass parsing routines, a claim that arises from the relatively extraordinary capabilities of computational parsers, capabilities which have recently been extended by hypothesis to be available to the human sentence processing mechanism. The thesis argues that, while these capabilities have been demonstrated in computational parsers, the human parser is best explained in the terms of a repair based framework, and that this argument is demonstrated by examining eye movement behaviour in reading. In support of the thesis, evidence is provided from a set of eye tracking studies of reading. It is argued that these studies show that eye movement behaviours at disambiguation include purposeful visual search for linguistically relevant material, and that the form and structure of these searches vary reliably according to the nature of the repairs that the sentences necessitate.
|
14 |
Dynamic topic adaptation for improved contextual modelling in statistical machine translationHasler, Eva Cornelia January 2015 (has links)
In recent years there has been an increased interest in domain adaptation techniques for statistical machine translation (SMT) to deal with the growing amount of data from different sources. Topic modelling techniques applied to SMT are closely related to the field of domain adaptation but more flexible in dealing with unstructured text. Topic models can capture latent structure in texts and are therefore particularly suitable for modelling structure in between and beyond corpus boundaries, which are often arbitrary. In this thesis, the main focus is on dynamic translation model adaptation to texts of unknown origin, which is a typical scenario for an online MT engine translating web documents. We introduce a new bilingual topic model for SMT that takes the entire document context into account and for the first time directly estimates topic-dependent phrase translation probabilities in a Bayesian fashion. We demonstrate our model’s ability to improve over several domain adaptation baselines and further provide evidence for the advantages of bilingual topic modelling for SMT over the more common monolingual topic modelling. We also show improved performance when deriving further adapted translation features from the same model which measure different aspects of topical relatedness. We introduce another new topic model for SMT which exploits the distributional nature of phrase pair meaning by modelling topic distributions over phrase pairs using their distributional profiles. Using this model, we explore combinations of local and global contextual information and demonstrate the usefulness of different levels of contextual information, which had not been previously examined for SMT. We also show that combining this model with a topic model trained at the document-level further improves performance. Our dynamic topic adaptation approach performs competitively in comparison with two supervised domain-adapted systems. Finally, we shed light on the relationship between domain adaptation and topic adaptation and propose to combine multi-domain adaptation and topic adaptation in a framework that entails automatic prediction of domain labels at the document level. We show that while each technique provides complementary benefits to the overall performance, there is an amount of overlap between domain and topic adaptation. This can be exploited to build systems that require less adaptation effort at runtime.
|
15 |
Semantic disambiguation using Distributional Semantics / Semantic disambiguation using Distributional SemanticsProdanovic, Srdjan January 2012 (has links)
Ve statistických modelů sémantiky jsou významy slov pouze na základě jejich distribuční vlastnosti.Základní zdroj je zde jeden slovník, který lze použít pro různé úkoly, kde se význam slov reprezentovány jako vektory v vektorového prostoru, a slovní podoby jako vzdálenosti mezi jejich vektorových osobnosti. Pomocí silných podobnosti, může vhodnost podmínek uvedených zejména v souvislosti se vypočítá a používá pro celou řadu úkolů, jeden z nich je slovo smysl Disambiguation. V této práci bylo vyšetřeno několik různých přístupů k modelům z vektorového prostoru a prováděny tak, aby k překročení vyhodnocení vlastního výkonu na Word Sense disambiguation úkolem Prague Dependency Treebank.
|
16 |
Improving Intent Classication By Automatic Data Augmentation Using Word Sense DisambiguationJanuary 2018 (has links)
abstract: Virtual digital assistants are automated software systems which assist humans by understanding natural languages such as English, either in voice or textual form. In recent times, a lot of digital applications have shifted towards providing a user experience using natural language interface. The change is brought up by the degree of ease with which the virtual digital assistants such as Google Assistant and Amazon Alexa can be integrated into your application. These assistants make use of a Natural Language Understanding (NLU) system which acts as an interface to translate unstructured natural language data into a structured form. Such an NLU system uses an intent finding algorithm which gives a high-level idea or meaning of a user query, termed as intent classification. The intent classification step identifies the action(s) that a user wants the assistant to perform. The intent classification step is followed by an entity recognition step in which the entities in the utterance are identified on which the intended action is performed. This step can be viewed as a sequence labeling task which maps an input word sequence into a corresponding sequence of slot labels. This step is also termed as slot filling.
In this thesis, we improve the intent classification and slot filling in the virtual voice agents by automatic data augmentation. Spoken Language Understanding systems face the issue of data sparsity. The reason behind this is that it is hard for a human-created training sample to represent all the patterns in the language. Due to the lack of relevant data, deep learning methods are unable to generalize the Spoken Language Understanding model. This thesis expounds a way to overcome the issue of data sparsity in deep learning approaches on Spoken Language Understanding tasks. Here we have described the limitations in the current intent classifiers and how the proposed algorithm uses existing knowledge bases to overcome those limitations. The method helps in creating a more robust intent classifier and slot filling system. / Dissertation/Thesis / Masters Thesis Computer Science 2018
|
17 |
Improved Cross-language Information Retrieval via Disambiguation and Vocabulary DiscoveryZhang, Ying, ying.yzhang@gmail.com January 2007 (has links)
Cross-lingual information retrieval (CLIR) allows people to find documents irrespective of the language used in the query or document. This thesis is concerned with the development of techniques to improve the effectiveness of Chinese-English CLIR. In Chinese-English CLIR, the accuracy of dictionary-based query translation is limited by two major factors: translation ambiguity and the presence of out-of-vocabulary (OOV) terms. We explore alternative methods for translation disambiguation, and demonstrate new techniques based on a Markov model and the use of web documents as a corpus to provide context for disambiguation. This simple disambiguation technique has proved to be extremely robust and successful. Queries that seek topical information typically contain OOV terms that may not be found in a translation dictionary, leading to inappropriate translations and consequent poor retrieval performance. Our novel OOV term translation method is based on the Chinese authorial practice of including unfamiliar English terms in both languages. It automatically extracts correct translations from the web and can be applied to both Chinese-English and English-Chinese CLIR. Our OOV translation technique does not rely on prior segmentation and is thus free from seg mentation error. It leads to a significant improvement in CLIR effectiveness and can also be used to improve Chinese segmentation accuracy. Good quality translation resources, especially bilingual dictionaries, are valuable resources for effective CLIR. We developed a system to facilitate construction of a large-scale translation lexicon of Chinese-English OOV terms using the web. Experimental results show that this method is reliable and of practical use in query translation. In addition, parallel corpora provide a rich source of translation information. We have also developed a system that uses multiple features to identify parallel texts via a k-nearest-neighbour classifier, to automatically collect high quality parallel Chinese-English corpora from the web. These two automatic web mining systems are highly reliable and easy to deploy. In this research, we provided new ways to acquire linguistic resources using multilingual content on the web. These linguistic resources not only improve the efficiency and effectiveness of Chinese-English cross-language web retrieval; but also have wider applications than CLIR.
|
18 |
Corpus-Based Techniques for Word Sense DisambiguationLevow, Gina-Anne 27 May 1998 (has links)
The need for robust and easily extensible systems for word sense disambiguation coupled with successes in training systems for a variety of tasks using large on-line corpora has led to extensive research into corpus-based statistical approaches to this problem. Promising results have been achieved by vector space representations of context, clustering combined with a semantic knowledge base, and decision lists based on collocational relations. We evaluate these techniques with respect to three important criteria: how their definition of context affects their ability to incorporate different types of disambiguating information, how they define similarity among senses, and how easily they can generalize to new senses. The strengths and weaknesses of these systems provide guidance for future systems which must capture and model a variety of disambiguating information, both syntactic and semantic.
|
19 |
Towards the Development of an Automatic Diacritizer for the Persian Orthography based on the Xerox Finite State TransducerNojoumian, Peyman 12 August 2011 (has links)
Due to the lack of short vowels or diacritics in Persian orthography, many Natural Language Processing applications for this language, including information retrieval, machine translation, text-to-speech, and automatic speech recognition systems need to disambiguate the input first, in order to be able to do further processing. In machine translation, for example, the whole text should be correctly diacritized first so that the correct words, parts of speech and meanings are matched and retrieved from the lexicon. This is primarily because of Persian’s ambiguous orthography. In fact, the core engine of any Persian language processor should utilize a diacritizer and a lexical disambiguator. This dissertation describes the design and implementation of an automatic diacritizer for Persian based on the state-of-the-art Finite State Transducer technology developed at Xerox by Beesley & Karttunen (2003). The result of morphological analysis and generation on a test corpus is shown, including the insertion of diacritics.
This study will also look at issues that are raised by phonological and semantic ambiguities as a result of short vowels in Persian being absent in the writing system. It suggests a hybrid model (rule-based & inductive) that is inspired by psycholinguistic experiments on the human mental lexicon for the disambiguation of heterophonic homographs in Persian using frequency and collocation information. A syntactic parser can be developed based on the proposed model to discover Ezafe (the linking short vowel /e/ within a noun phrase) or disambiguate homographs, but its implementation is left for future work.
|
20 |
Towards the Development of an Automatic Diacritizer for the Persian Orthography based on the Xerox Finite State TransducerNojoumian, Peyman 12 August 2011 (has links)
Due to the lack of short vowels or diacritics in Persian orthography, many Natural Language Processing applications for this language, including information retrieval, machine translation, text-to-speech, and automatic speech recognition systems need to disambiguate the input first, in order to be able to do further processing. In machine translation, for example, the whole text should be correctly diacritized first so that the correct words, parts of speech and meanings are matched and retrieved from the lexicon. This is primarily because of Persian’s ambiguous orthography. In fact, the core engine of any Persian language processor should utilize a diacritizer and a lexical disambiguator. This dissertation describes the design and implementation of an automatic diacritizer for Persian based on the state-of-the-art Finite State Transducer technology developed at Xerox by Beesley & Karttunen (2003). The result of morphological analysis and generation on a test corpus is shown, including the insertion of diacritics.
This study will also look at issues that are raised by phonological and semantic ambiguities as a result of short vowels in Persian being absent in the writing system. It suggests a hybrid model (rule-based & inductive) that is inspired by psycholinguistic experiments on the human mental lexicon for the disambiguation of heterophonic homographs in Persian using frequency and collocation information. A syntactic parser can be developed based on the proposed model to discover Ezafe (the linking short vowel /e/ within a noun phrase) or disambiguate homographs, but its implementation is left for future work.
|
Page generated in 0.1131 seconds