Global ETD Search

21	Cross-lingual Information Retrieval On Turkish And English Texts Boynuegri, Akif 01 April 2010 (has links) (PDF) In this thesis, cross-lingual information retrieval (CLIR) approaches are comparatively evaluated for Turkish and English texts. As a complementary study, knowledge-based methods for word sense disambiguation (WSD), one of the most important parts of CLIR studies, are compared for Turkish words. Two CLIR approaches are used in this study: query translation and sense indexing. In the query translation approach, we use automatic and manual word sense disambiguation methods and the Google translation service to translate queries. In the sense indexing approach, documents are indexed according to the meanings of words instead of the words themselves, and documents are likewise retrieved according to the meanings of the query words. To identify the intended meaning of query terms, manual and automatic word sense disambiguation methods are used and compared with each other. Knowledge-based WSD methods that use different gloss enrichment techniques are compared for Turkish words. Turkish WordNet is used as the primary knowledge base, and English WordNet and Turkish Wikipedia are employed as enrichment resources. Meanings of words are identified more clearly by using the semantic relations defined in the WordNets and Turkish Wikipedia. In addition, the cosine similarity metric is used as an alternative to word overlap count when calculating the semantic relatedness of senses. The effects of using the cosine similarity metric are observed for each WSD method that uses a different knowledge base.
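As a rough illustration of the two relatedness measures named in the abstract above, the following sketch scores candidate sense glosses against a context with a word overlap count and, alternatively, with cosine similarity over bag-of-words vectors. The glosses, the function names, and the whitespace tokenization are assumptions made for illustration; none of them come from the thesis, which works with Turkish WordNet and enriched glosses.

```python
import math
from collections import Counter


def tokens(text):
    """Lowercase whitespace tokenization (a simplifying assumption)."""
    return text.lower().split()


def overlap_count(context, gloss):
    """Number of distinct words shared by the context and the gloss."""
    return len(set(tokens(context)) & set(tokens(gloss)))


def cosine_similarity(context, gloss):
    """Cosine of the angle between bag-of-words count vectors."""
    a, b = Counter(tokens(context)), Counter(tokens(gloss))
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_sense(context, glosses, score):
    """Return the sense id whose gloss scores highest against the context."""
    return max(glosses, key=lambda sense: score(context, glosses[sense]))


# Hypothetical glosses for the English word "bank".
GLOSSES = {
    "bank_finance": "an institution that accepts deposits and lends money",
    "bank_river": "sloping land beside a body of water such as a river",
}

CONTEXT = "she deposits money at the bank"
print(best_sense(CONTEXT, GLOSSES, overlap_count))
print(best_sense(CONTEXT, GLOSSES, cosine_similarity))
```

Unlike the raw overlap count, the cosine score is normalized by vector length, so a long enriched gloss does not win merely by containing more words.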
22	Word meaning in context as a paraphrase distribution : evidence, learning, and inference Moon, Taesun, Ph. D. 25 October 2011 (has links) In this dissertation, we introduce a graph-based model of instance-based, usage meaning that is cast as a problem of probabilistic inference. The main aim of this model is to provide a flexible platform that can be used to explore multiple hypotheses about usage meaning computation. Our model takes up and extends the proposals of Erk and Pado [2007] and McCarthy and Navigli [2009] by representing usage meaning as a probability distribution over potential paraphrases. We use undirected graphical models to infer this probability distribution for every content word in a given sentence. Graphical models represent complex probability distributions through a graph. In the graph, nodes stand for random variables, and edges stand for direct probabilistic interactions between them. The lack of an edge between two variables reflects an independence assumption. In our model, we represent each content word of the sentence through two adjacent nodes: the observed node represents the surface form of the word itself, and the hidden node represents its usage meaning. The distribution over values that we infer for the hidden node is a paraphrase distribution for the observed word. To encode the fact that lexical semantic information is exchanged between syntactic neighbors, the graph contains edges that mirror the dependency graph for the sentence. Further knowledge sources that influence the hidden nodes are represented through additional edges that, for example, connect to document topic. The integration of adjacent knowledge sources is accomplished in a standard way by multiplying factors and marginalizing over variables.
Evaluating on a paraphrasing task, we find that our model outperforms the current state-of-the-art usage vector model [Thater et al., 2010] on all parts of speech except verbs, where the previous model wins by a small margin. But our main focus is not on the numbers but on the fact that our model is flexible enough to encode different hypotheses about usage meaning computation. In particular, we concentrate on five questions (with minor variants):
- Nonlocal syntactic context: Existing usage vector models only use a word's direct syntactic neighbors for disambiguation or inferring some other meaning representation. Would it help to have contextual information instead "flow" along the entire dependency graph, each word's inferred meaning relying on the paraphrase distribution of its neighbors?
- Influence of collocational information: In some cases, it is intuitively plausible to use the selectional preference of a neighboring word towards the target to determine its meaning in context. How does building selectional preferences into the model affect performance?
- Non-syntactic bag-of-words context: To what extent can non-syntactic information in the form of bag-of-words context help in inferring meaning?
- Effects of parametrization: We experiment with two transformations of MLE. One interpolates various MLEs and the other transforms it by exponentiating pointwise mutual information. Which performs better?
- Type of hidden nodes: Our model posits a tier of hidden nodes immediately adjacent to the surface tier of observed words to capture dynamic usage meaning. We examine two variants of the model: in one, the hidden nodes take actual words as values; in the other, they take nameless indexes as values. The former has the benefit of interpretability while the latter allows more standard parameter estimation.
Portions of this dissertation are derived from joint work between the author and Katrin Erk [submitted].
Computational linguistics Lexical semantics Probabilistic graphical models Natural language processing Word sense disambiguation Paraphrasing
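The phrase "multiplying factors and marginalizing over variables" in the abstract above can be made concrete with a toy example. The sketch below computes paraphrase distributions for two hidden nodes by brute-force enumeration of a small pairwise model. The function name, the candidate paraphrases, and every factor value are invented for illustration; the dissertation's actual factors, parametrization, and inference procedure are not given in the abstract.

```python
from itertools import product


def marginals(unary, pairwise):
    """Brute-force marginals for the hidden nodes of a small pairwise model.

    unary: list of dicts, unary[i][value] -> factor score for hidden node i
    pairwise: dict mapping (i, j) -> {(value_i, value_j): factor score};
              value pairs missing from a table get the neutral score 1.0
    """
    domains = [list(u) for u in unary]
    result = [dict.fromkeys(d, 0.0) for d in domains]
    total = 0.0
    for assignment in product(*domains):
        # Multiply all factors that touch this joint assignment.
        weight = 1.0
        for i, value in enumerate(assignment):
            weight *= unary[i][value]
        for (i, j), table in pairwise.items():
            weight *= table.get((assignment[i], assignment[j]), 1.0)
        total += weight
        # Marginalize: add the joint weight to each node's chosen value.
        for i, value in enumerate(assignment):
            result[i][value] += weight
    return [{v: w / total for v, w in node.items()} for node in result]


# Hypothetical candidates: node 0 is the verb "charge", node 1 its object "customer".
UNARY = [{"accuse": 0.5, "bill": 0.5}, {"client": 0.6, "patron": 0.4}]
# The dependency edge favors the "bill" reading next to either paraphrase of "customer".
PAIRWISE = {(0, 1): {("bill", "client"): 3.0, ("bill", "patron"): 3.0}}

print(marginals(UNARY, PAIRWISE))
```

Enumeration is exponential in the number of hidden nodes, so it only serves to show what the marginal means; a real sentence-sized graph would need an inference algorithm such as belief propagation.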
23	Using web texts for word sense disambiguation Wang, Yuanyong, Computer Science & Engineering, Faculty of Engineering, UNSW January 2007 (has links) In all natural languages, ambiguity is a universal phenomenon. When a word has multiple meanings depending on its context, it is called an ambiguous word. The process of determining the correct meaning of a word (formally, its word sense) in a given context is word sense disambiguation (WSD). WSD is one of the most fundamental problems in natural language processing. If properly addressed, it could lead to revolutionary advances in many other technologies such as text search engines, automatic text summarization and classification, automatic lexicon construction, machine translation, and automatic learning agents. One difficulty that has always confronted WSD researchers is the lack of high-quality sense-specific information. For example, if the word "power" immediately precedes the word "plant", it strongly constrains the meaning of "plant" to "an industrial facility". If "power" is replaced by the phrase "root of a", then the sense of "plant" is dictated to be "an organism" of the kingdom Plantae. It is obvious that manually building a comprehensive sense-specific information base for each sense of each word is impractical. Researchers have also tried to extract such information from large dictionaries as well as from manually sense-tagged corpora. Most of the dictionaries used for WSD were not built for this purpose and have a lot of inherited peculiarities. While manual tagging is slow and costly, automatic tagging has not delivered reliable performance. Furthermore, it is often the case that for a randomly chosen word (to be disambiguated), the sense-specific context corpora that can be collected from dictionaries are not large enough.
Therefore, neither manually building sense-specific information bases nor extracting such information from dictionaries is an effective way to obtain sense-specific information. Web text, owing to its vast quantity and wide diversity, is an ideal source for extracting large quantities of sense-specific information. In this thesis, the impact of Web texts on various aspects of WSD has been investigated. New measures and models are proposed to tame the enormous amount of Web text for the purpose of WSD. They are formally evaluated by testing their disambiguation performance on about 70 ambiguous nouns. The results are very encouraging and have helped reveal the great potential of using Web texts for WSD. The results are published in three papers at Australian national and international level (Wang & Hoffmann, 2004, 2005, 2006) [42][43][44]. Search engine. Word sense disambiguation. Ambiguity. Semantics -- Data processing. Computational linguistics.
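The "power plant" versus "root of a plant" example above suggests how sense-specific context words can drive disambiguation. The sketch below is only a stand-in: the abstract does not describe the thesis's own measures and models, so a plain add-one-smoothed word-count score is used instead, and the four snippets are invented stand-ins for text that would be collected from the Web.

```python
import math
from collections import Counter, defaultdict


def train(snippets):
    """snippets: iterable of (sense, text). Returns per-sense word counts."""
    counts = defaultdict(Counter)
    for sense, text in snippets:
        counts[sense].update(text.lower().split())
    return counts


def disambiguate(context, counts):
    """Pick the sense whose snippet vocabulary best explains the context."""
    vocab = {w for c in counts.values() for w in c}

    def score(sense):
        total = sum(counts[sense].values())
        # Add-one smoothing keeps unseen words from ruling a sense out.
        return sum(math.log((counts[sense][w] + 1) / (total + len(vocab)))
                   for w in context.lower().split())

    return max(counts, key=score)


# Hypothetical sense-labelled snippets for the noun "plant".
SNIPPETS = [
    ("facility", "the power plant supplies electricity to the city"),
    ("facility", "workers at the power plant went on strike"),
    ("organism", "the root of a plant absorbs water from the soil"),
    ("organism", "water the plant when the soil is dry"),
]

model = train(SNIPPETS)
print(disambiguate("power plant", model))
print(disambiguate("root of a plant", model))
```

With only a handful of snippets the counts are fragile, which is the data-sparsity problem the abstract attributes to dictionary-derived context corpora.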
24	Interopérabilité Sémantique Multi-lingue des Ressources Lexicales en Données Liées Ouvertes / Semantic Interoperability of Multilingual Lexical Resources in Lexical Linked Data Tchechmedjiev, Andon 14 October 2016 (has links) When it comes to the construction of multilingual lexico-semantic resources, the first thing that comes to mind is that the resources to be aligned should share the same data model and format (representational interoperability). With the emergence of standards such as LMF, and their adaptation to the Semantic Web for the production of resources as lexical linked data (Ontolex), representational interoperability has ceased to be a major challenge for the production of large-scale multilingual resources. However, as far as the interoperability of sense-level multilingual alignments is concerned, a major challenge is the choice and construction of a suitable interlingual pivot. Many resources (e.g. BabelNet, EuroWordNet) use English, or another language, as the pivot, although this choice leads to a loss of contrast between pivot senses that are lexicalized with different words in other languages. The use of acception-based interlingual representations, a solution proposed over 20 years ago, could be viable. However, the manual construction of such language-independent pivot representations is very difficult due to the lack of experts speaking enough languages fluently, and algorithms for their automatic construction have never materialized, mainly because of the lack of a formal axiomatic characterization that ensures the preservation of their correctness properties.
In this thesis, we address this issue by first formalizing acception-based interlingual pivot architectures through a set of axiomatic constraints and rules that guarantee their correctness. We then propose algorithms for the initial construction and the update (dynamic interoperability) of interlingual acception-based multilingual resources by exploiting the combinatorial properties of pairwise bilingual translation graphs. Finally, we study the practical considerations of applying our construction algorithms to a tangible resource, DBNary, which is periodically extracted as lexical linked data from many language editions of Wiktionary. Multilingual Word Sense Disambiguation Interoperability Multilingual Lexical Resources 004
25	Sémantická informace ze sítě FrameNet a možnosti jejího využití pro česká data / Semantic information from FrameNet and the possibility of its transfer to Czech data Limburská, Adéla January 2016 (has links) The thesis focuses on transferring FrameNet annotation from English to Czech and the possibilities of using the resulting data for automatic frame prediction in Czech. The first part, annotation transfer, has been performed in two ways. First, a parallel corpus of English sentences and their human-created Czech translations (PCEDT) was used. Second, a much larger parallel corpus was created using machine translation of FrameNet example sentences. This corpus was then used to transfer the annotation as well. The resulting data were partially evaluated and some of the automatically detectable errors were filtered out. Subsequently, the data were used as an input for two machine learning methods, decision trees and support vector machines. Since neither of the machine learning experiments brought impressive results, further manual correction of the data annotation was performed, which helped increase the accuracy of the prediction. However, as the accuracy reported in related papers is notably higher, the thesis also discusses different approaches to feature selection and the possibility of further improvement of the prediction results using these methods.
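One common way to transfer annotation across a parallel corpus is to copy labels along word alignment links. The abstract above does not say how the thesis performs the transfer, so the following is only a sketch under that assumption; the function name, the token indices, and the alignment are invented for illustration.

```python
def project_frames(source_frames, alignment):
    """Copy frame labels from source tokens to aligned target tokens.

    source_frames: dict {source_token_index: frame_name}
    alignment: iterable of (source_index, target_index) pairs
    Returns dict {target_token_index: frame_name}. A target token aligned to
    several differently labelled source tokens is dropped as ambiguous.
    """
    candidates = {}
    for src, tgt in alignment:
        if src in source_frames:
            candidates.setdefault(tgt, set()).add(source_frames[src])
    return {tgt: next(iter(frames))
            for tgt, frames in candidates.items() if len(frames) == 1}


# English: "John bought a car"  ->  Czech: "Jan koupil auto"
SOURCE_FRAMES = {1: "Commerce_buy"}              # "bought" evokes the frame
ALIGNMENT = [(0, 0), (1, 1), (2, 2), (3, 2)]     # hypothetical word alignment

print(project_frames(SOURCE_FRAMES, ALIGNMENT))
```

Dropping ambiguous targets is one simple example of filtering out automatically detectable errors; other filters are equally possible.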
26	A Feature Structure Approach for Disambiguating Preposition Senses Baglodi, Venkatesh 01 January 2009 (has links) Word Sense Disambiguation (WSD) continues to be an open research problem in spite of recent advances in the NLP field, especially in machine learning. WSD for open-class words is well understood. However, WSD for closed class structural words (such as prepositions) is not so well resolved, and their role in frame semantics seems to be a relatively unknown area. This research uses a new method to disambiguate preposition senses by using a combined lookup from FrameNet and TPP databases. Motivated by recent work by Popescu, Tonelli, & Pianta (2007), it extends the concept to provide a deterministic WSD of prepositions using the lexical information drawn from the sentences in a local context. While the primary goal of the research is to disambiguate preposition sense, the approach also assigns frames and roles to different sentence elements. The use of prepositions for frame and role assignment seems to be a largely unexplored area which could provide a new dimension to research in lexical semantics. knowledge-based methods natural language processing NLP preposition sense disambiguation word sense disambiguation WSD Computer Sciences
27	Graph-based Centrality Algorithms for Unsupervised Word Sense Disambiguation Sinha, Ravi Som 12 1900 (has links) This thesis introduces an innovative methodology that combines some traditional dictionary-based approaches to word sense disambiguation (semantic similarity measures and overlap of word glosses, both based on WordNet) with some graph-based centrality methods, namely the degree of the vertices, PageRank, closeness, and betweenness. The approach is completely unsupervised, and is based on creating graphs for the words to be disambiguated. We experiment with several possible combinations of the semantic similarity measures as the first stage in our experiments. The next stage attempts to score individual vertices in the previously created graphs using several graph connectivity measures. During the final stage, several voting schemes are applied to the results obtained from the different centrality algorithms. The most important contributions of this work are not only that it is a novel approach and it works well, but also that it has great potential in overcoming the new-knowledge-acquisition bottleneck which has apparently brought research in supervised WSD as an explicit application to a plateau. The type of research reported in this thesis, which does not require manually annotated data, holds considerable promise, and our work is one of the first steps, albeit a small one, in this direction. The complete system is built and tested on standard benchmarks, and is comparable with work done on graph-based word sense disambiguation as well as lexical chains. The evaluation indicates that the right combination of the above-mentioned metrics can be used to develop an unsupervised disambiguation engine as powerful as the state of the art in WSD. Word sense measures of semantic similarity graph centrality algorithms disambiguation Computational linguistics. Semantics -- Data processing. Discourse analysis.
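The pipeline described above (build a graph over candidate senses, score vertices by centrality, pick the top sense per word) can be sketched with the simplest of the listed measures, vertex degree. The version below uses gloss overlap as the edge weight and weighted degree as the score. The glosses are invented rather than taken from WordNet, and the voting stage and the other centrality measures are omitted.

```python
from itertools import combinations


def gloss_overlap(g1, g2):
    """Number of distinct words two glosses share."""
    return len(set(g1.lower().split()) & set(g2.lower().split()))


def disambiguate_by_degree(sense_glosses):
    """sense_glosses: dict {word: {sense_id: gloss}}.

    Builds a graph whose vertices are (word, sense) pairs, links senses of
    different words with a weight equal to their gloss overlap, and picks
    for each word the sense with the highest weighted degree.
    """
    degree = {(w, s): 0 for w, senses in sense_glosses.items() for s in senses}
    for (w1, s1), (w2, s2) in combinations(degree, 2):
        if w1 == w2:
            continue  # senses of the same word compete; they are not linked
        weight = gloss_overlap(sense_glosses[w1][s1], sense_glosses[w2][s2])
        degree[(w1, s1)] += weight
        degree[(w2, s2)] += weight
    return {w: max(senses, key=lambda s: degree[(w, s)])
            for w, senses in sense_glosses.items()}


# Hypothetical glosses for three words occurring in the same sentence.
SENSES = {
    "bank": {
        "finance": "financial institution that accepts money deposits",
        "river": "sloping land beside a river",
    },
    "deposit": {
        "money": "money placed in a financial institution",
        "sediment": "matter that settles on land or a river bed",
    },
    "loan": {"credit": "money lent by a financial institution at interest"},
}

print(disambiguate_by_degree(SENSES))
```

Function words such as "a" and "that" contribute to the overlap here; a fuller implementation would likely remove stop words or use a WordNet-based similarity measure as the abstract describes.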
28	Création automatique d'un dictionnaire des régimes des verbes du français Hassert, Naïma 06 1900 (has links) Valency dictionaries are useful for many tasks in natural language processing. However, high-quality dictionaries of this type are created at least in part manually; they are therefore resource-intensive and difficult to update. In addition, many of these resources do not take into account the different senses of lemmas, which matter because the arguments selected tend to vary according to the sense of the verb. In this thesis, we automatically create a French verb valency dictionary that takes polysemy into account. We extract 20,000 example sentences for each of the 2,000 most frequent French verbs. We then obtain contextual word embeddings of these verbs using one monolingual and two multilingual language models. Next, we use clustering algorithms to induce the different senses of these verbs. Finally, we automatically parse the sentences using different syntactic parsers to find the verbs' arguments. We determine that the combination of the French language model CamemBERT and an agglomerative clustering algorithm offers the best results in the sense induction task (B3 F1 of 58.19%), and that for syntactic parsing, Stanza is the tool with the best performance (F1 of 83.29%). By filtering the syntactic frames obtained using maximum likelihood estimation, a very simple statistical method for finding the most likely parameters of a probability model that explains our data, we build a valency dictionary that almost completely dispenses with human intervention.
Our procedure is used here for French, but can be used for any other language for which sufficient written data exists. word sense induction valency computational lexicography Linguistics (UMI: 0290)
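The frame-filtering step can be illustrated briefly. For a categorical model over frames, the maximum likelihood estimate of a frame's probability is simply its relative frequency, so one plausible filter keeps frames whose estimate reaches a cutoff. The cutoff value, the function name, and the frame labels below are assumptions; the abstract does not state how the estimate is turned into a filter.

```python
from collections import Counter


def filter_frames(observed_frames, threshold=0.1):
    """Keep frames whose relative frequency reaches the threshold.

    observed_frames: list of frame labels seen for one verb sense.
    The relative frequency count/total is the maximum likelihood estimate
    of a frame's probability under a categorical model.
    """
    counts = Counter(observed_frames)
    total = sum(counts.values())
    return {frame: n / total
            for frame, n in counts.items() if n / total >= threshold}


# Hypothetical parser output for one induced sense of "donner" (to give).
OBSERVED = ["SUBJ OBJ A-OBJ"] * 14 + ["SUBJ OBJ"] * 5 + ["SUBJ"] * 1

print(filter_frames(OBSERVED))
```

In this toy data the bare "SUBJ" frame, seen once in twenty sentences, falls below the cutoff and is treated as parser noise.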
29	Translation of keywords between English and Swedish / Översättning av nyckelord mellan engelska och svenska Ahmady, Tobias, Klein Rosmar, Sander January 2014 (has links) In this project, we have investigated how to perform rule-based machine translation of sets of keywords between two languages. The goal was to translate an input set, which contains one or more keywords in a source language, to a corresponding set of keywords, with the same number of elements, in the target language. However, some words in the source language may have several senses and may be translated to several, or no, words in the target language. If ambiguous translations occur, the best translation of the keyword should be chosen with respect to the context. In traditional machine translation, a word's context is determined by the phrase or sentence in which the word occurs. In this project, the set of keywords represents the context. By investigating traditional approaches to machine translation (MT), we designed and described models for the specific purpose of keyword translation. We have proposed a solution, based on direct translation, for translating keywords between English and Swedish. In the proposed solution, we also introduced a simple graph-based model for resolving ambiguous translations.
machine translation MT rule-based machine translation RBMT word sense disambiguation WSD translation disambiguation translation keyword translation ambiguous translations disambiguation
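Using the keyword set itself as context can be sketched as a small search problem: pick one candidate translation per keyword so that the chosen translations are as related to one another as possible. The abstract does not specify the project's graph-based model, so the exhaustive search, the function name, the candidate lists, and the relatedness scores below are all illustrative assumptions.

```python
from itertools import combinations, product


def translate_keywords(keywords, candidates, relatedness):
    """Choose one translation per keyword so the chosen set is most coherent.

    candidates: dict {source_keyword: [target candidates]}
    relatedness: dict {frozenset({t1, t2}): score}; missing pairs count as 0
    """
    options = [candidates[k] for k in keywords]

    def coherence(choice):
        return sum(relatedness.get(frozenset(pair), 0.0)
                   for pair in combinations(choice, 2))

    best = max(product(*options), key=coherence)
    return dict(zip(keywords, best))


# English -> Swedish: "spring" may be a season, a water source, or a coil.
CANDIDATES = {
    "spring": ["vår", "källa", "fjäder"],
    "summer": ["sommar"],
    "autumn": ["höst"],
}
# Hypothetical edge weights between target-language words.
RELATEDNESS = {
    frozenset({"vår", "sommar"}): 1.0,
    frozenset({"vår", "höst"}): 1.0,
    frozenset({"sommar", "höst"}): 1.0,
}

print(translate_keywords(["spring", "summer", "autumn"], CANDIDATES, RELATEDNESS))
```

The output set has the same number of elements as the input set, as the project requires; keywords with no translation at all would need separate handling that this sketch leaves out.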
30	Desambiguação automática de substantivos em corpus do português brasileiro / Word sense disambiguation in Brazilian Portuguese corpus Silva, Viviane Santos da 19 August 2016 (has links) The phenomenon of lexical ambiguity was the central topic of this research, especially with regard to the relations between the meanings of ambiguous graphic forms and to the distribution patterns of the meanings of polysemous words in the language, that is, of words whose meanings are semantically related. This work sits at the interface between computational explorations of lexical ambiguity, specifically in natural language processing, and theoretical investigations of the phenomenon of lexical meaning.
We take polysemy and homonymy to correspond, respectively, to the case of one word with multiple related meanings and to the case of two (or more) words whose graphic forms coincide but whose meanings are synchronically unrelated. The ultimate goal of this study was to confirm whether the most polysemous words have meanings that are less evenly distributed in the corpus, with predominant meanings that occur more frequently. To examine these aspects, we implemented a word sense disambiguation algorithm, an adapted version of the Lesk algorithm (Lesk, 1986; Jurafsky & Martin, 2000), chosen on the basis of the language resources available for Portuguese. Starting from the hypothesis that the most frequent words in the language tend to be more polysemous, we selected from the corpus (Mac-Morpho) the words with the highest number of occurrences. Given our interest in content words and in ambiguity at the strictly semantic level, we decided to run the tests presented in this research only on nouns. The results obtained with the disambiguation algorithm we implemented surpassed those of the baseline method based on the most-frequent-sense heuristic: we obtained 63% accuracy against 50% for the baseline over all the disambiguated data. These results were obtained with the procedure of disambiguating pseudowords (formed at random), which is used when semantically annotated corpora are not available. However, because this disambiguation method depends on fixed inventories of meanings taken from dictionaries, we searched for alternative ways of categorizing the meanings of a word. Based on the work of Sproat & VanSanten (2001), we implemented a method for assigning numerical values that indicate how far a word has moved away from monosemy within a given corpus. This measure, named the polysemy index by the authors of the original work, is based on grouping the words that co-occur with the target word according to their contextual similarities.
In this work we proposed the use of a second measure, mentioned by the authors only as an example of potential applications of the method still to be explored: the clustering of co-occurring words based on the similarity of their contexts of use. This second measure is obtained in such a way that it shows both the closeness between meanings and the number of meanings that a word displays in the corpus. Some aspects of the results indicate the potential of the clustering method: the clusters of co-occurring words are weighted, highlighting the most prominent groups of neighbors of the target word, and the clusters are positioned relative to one another by contextual similarity measures, which can serve to distinguish homonymic from polysemic tendencies. As an example, we have the clusters obtained for the word produção (production): one referring to the idea of literary production, and the other to that of agricultural production. These two clusters were considerably distant from each other, standing in the range of what would be considered a case of polysemy, and both had significant weights, that is, they were composed of the more relevant words. We identified three main factors that limited the analysis of the data: the political-journalistic bias of the corpus we used (Mac-Morpho); the need for further tests that vary the parameters for selecting co-occurring words, since the parameters we used would have to change for other corpora; and, especially, the fact that we ran only a few tests to set the values of these parameters, which are decisive for the number of co-occurring words that are relevant to the contexts of use of the target word.
Considering both the advantages and the limitations we observed in the clustering results, we plan to design a synchronic method (one that dispenses with the historical documentation of words) and computational method that distinguishes cases of polysemy and homonymy more systematically and covers a larger amount of data. We understand that a method of this nature can be of great value for studies of meaning at the lexical level, allowing the establishment of an objective method based on language usage data that goes beyond isolated examples. Computational Linguistics Polysemy index Polysemy measures Word context clustering Word sense Disambiguation Word senses clusterization
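The pseudoword procedure mentioned above creates evaluation data without manual sense annotation: two unrelated words are merged into one artificial ambiguous token, and the original word serves as the gold "sense". The sketch below builds such data and computes the most-frequent-sense baseline against which a disambiguator would be compared. The function names, the word pair, and the sentences are invented for illustration and are not taken from the thesis or from Mac-Morpho.

```python
from collections import Counter


def make_pseudoword_data(sentences, word_a, word_b):
    """Replace word_a/word_b with one pseudoword, keeping the original as label."""
    pseudo = f"{word_a}_{word_b}"
    data = []
    for sentence in sentences:
        toks = sentence.lower().split()
        present = [w for w in (word_a, word_b) if w in toks]
        if len(present) != 1:
            continue  # skip sentences with neither word, or with both
        original = present[0]
        data.append(([pseudo if t == original else t for t in toks], original))
    return data


def most_frequent_sense_accuracy(data):
    """Accuracy of always guessing the most common label (the baseline)."""
    labels = Counter(label for _, label in data)
    return labels.most_common(1)[0][1] / len(data) if data else 0.0


SENTENCES = [
    "o governo anunciou a medida",
    "o governo recuou ontem",
    "a chuva caiu forte",
    "nada a declarar",
]

DATA = make_pseudoword_data(SENTENCES, "governo", "chuva")
print(DATA[0])
print(most_frequent_sense_accuracy(DATA))
```

A disambiguation algorithm is then scored by how often it recovers the original word from the pseudoword's context, and that accuracy is compared with the baseline figure.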

Search results