Toponym Disambiguation in Information Retrieval
Buscaldi, Davide 12 November 2010
In recent years, geography has acquired great importance in the context of
Information Retrieval (IR) and, in general, of the automated processing of
information in text. Mobile devices that are able to surf the web and at the
same time inform about their position are now a common reality, together
with applications that can exploit this data to provide users with locally
customised information, such as directions or advertisements. Therefore,
it is important to deal properly with the geographic information that is
included in electronic texts. Most of this information is
conveyed by place names, or toponyms.
Toponym ambiguity represents an important issue in Geographical Information
Retrieval (GIR), due to the fact that queries are geographically constrained.
There has been a struggle to find specific geographical IR methods
that actually outperform traditional IR techniques. Toponym ambiguity
may constitute a relevant factor in the inability of current GIR systems to
take advantage of geographical knowledge. Recently, some Ph.D. theses
have dealt with Toponym Disambiguation (TD) from different perspectives,
from the development of resources for the evaluation of Toponym Disambiguation
(Leidner (2007)) to the use of TD to improve geographical scope
resolution (Andogah (2010)). The Ph.D. thesis presented here introduces
a TD method based on WordNet and carries out a detailed study of the
relationship of Toponym Disambiguation to some IR applications, such as
GIR, Question Answering (QA) and Web retrieval.
The work presented in this thesis starts with an introduction to the applications
in which TD may prove useful, together with an analysis of the
ambiguity of toponyms in news collections. It would not be possible to
study the ambiguity of toponyms without studying the resources that are
used as placename repositories; these resources are the equivalent of language
dictionaries, which provide the different meanings of a given word.
Buscaldi, D. (2010). Toponym Disambiguation in Information Retrieval [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8912
Αλγόριθμοι και τεχνικές εξατομικευμένης αναζήτησης σε διαδικτυακά περιβάλλοντα με χρήση υποκείμενων σημασιολογιών / Algorithms and techniques for personalized search in web environments using underlying semantics
Πλέγας, Ιωάννης 06 December 2013
The tremendous growth of the Web in recent decades has made searching for information one of the most important issues in information technology research.
Today, modern search engines respond quite well to user queries, but the top results are not always relevant to what the user is looking for. Search engines therefore make significant efforts to place the results most relevant to the user at the top of the ranking list. This work mainly deals with this problem: ranking the results most relevant to the user in the highest positions, especially for queries whose terms have multiple meanings. In the context of this research, algorithms and techniques were built on relevance feedback to improve the results returned by a search engine. The main source of feedback is the results that users select while navigating. The user extends the original search information (the keywords) with new information derived from the results he or she chooses. Given this new set of information about the user's preferences, its semantic information is compared with the remaining results (those returned before the user chose the specific result), and the order of the results is changed by promoting and suggesting the results that are most relevant to the new set of information.
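The feedback loop described above can be illustrated with a minimal sketch. It is not the thesis's algorithm: the thesis compares semantic information, whereas this sketch substitutes plain bag-of-words cosine similarity, and the function names (`bag_of_words`, `rerank_after_click`) and the toy result snippets are assumptions made for illustration.

```python
from collections import Counter
import math

def bag_of_words(text):
    """Lowercased whitespace tokens with their counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rerank_after_click(query, results, clicked_index):
    """Expand the query with the clicked result, then reorder the other results."""
    # The user profile is the original keywords plus the text of the chosen result.
    profile = bag_of_words(query) + bag_of_words(results[clicked_index])
    remaining = [r for i, r in enumerate(results) if i != clicked_index]
    # sorted() is stable, so ties keep the engine's original order.
    return sorted(remaining, key=lambda r: cosine(profile, bag_of_words(r)), reverse=True)

# Hypothetical result snippets for the ambiguous query "jaguar".
results = [
    "jaguar car dealership prices",
    "jaguar habitat in the rainforest",
    "jaguar big cat diet and habitat",
    "jaguar car engine review",
]
for snippet in rerank_after_click("jaguar", results, clicked_index=1):
    print(snippet)
```

After the user clicks the rainforest result, the remaining animal-related snippet moves ahead of the two car-related ones, because it shares more vocabulary with the expanded profile.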
Another problem that must be addressed when users submit queries to a search engine is that queries are usually short and ambiguous. There must therefore be ways to disambiguate the different senses of the search terms and to find the sense that interests the user. Disambiguation of search terms has been studied in the literature in several different ways. This work proposes new strategies for disambiguating the senses of search terms and explores their effectiveness in search engines. Their novelty lies in the use of PageRank as an indicator of the importance of a sense for a query term.
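As a rough illustration of using PageRank as a sense-importance signal, the sketch below runs a standard power-iteration PageRank over a tiny hand-made link graph and picks the candidate sense with the highest score. The graph, the sense identifiers and the helper `most_important_sense` are hypothetical; the abstract does not say how the thesis builds its graph or combines PageRank with other evidence.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {node: [outgoing nodes]}."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for t in nodes:
                    new_rank[t] += damping * rank[node] / len(nodes)
        rank = new_rank
    return rank

def most_important_sense(candidate_senses, links):
    """Pick the candidate sense whose page has the highest PageRank (assumed heuristic)."""
    rank = pagerank(links)
    return max(candidate_senses, key=lambda s: rank.get(s, 0.0))

# Hypothetical link graph around two senses of the query term "java".
links = {
    "java_language": ["programming"],
    "java_island": ["indonesia"],
    "programming": ["java_language"],
    "compiler": ["java_language"],
    "jvm": ["java_language"],
    "indonesia": ["java_island"],
}
print(most_important_sense(["java_language", "java_island"], links))
```

In this toy graph the programming-language sense receives more incoming links and therefore the higher score; a real system would still need context from the query or the user to override the globally dominant sense.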
Another technique that exploits semantics in this work is text annotation, which assigns extra information to the words of a text, such as the meaning of each word given the semantic content of the text. Adding semantic information to a text helps search engines retrieve, and users find, the information they are looking for. This thesis presents techniques for improving the automatic annotation of short texts with entities from Wikipedia, a process referred to in the literature as Wikification.
It is widely known that the Web contains documents with the same information and documents with almost identical information. Despite the efforts of search engines to detect results that contain repeated information, there are still cases where the results retrieved by a search engine repeat one another. This work presents effective techniques that find and cut the repeated information from search engine results, using the semantic information of the results. Specifically, results that contain the same information are removed, and results that contain overlapping information are merged into new texts (SuperTexts) that carry the information of the initial results without the repetition.
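The remove-and-merge idea can be sketched as follows. This is only an approximation under stated assumptions: overlap is measured here with Jaccard similarity over normalised sentences rather than with the semantic information the thesis uses, and the threshold value and the function `merge_results` are illustrative choices, not the author's.

```python
def sentences(text):
    """Split on full stops and normalise case and whitespace."""
    return [s.strip().lower() for s in text.split(".") if s.strip()]

def jaccard(a, b):
    """Share of sentences two texts have in common."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def merge_results(results, threshold=0.3):
    """Drop identical results and merge overlapping ones into one combined text."""
    merged = []  # each entry is an ordered list of unique sentences
    for text in results:
        sents = sentences(text)
        for group in merged:
            overlap = jaccard(group, sents)
            if overlap == 1.0:
                break  # same information: discard the duplicate
            if overlap >= threshold:
                # Overlapping information: append only the unseen sentences.
                group += [s for s in dict.fromkeys(sents) if s not in group]
                break
        else:
            merged.append(list(dict.fromkeys(sents)))
    return [". ".join(group) + "." for group in merged]

# Hypothetical search results: one exact repeat, one partial overlap, one unrelated.
results = [
    "Paris is the capital of France. It lies on the Seine.",
    "Paris is the capital of France. It lies on the Seine.",
    "Paris is the capital of France. It hosts the Louvre.",
    "Rust is a systems language.",
]
for text in merge_results(results):
    print(text)
```

The second result is dropped as an exact repeat, the third is folded into the first as a combined text, and the unrelated result stays separate.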
Another part of this work tries to exploit the semantic information of search engine results using Semantic Web tools. The goal of the Semantic Web is to make the resources of the Web understandable to both humans and machines. In its first steps, the Semantic Web functioned as a detailed description of the body of Web documents. The development of tools for querying the Semantic Web is still in its infancy and, with a few exceptions, current search techniques are not adapted to indexing and retrieving semantic information. In our research we have created efficient techniques and tools for using the Semantic Web. Specifically, an algorithm was constructed that converts search engine results into an ontology, integrating their semantic and syntactic information, in order to answer natural language questions.
This thesis also presents XML filtering techniques that use semantic information. Specifically, an efficient distributed system is proposed for the semantic filtering of XML documents that gives better results than existing approaches.
Finally, this thesis includes additional research that improves the performance of search engines from a different angle: a technique for pruning the inverted lists of inverted files is presented. Combining the proposed technique with existing compression techniques for inverted files leads to better compression results than the existing ones.
Encyclopaedic question answering
Dornescu, Iustin January 2012
Open-domain question answering (QA) is an established NLP task which enables users to search for specific pieces of information in large collections of texts. Instead of using keyword-based queries and a standard information retrieval engine, QA systems allow the use of natural language questions and return the exact answer (or a list of plausible answers) with supporting snippets of text. In the past decade, open-domain QA research has been dominated by evaluation fora such as TREC and CLEF, where shallow techniques relying on information redundancy have achieved very good performance. However, this performance is generally limited to simple factoid and definition questions because the answer is usually explicitly present in the document collection. Current approaches are much less successful in finding implicit answers and are difficult to adapt to more complex question types which are likely to be posed by users. In order to advance the field of QA, this thesis proposes a shift in focus from simple factoid questions to encyclopaedic questions: list questions composed of several constraints. These questions have more than one correct answer which usually cannot be extracted from one small snippet of text. To correctly interpret the question, systems need to combine classic knowledge-based approaches with advanced NLP techniques. To find and extract answers, systems need to aggregate atomic facts from heterogeneous sources as opposed to simply relying on keyword-based similarity. Encyclopaedic questions promote QA systems which use basic reasoning, making them more robust and easier to extend with new types of constraints and new types of questions. A novel semantic architecture is proposed which represents a paradigm shift in open-domain QA system design, using semantic concepts and knowledge representation instead of words and information retrieval.
The architecture consists of two phases, analysis – responsible for interpreting questions and finding answers, and feedback – responsible for interacting with the user. This architecture provides the basis for EQUAL, a semantic QA system developed as part of the thesis, which uses Wikipedia as a source of world knowledge and employs simple forms of open-domain inference to answer encyclopaedic questions. EQUAL combines the output of a syntactic parser with semantic information from Wikipedia to analyse questions. To address natural language ambiguity, the system builds several formal interpretations containing the constraints specified by the user and addresses each interpretation in parallel. To find answers, the system then tests these constraints individually for each candidate answer, considering information from different documents and/or sources. The correctness of an answer is not proved using a logical formalism; instead, a confidence-based measure is employed. This measure reflects the validation of constraints from raw natural language, automatically extracted entities, relations and available structured and semi-structured knowledge from Wikipedia and the Semantic Web. When searching for and validating answers, EQUAL uses the Wikipedia link graph to find relevant information. This method achieves good precision and allows only pages of a certain type to be considered, but is affected by the incompleteness of the existing markup targeted towards human readers. In order to address this, a semantic analysis module which disambiguates entities is developed to enrich Wikipedia articles with additional links to other pages. The module increases recall, enabling the system to rely more on the link structure of Wikipedia than on word-based similarity between pages. It also allows authoritative information from different sources to be linked to the encyclopaedia, further enhancing the coverage of the system.
The viability of the proposed approach was evaluated in an independent setting by participating in two competitions at CLEF 2008 and 2009. In both competitions, EQUAL outperformed standard textual QA systems as well as semi-automatic approaches. Having established a feasible way forward for the design of open-domain QA systems, future work will attempt to further improve performance by taking advantage of recent advances in information extraction and knowledge representation, as well as by experimenting with formal reasoning and inferencing capabilities.
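The idea of testing each constraint individually per candidate and combining confidences, instead of proving correctness logically, might look like the sketch below. Everything in it is an assumption made for illustration: the toy `FACTS` dictionary (with approximate figures) stands in for Wikipedia, the three validator functions and their confidence values are invented, and combining confidences by multiplication is one possible choice, not necessarily the measure EQUAL uses.

```python
from math import prod

# Toy knowledge base with approximate figures; EQUAL draws its evidence from Wikipedia.
FACTS = {
    "Rhine": {"type": "river", "countries": {"Germany", "Switzerland"}, "length_km": 1230},
    "Elbe": {"type": "river", "countries": {"Germany", "Czech Republic"}, "length_km": 1094},
    "Seine": {"type": "river", "countries": {"France"}, "length_km": 777},
    "Zugspitze": {"type": "mountain", "countries": {"Germany"}, "length_km": None},
}

# Each validator returns a confidence in [0, 1] that one constraint holds (invented values).
def is_river(name):
    return 1.0 if FACTS[name]["type"] == "river" else 0.0

def flows_through_germany(name):
    return 0.9 if "Germany" in FACTS[name]["countries"] else 0.1

def longer_than_500km(name):
    length = FACTS[name]["length_km"]
    if length is None:
        return 0.5  # no evidence either way
    return 0.95 if length > 500 else 0.05

def answer(constraints, candidates, threshold=0.5):
    """Score every candidate against every constraint and keep the confident ones."""
    scored = {c: prod(check(c) for check in constraints) for c in candidates}
    ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, score) for name, score in ranked if score >= threshold]

# "Which rivers in Germany are longer than 500 km?" as three separate constraints.
constraints = [is_river, flows_through_germany, longer_than_500km]
for name, score in answer(constraints, list(FACTS)):
    print(name, round(score, 3))
```

Because each constraint is a separate function, a new constraint type can be added without touching the scoring loop, which mirrors the extensibility argument made in the abstract.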
Metacognitive development and the disambiguation effect in monolingual and bilingual children
Gollek, Cornelia January 2013
Research suggests that children are only able to flexibly apply more than one label (e.g. mouse and animal) in one situation with one conversational partner after they pass standard false belief tasks. Both abilities have been attributed to the understanding of perspective. The aim of the studies was to extend previous research to examine the disambiguation effect, children’s tendency to select an unfamiliar object in the presence of another but familiar object as referent for a novel word. Theoretical considerations suggest this effect initially results from a lack of understanding perspective. Five studies were conducted in Scotland and Austria, involving 243 children between the ages of 2.5 and 6.5. Studies 1 to 3 compared the standard disambiguation task with a task in which a strong pragmatic cue indicates the familiar object is the correct referent. Performances on these tasks were compared with performances on the false belief task, the alternative naming task, as well as tests of executive functioning. Studies 4 and 5 extended these methods to examine bilingual children’s metacognitive abilities in relation to word learning. Children become able to suspend the disambiguation effect when presented with strong pragmatic cues at the same time as they pass false belief and alternative naming tasks (Experiment 1). This can neither be attributed to impulsivity or the ability to inhibit a response, nor order effects of pragmatic cues and novel words (Experiment 2). Children’s ability to apply two labels to one object in a correction task also related to their perspectival understanding. Previous findings that suggested that younger children could produce multiple labels in a misnaming paradigm were not replicated (Experiment 3 a, b). The developmental change in children’s metalinguistic behaviour was demonstrated to follow the same trajectory in monolinguals, bilinguals and children exposed to another language (Experiment 4 and 5). 
Bilinguals show a marginally better ability to recall novel foreign language labels. The disambiguation effect is the result of cognitive immaturity in young children. Older children show a change in behaviour at the same time as they present more metacognitive maturity. Common development with theory of mind and metalinguistic abilities is attributed to an understanding of perspective.
Ambiguous synonyms: Implementing an unsupervised WSD system for division of synonym clusters containing multiple senses
Wallin, Moa January 2019
When clustering together synonyms, complications arise in cases of the words having multiple senses as each sense’s synonyms are erroneously clustered together. The task of automatically distinguishing word senses in cases of ambiguity, known as word sense disambiguation (WSD), has been an extensively researched problem over the years. This thesis studies the possibility of applying an unsupervised machine learning based WSD-system for analysing existing synonym clusters (N = 149) and dividing them correctly when two or more senses are present. Based on sense embeddings induced from a large corpus, cosine similarities are calculated between sense embeddings for words in the clusters, making it possible to suggest divisions in cases where different words are closer to different senses of a proposed ambiguous word. The system output is then evaluated by four participants, all experts in the area. The results show that the system does not manage to correctly divide the clusters in more than 31% of the cases according to the participants. Moreover, it is discovered that some differences exist between the participants’ ratings, although none of the participants predominantly agree with the system’s division of the clusters. Evidently, further research and improvements are needed and suggested for the future.
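The cosine-similarity step described above can be sketched as follows. The three-dimensional vectors are hand-made placeholders, not embeddings induced from a corpus, and the sense labels and the function `split_cluster` are hypothetical; the sketch only shows how cluster members could be grouped by the sense of the ambiguous word they are closest to.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def split_cluster(ambiguous_senses, member_senses):
    """Group cluster members by the sense of the ambiguous word they are closest to."""
    groups = {sense: [] for sense in ambiguous_senses}
    for word, vectors in member_senses.items():
        # Compare every sense of the member with every sense of the ambiguous word.
        best = max(
            ambiguous_senses,
            key=lambda s: max(cosine(ambiguous_senses[s], v) for v in vectors),
        )
        groups[best].append(word)
    return groups

# Hand-made sense vectors for the ambiguous word "bank" (illustrative values only).
ambiguous_senses = {
    "bank#finance": [0.9, 0.1, 0.0],
    "bank#river": [0.1, 0.9, 0.1],
}
# Each cluster member may have one or more sense vectors of its own.
member_senses = {
    "lender": [[0.8, 0.2, 0.1]],
    "shore": [[0.0, 0.8, 0.3]],
    "riverside": [[0.1, 1.0, 0.0]],
    "depository": [[1.0, 0.0, 0.1]],
}
print(split_cluster(ambiguous_senses, member_senses))
```

If all members end up under a single sense, the sketch suggests no division; a division is proposed only when different members are closest to different senses.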
Desambiguação automática de substantivos em corpus do português brasileiro / Word sense disambiguation in Brazilian Portuguese corpus
Silva, Viviane Santos da 19 August 2016
The phenomenon of lexical ambiguity was the central topic of this research, especially with regard to the relations between the meanings of ambiguous graphic forms and to the distribution patterns of the meanings of polysemous words in the language, that is, of words whose meanings are semantically related. This work is set at the interface between computational explorations of lexical ambiguity, specifically in natural language processing, and theoretical investigations of the phenomenon of lexical meaning. We take the notions of polysemy and homonymy as corresponding, respectively, to the case of one word with multiple related meanings and to the case of two (or more) words whose graphic forms coincide but whose meanings are synchronically unrelated.
The ultimate goal of this study was to confirm whether the most polysemous words have meanings that are less evenly distributed in the corpus, with predominant meanings that occur more frequently. To examine these aspects, we implemented a word sense disambiguation algorithm, an adapted version of the Lesk algorithm (Lesk, 1986; Jurafsky & Martin, 2000), chosen on the basis of the language resources available for Portuguese. Starting from the hypothesis that the most frequent words in the language tend to be more polysemous, we selected from the corpus (Mac-Morpho) the words with the highest number of occurrences. Given our interest in content words and in ambiguity at the strictly semantic level, we conducted the tests presented in this research only for nouns. The results obtained with the implemented disambiguation algorithm surpassed those of the baseline method based on the most-frequent-sense heuristic: we obtained 63% accuracy against 50% for the baseline over all the disambiguated data. These results were obtained with the pseudoword disambiguation procedure (pseudowords formed at random), which is used when semantically annotated corpora are not available. However, because this disambiguation method depends on fixed inventories of meanings taken from dictionaries, we searched for alternative ways of categorizing the meanings of a word. Based on the work of Sproat & VanSanten (2001), we implemented a method for assigning numerical values that indicate how far a word has moved away from monosemy within a given corpus. This measure, named the polysemy index by the authors of the original work, is based on grouping the words that co-occur with the target word according to their contextual similarities. We proposed in this work the use of a second measure, mentioned by the authors only as an example of potential applications of the method still to be explored: the clustering of co-occurring words based on the similarity of their contexts of use.
This second measure is obtained in such a way that it shows the closeness between meanings and the number of meanings that a word displays in the corpus. Some aspects of the results indicate the potential of the clustering method: the clusters of co-occurring words are weighted, highlighting the most prominent groups of neighbors of the target word; and the clusters approach one another according to contextual similarity measures, which can serve to distinguish homonymic from polysemic tendencies. As an example, we have the clusters obtained for the word produção ("production"): one related to the idea of literary production and the other to that of agricultural production. These two clusters exhibited considerable distance, standing in the range of what would be considered a case of polysemy, and both showed significant weights, that is, they were composed of more relevant words. We identified three main factors that limited the analyses of the data obtained: the political-journalistic bias of the corpus we used (Mac-Morpho) and the need for further tests varying the selection parameters of co-occurring words, since the parameters we used are expected to vary for other corpora and, especially, because we conducted only a few tests to determine the values of these parameters, which are decisive for the number of relevant co-occurring words in the contexts of use of the target word. Considering both the advantages and the limitations observed in the results of the clustering method, we plan to design a synchronic (that is, one that dispenses with the historical documentation of words) and computational method to distinguish cases of polysemy and homonymy more systematically and over a larger amount of data.
We understand that a method of this nature can be of great value for studies of meaning at the lexical level, allowing the establishment of an objective method based on language usage data that goes beyond isolated examples.
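For readers unfamiliar with the Lesk family of algorithms mentioned in this abstract, the following is a minimal sketch of the simplified Lesk idea: choose the sense whose dictionary gloss shares the most content words with the context. It is not the adapted version implemented in the thesis; the English glosses, the small stopword list and the sense labels are invented for illustration, whereas the thesis works with Portuguese resources.

```python
STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "or", "that", "is", "for", "by", "at"}

def content_words(text):
    """Lowercased tokens without surrounding punctuation and without stopwords."""
    return {w.strip(".,;:()").lower() for w in text.split()} - STOPWORDS

def simplified_lesk(context, sense_glosses):
    """Return the sense whose gloss shares the most content words with the context."""
    context_words = content_words(context)
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        overlap = len(context_words & content_words(gloss))
        # Ties keep the first listed sense, so list the most frequent sense first.
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Invented glosses for two senses of "production" (illustrative only).
glosses = {
    "literary": "a book, play or film created by an author or artist",
    "agricultural": "the growing of crops and raising of livestock on a farm",
}
print(simplified_lesk("the farm increased its production of crops this year", glosses))
print(simplified_lesk("the author finished a new production for the theatre", glosses))
```

Listing the most frequent sense first makes the tie-breaking rule fall back to the most-frequent-sense heuristic, which is the baseline the abstract compares against.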
Desambiguação lexical de revisões de itens aplicada em sistemas de recomendação / Word sense disambiguation of items revisions applied in recommendation systems
Marinho, Ronnie Shida 14 May 2018
To help users in the search for relevant products, Web systems have integrated item recommendation modules, which automatically select content according to the interests of each individual. Although there are diverse approaches for computing recommendations from the interactions available in the system, most of them suffer from a lack of information for characterizing user preferences and item descriptions. Recent research on recommender systems has studied the possibility of using user reviews as a source of metadata, since users create them collaboratively. However, the literature still lacks studies on how to organize and structure these data semantically.
Therefore, this study aims to develop techniques for constructing item representations based on collaborative descriptions for a recommender system. It also aims to analyze the impact of distinct word sense disambiguation methods on the precision of recommendations, evaluated in the rating prediction scenario. Our results showed that we can characterize users and items in a more efficient way, favoring the calculation of recommendations according to users' preferences.
Desambiguação automática de substantivos em corpus do português brasileiro / Word sense disambiguation in Brazilian Portuguese corpus
Viviane Santos da Silva, 19 August 2016
The phenomenon of lexical ambiguity was the central topic of this research, especially with regard to the relations between the senses of ambiguous graphic forms and to the distribution patterns of the senses of polysemous words in the language, that is, of words whose senses are semantically related. This work is proposed as an interface between computational explorations of lexical ambiguity, specifically in natural language processing, and theoretical investigations of lexical meaning. We take polysemy and homonymy to correspond, respectively, to the case of one word with multiple related senses and to the case of two (or more) words whose graphic forms coincide but whose senses are synchronically unrelated.
The ultimate goal of this study was to confirm whether the most polysemous words have senses that are less evenly distributed in the corpus, with predominant senses that occur more frequently. To examine these aspects, we implemented a word sense disambiguation algorithm, an adapted version of the Lesk algorithm (Lesk, 1986; Jurafsky & Martin, 2000), chosen on the basis of the language resources available for Portuguese. Starting from the hypothesis that the most frequent words in the language tend to be more polysemous, we selected from the corpus (Mac-Morpho) the words with the highest number of occurrences. Given our interest in content words and in cases of ambiguity that are strictly semantic, we decided to conduct the tests presented in this work only for nouns. The results obtained with the disambiguation algorithm we implemented surpassed the baseline method based on the most-frequent-sense heuristic: we obtained 63% accuracy against 50% for the baseline over all the disambiguated data. These results were obtained through the disambiguation of pseudowords (formed at random), a procedure used when semantically annotated corpora are not available. However, because this method depends on fixed sense inventories taken from dictionaries, we searched for alternative ways of categorizing the senses of a word. Based on the work of Sproat & VanSanten (2001), we implemented a method that assigns numerical values indicating how far a word has moved away from monosemy within a given corpus. This measure, which the authors of the original work call the polysemy index, is based on grouping the words that co-occur with the target word according to their contextual similarities. In this work we proposed the use of a second measure, mentioned by the authors only as an example of potential applications of the method still to be explored: the clustering of co-occurring words based on similarities in their contexts of use.
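The abstract does not spell out how the Lesk algorithm was adapted, so the following is only a rough illustration of the general idea: a simplified Lesk disambiguator picks the sense whose dictionary gloss shares the most words with the context of the target word. The toy English sense inventory (`SENSES`), the stopword list, and the tie-break to the first listed sense (standing in here for a most-frequent-sense fallback) are all assumptions made for this sketch, not details from the thesis.

```python
import re

# Hypothetical toy sense inventory: sense label -> dictionary-style gloss.
# The first sense listed is treated as the "most frequent" one.
SENSES = {
    "bank": {
        "bank_finance": "an institution that accepts deposits and lends money",
        "bank_river": "the sloping land beside a river or lake",
    }
}

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "and", "in", "of", "on", "or", "that", "the", "to", "we"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}


def simplified_lesk(word, context, senses=SENSES):
    """Return the sense of `word` whose gloss overlaps most with `context`.

    Ties, including the case of zero overlap everywhere, fall back to the
    first listed sense.
    """
    context_tokens = tokenize(context) - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in senses[word].items():
        overlap = len(tokenize(gloss) & context_tokens)
        if overlap > best_overlap:  # strict ">" keeps the earlier sense on ties
            best_sense, best_overlap = sense, overlap
    return best_sense


print(simplified_lesk("bank", "We walked along the river to the bank of the lake"))
print(simplified_lesk("bank", "She went to the bank to deposit money"))
```

Comparing the output of such a function with the sense that is always chosen by the fallback alone gives the kind of algorithm-versus-baseline comparison the abstract reports, although the figures there come from the thesis's own data and resources.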
This second measure is obtained in such a way that one can verify both the proximity between senses and the number of senses that a word displays in the corpus. Some aspects of the results indicate the potential of the clustering method: the clusters of co-occurring words obtained are weighted, highlighting the most prominent groups of neighbors of the target word; and the clusters approach one another according to contextual similarity measures, which can serve to distinguish homonymic from polysemic tendencies. As an example, we have the clusters obtained for the word produção (production): one related to the idea of literary production and the other to that of agricultural production. These two clusters were considerably distant from each other, falling within the range of what would be considered a case of polysemy, and both had significant weights, that is, they were composed of more relevant words. We identified three main factors that limited the analysis of the data obtained: the political-journalistic bias of the corpus we used (Mac-Morpho); the need for further tests varying the parameters for selecting co-occurring words, since the parameters we used must vary for other corpora; and, especially, the fact that we conducted only a few tests to define the values of these parameters, which are decisive for the number of co-occurring words relevant to the contexts of use of the target word. Considering both the advantages and the limitations observed in the clustering results, we plan to design a synchronic (that is, one that dispenses with the historical documentation of words) and computational method for distinguishing cases of polysemy and homonymy more systematically and over a larger amount of data.
We understand that a method of this nature can be of great value for studies of meaning at the lexical level, allowing the establishment of an objective method based on language usage data that goes beyond isolated examples.
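To make the idea of grouping a target word's co-occurring words by contextual similarity more concrete, here is a minimal sketch. It is not the polysemy index of Sproat & VanSanten (2001) nor the clustering procedure of the thesis: it assumes a toy tokenized corpus (`SENTENCES`), represents each neighbor by the set of sentences in which it appears near the target, compares neighbors by cosine similarity, and uses a crude single-pass grouping with an arbitrary threshold. All names and parameter values are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical toy corpus, already tokenized.
SENTENCES = [
    ["literary", "production", "novel", "author"],
    ["literary", "production", "poem", "author"],
    ["agricultural", "production", "soybean", "harvest"],
    ["agricultural", "production", "corn", "harvest"],
]


def neighbour_contexts(sentences, target, window=2):
    """Map each word found within `window` tokens of `target` to the set of
    sentence indices in which that co-occurrence happens."""
    contexts = defaultdict(set)
    for s_idx, tokens in enumerate(sentences):
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    contexts[tokens[j]].add(s_idx)
    return contexts


def cosine(a, b):
    """Cosine similarity between two sets viewed as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def group_neighbours(sentences, target, window=2, threshold=0.5):
    """Single-pass grouping: a neighbour joins the first group whose seed
    (first member) is at least `threshold` similar, else starts a new group."""
    contexts = neighbour_contexts(sentences, target, window)
    groups = []
    for word, ctx in contexts.items():
        for group in groups:
            if cosine(ctx, contexts[group[0]]) >= threshold:
                group.append(word)
                break
        else:
            groups.append([word])
    return groups


for group in group_neighbours(SENTENCES, "production"):
    print(group)
```

On this toy corpus the neighbors fall into one group around literary production and another around agricultural production. With real data the window size and threshold would need tuning, which echoes the abstract's remark that the selection parameters are decisive and corpus-dependent.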
Μεθοδολογία αυτόματου σημασιολογικού σχολιασμού στο περιεχόμενο ιστοσελίδων / Methodology for automatic semantic annotation of web page content
Σπύρος, Γεώργιος, 14 December 2009
Nowadays the use of the World Wide Web has evolved into a social phenomenon. Its spread is continuous and exponentially increasing. In the years since it first appeared, users have gained a degree of experience and, based on that experience, have come to accept certain things about the Web. In particular, they have understood that the web pages they interact with almost daily are the creations of other users. It has also become clear that any user can create their own web page and include in it references to another user's web page. These references, however, usually do not appear simply as hyperlinks. Most of the time they are accompanied by text that provides information about the content of the referenced page.
In this diploma thesis we describe a methodology for the automatic semantic annotation of web page content. The tools and techniques described are based on two main hypotheses. First, people who create and maintain web pages describe other web pages within them. Second, people connect their web pages to each page they describe through an anchor link that is clearly marked with a specific tag in the page's HTML code.
The automatic semantic annotation that we attempt for a web page amounts to finding a tag capable of describing its content. Finding this tag is a process based on a specific methodology consisting of a fixed number of steps. Each step is implemented using various tools and techniques, and its output feeds the input of the next step.
The basic idea of the methodology is to collect a large number of anchor texts, together with a window of the words around them, for a given web page. This collection results from processing many web pages that contain hyperlinks to that page. The semantic tag for a web page is derived by applying various natural language processing techniques to the collection of texts that refer to it. This yields the final conclusion about the semantic annotation of the page's content.
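The collection step described above, gathering anchor texts plus a window of nearby words from pages that link to the page being annotated, might look roughly like the sketch below. It uses Python's standard `html.parser` module; the class name `AnchorContextParser`, the sample page, the URL, and the window size are all assumptions for illustration, and the thesis's actual tools are not stated in the abstract. For brevity the sketch matches `href` values exactly and does not skip script or style content.

```python
from html.parser import HTMLParser


class AnchorContextParser(HTMLParser):
    """Collect anchor texts that link to `target_url`, with nearby words."""

    def __init__(self, target_url, window=3):
        super().__init__()
        self.target_url = target_url
        self.window = window
        self.words = []     # all text words, in document order
        self.spans = []     # (start, end) word indices of matching anchors
        self._start = None  # start index of the matching anchor now open

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href") == self.target_url:
            self._start = len(self.words)

    def handle_endtag(self, tag):
        if tag == "a" and self._start is not None:
            self.spans.append((self._start, len(self.words)))
            self._start = None

    def handle_data(self, data):
        self.words.extend(data.split())

    def contexts(self):
        """Return (anchor_text, surrounding_words) pairs for each anchor."""
        result = []
        for start, end in self.spans:
            before = self.words[max(0, start - self.window):start]
            after = self.words[end:end + self.window]
            result.append((" ".join(self.words[start:end]), before + after))
        return result


# Hypothetical linking page and target URL.
PAGE = (
    '<p>For recipes I always check '
    '<a href="http://example.org/">this cooking site</a> '
    'before shopping for dinner.</p>'
)

parser = AnchorContextParser("http://example.org/", window=3)
parser.feed(PAGE)
parser.close()
for anchor_text, nearby in parser.contexts():
    print(anchor_text, nearby)
```

Running such a parser over many linking pages and pooling the resulting pairs would give the text collection to which the later language processing steps are applied.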
