1 |
Cross-lingual Information Retrieval On Turkish And English Texts
Boynuegri, Akif 01 April 2010
In this thesis, cross-lingual information retrieval (CLIR) approaches are comparatively evaluated for Turkish and English texts. As a complementary study, knowledge-based methods for word sense disambiguation (WSD), one of the most important components of CLIR studies, are compared for Turkish words.
Two CLIR approaches are used in this study: query translation and sense indexing. In the query translation approach, queries are translated using automatic and manual word sense disambiguation methods and the Google translation service. In the sense indexing approach, documents are indexed according to the meanings of words instead of the words themselves, and documents are likewise retrieved according to the meanings of the query words. To identify the intended meaning of query terms, manual and automatic word sense disambiguation methods are used and compared with each other.
Knowledge-based WSD methods that use different gloss enrichment techniques are compared for Turkish words. Turkish WordNet is used as the primary knowledge base, and English WordNet and Turkish Wikipedia are employed as enrichment resources. The semantic relations defined in the WordNets and Turkish Wikipedia allow the meanings of words to be identified more clearly. In addition, when calculating the semantic relatedness of senses, the cosine similarity metric is used as an alternative to word overlap count. The effects of using cosine similarity are observed for each WSD method across the different knowledge bases.
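The two relatedness measures named above, word overlap count and cosine similarity, can be sketched on sense glosses as follows. This is a minimal illustration, not the thesis's implementation: the whitespace tokeniser, the function names, and the toy English glosses are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenisation; a real Turkish pipeline
    # would need morphological analysis.
    return text.lower().split()


def overlap_count(gloss_a, gloss_b):
    """Number of word tokens shared by two glosses (multiset intersection)."""
    a, b = Counter(tokenize(gloss_a)), Counter(tokenize(gloss_b))
    return sum((a & b).values())


def cosine_similarity(gloss_a, gloss_b):
    """Cosine of the angle between the term-frequency vectors of two glosses."""
    a, b = Counter(tokenize(gloss_a)), Counter(tokenize(gloss_b))
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pick_sense(context, sense_glosses, score=cosine_similarity):
    """Return the sense whose gloss scores highest against the context."""
    return max(sense_glosses, key=lambda s: score(context, sense_glosses[s]))


# Hypothetical glosses, for illustration only.
senses = {
    "bank_financial": "an institution that accepts money deposits and manages an account",
    "bank_river": "sloping land beside a river",
}
print(pick_sense("deposit money in the bank account", senses))
```

Unlike the raw overlap count, cosine similarity normalises by gloss length, so a long enriched gloss does not win simply because it contains more words.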
|
2 |
Chinese-English cross-lingual information retrieval in biomedicine using ontology-based query expansion
Wang, Xinkai January 2011
In this thesis, we propose a new approach to Chinese-English biomedical cross-lingual information retrieval (CLIR) using query expansion based on the eCMeSH Tree, a Chinese-English ontology extended from the Chinese Medical Subject Headings (CMeSH) Tree. The CMeSH Tree is not designed for information retrieval (IR): it includes only heading terms and has no weighting scheme for them. We therefore design an algorithm that combines a rule-based parsing technique, the C-value term extraction algorithm, and a filtering technique based on mutual information to extract Chinese synonyms for the corresponding heading terms. We also develop a term-weighting mechanism. Following the hierarchical structure of CMeSH, we extend the CMeSH Tree to the eCMeSH Tree with synonymous terms and their weights, and we propose an algorithm that implements CLIR by expanding queries with eCMeSH Tree terms. To evaluate the retrieval improvements obtained from our approach, we compare query expansion based on the eCMeSH Tree individually against three alternatives: query expansion using CMeSH Tree terms, query expansion using pseudo-relevance feedback, and document translation. We also evaluate combinations of these three approaches. Finally, this study investigates factors that affect CLIR performance, including the stemming algorithm, the retrieval model, and word segmentation.
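The general idea of expanding a query with weighted ontology terms can be sketched as below. This is only an illustration under stated assumptions: the abstract does not give the eCMeSH data layout or the expansion algorithm, so the dictionary structure, the `expand_query` function, the weights, and the sample entry are all hypothetical.

```python
def expand_query(query_terms, ontology, min_weight=0.0):
    """Return a {term: weight} map holding the original query terms
    (weight 1.0) plus any ontology synonyms at or above min_weight."""
    expanded = {}
    for term in query_terms:
        # Original query terms always keep full weight.
        expanded[term] = 1.0
        for synonym, weight in ontology.get(term, []):
            if weight >= min_weight:
                # Keep the highest weight if a term is reached more than once.
                expanded[synonym] = max(expanded.get(synonym, 0.0), weight)
    return expanded


# Hypothetical ontology fragment: heading term -> [(synonym, weight), ...].
toy_ontology = {
    "高血压": [("hypertension", 0.9), ("high blood pressure", 0.7)],
}

print(expand_query(["高血压"], toy_ontology))
print(expand_query(["高血压"], toy_ontology, min_weight=0.8))
```

A retrieval model would then use the weights to scale each expanded term's contribution to the document score.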
|
3 |
Effective Techniques for Indonesian Text Retrieval
Asian, Jelita, jelitayang@gmail.com January 2007
The Web is a vast repository of data, and information on almost any subject can be found with the aid of search engines. Although the Web is international, most research on finding information focuses on languages such as English and Chinese. In this thesis, we investigate information retrieval techniques for Indonesian. Although Indonesia is the fourth most populous country in the world, little attention has been given to searching Indonesian documents. Stemming is the process of reducing morphological variants of a word to a common stem form. Previous research has shown that stemming is language-dependent. Although several stemming algorithms have been proposed for Indonesian, there is no consensus on which gives better performance. We empirically explore these algorithms, showing that even the best algorithm still has scope for improvement. We propose novel extensions to this algorithm and develop a new Indonesian stemmer, and show that these can improve stemming correctness by up to three percentage points; our approach makes fewer than one error in thirty-eight words. We propose a range of techniques to enhance the performance of Indonesian information retrieval. These techniques include stopping, sub-word tokenisation, identification of proper nouns, and modifications to existing similarity functions. Our experiments show that many of these techniques can increase retrieval performance, with the highest increase achieved when we use grams of size five to tokenise words. We also present an effective method for identifying the language of a document; this allows various information retrieval techniques to be applied selectively depending on the language of target documents. We also address the problem of automatic creation of parallel corpora (collections of documents that are direct translations of each other), which are essential for cross-lingual information retrieval tasks.
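The sub-word tokenisation with grams of size five mentioned above can be sketched as follows. The abstract does not specify the exact scheme, so overlapping character n-grams, the handling of short words, and the function names are assumptions for illustration.

```python
def char_ngrams(word, n=5):
    """Split a word into overlapping character n-grams.
    Assumption: words no longer than n are kept whole."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def tokenise(text, n=5):
    """Tokenise text into character n-grams, word by word."""
    grams = []
    for word in text.lower().split():
        grams.extend(char_ngrams(word, n))
    return grams


print(char_ngrams("membaca"))        # ['memba', 'embac', 'mbaca']
print(tokenise("Saya membaca buku"))
```

Indexing such grams instead of whole words lets morphological variants that share a stem match on their common grams without running a stemmer.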
Well-curated parallel corpora are rare, and for many languages, such as Indonesian, do not exist at all. We describe algorithms that we have developed to automatically identify parallel documents for Indonesian and English. Unlike most current approaches, which consider only the context and structure of the documents, our approach is based on the document content itself. Our algorithms do not make any prior assumptions about the documents, and are based on the Needleman-Wunsch algorithm for global alignment of protein sequences. Our approach works well in identifying Indonesian-English parallel documents, especially when no translation is performed. It can increase the separation value, a measure to discriminate good matches of parallel documents from bad matches, by approximately ten percentage points. We also investigate the applicability of our identification algorithms for other languages that use the Latin alphabet. Our experiments show that, with minor modifications, our alignment methods are effective for English-French, English-German, and French-German corpora, especially when the documents are not translated. Our technique can increase the separation value for the European corpus by up to twenty-eight percentage points. Together, these results provide a substantial advance in understanding techniques that can be applied for effective Indonesian text retrieval.
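The Needleman-Wunsch global alignment on which the approach above is based can be sketched over word sequences as below. This shows only the standard dynamic programme; the scoring values (match +1, mismatch -1, gap -1), the use of whitespace tokens, and the sample sentences are assumptions, and the thesis's own scoring scheme and separation value are not reproduced here.

```python
def needleman_wunsch_score(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of two sequences,
    keeping only one row of the dynamic-programming table at a time."""
    # Row 0: aligning an empty prefix of seq_a with j items of seq_b costs j gaps.
    prev = [j * gap for j in range(len(seq_b) + 1)]
    for i, a in enumerate(seq_a, start=1):
        curr = [i * gap]  # Column 0: i items of seq_a against nothing.
        for j, b in enumerate(seq_b, start=1):
            diagonal = prev[j - 1] + (match if a == b else mismatch)
            up = prev[j] + gap        # gap inserted in seq_b
            left = curr[j - 1] + gap  # gap inserted in seq_a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


# Untranslated documents can still share tokens such as names and numbers.
doc_id = "Presiden tiba di Jakarta pada 2006".split()
doc_en = "The president arrived in Jakarta in 2006".split()
print(needleman_wunsch_score(doc_id, doc_en))
```

A higher score for one candidate pair than for others suggests that the two documents share more content in the same order, which is the signal a parallel-document identifier would rank on.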
|
4 |
Fouille de documents et d'opinions multilingue / Mining Documents and Sentiments in Cross-lingual Context
Saad, Motaz 20 January 2015
The aim of this thesis is to study sentiments in comparable documents. First, we collect English, French and Arabic comparable corpora from Wikipedia and Euronews, and we align each corpus at the document level. We also gather English-Arabic news documents from local and foreign news agencies: the English documents are collected from the BBC website and the Arabic documents from the Al Jazeera website. Second, we present a cross-lingual document similarity measure to automatically retrieve and align comparable documents. Then, we propose a cross-lingual sentiment annotation method to label source and target documents with sentiments. Finally, we use statistical measures to compare the agreement of sentiments between the source and target documents of each comparable pair. The methods presented in this thesis are language-independent and can be applied to any language pair.
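Agreement between the sentiment labels of aligned source and target documents can be measured in several ways. The abstract does not name the statistical measures used, so the sketch below picks Cohen's kappa purely as one common example; the function name and the sample labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two label sequences
    of equal length (one label per aligned document pair)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of pairs with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both sides labelled independently at their own rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both sides use a single identical label throughout.
    return (observed - expected) / (1 - expected)


# Hypothetical labels for four aligned document pairs.
source = ["pos", "pos", "neg", "neg"]
target = ["pos", "neg", "neg", "neg"]
print(cohen_kappa(source, target))
```

A value near 1 indicates that source and target documents carry the same sentiment far more often than chance would predict, while a value near 0 indicates chance-level agreement.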
|