11 |
Cross-lingual Information Retrieval On Turkish And English Texts
Boynuegri, Akif 01 April 2010
In this thesis, cross-lingual information retrieval (CLIR) approaches are comparatively evaluated
for Turkish and English texts. As a complementary study, knowledge-based methods
for word sense disambiguation (WSD), one of the most important parts of CLIR
studies, are compared for Turkish words.
Query translation and sense indexing based CLIR approaches are used in this study. In the query
translation approach, we use automatic and manual word sense disambiguation methods and
the Google translation service to translate queries. In the sense indexing based approach,
documents are indexed according to the meanings of words instead of the words themselves. Retrieval
of documents is likewise performed according to the meanings of the query words. To identify
the intended meaning of query terms, manual and automatic word sense disambiguation
methods are used and compared to each other.
Knowledge-based WSD methods that use different gloss enrichment techniques are compared
for Turkish words. Turkish WordNet is used as the primary knowledge base, and English
WordNet and Turkish Wikipedia are employed as enrichment resources. Meanings of
words are identified more clearly by using the semantic relations defined in the WordNets and Turkish
Wikipedia. In addition, when calculating the semantic relatedness of senses, the cosine similarity
metric is used as an alternative to word overlap count. The effects of using the cosine similarity
metric are observed for each of the WSD methods that use different knowledge bases.
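The contrast between word overlap count and cosine similarity can be sketched as follows. This is a minimal illustration and not the thesis's implementation: the whitespace tokenizer, the function names, and the toy English glosses in the usage note are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def overlap_count(gloss_a, gloss_b):
    """Number of word tokens shared by two glosses, counted with multiplicity."""
    a, b = Counter(tokenize(gloss_a)), Counter(tokenize(gloss_b))
    return sum((a & b).values())


def cosine_similarity(gloss_a, gloss_b):
    """Cosine of the angle between the term-frequency vectors of two glosses."""
    a, b = Counter(tokenize(gloss_a)), Counter(tokenize(gloss_b))
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def best_sense(context, sense_glosses, score=cosine_similarity):
    """Pick the sense whose gloss scores highest against the context."""
    return max(sense_glosses, key=lambda sense: score(context, sense_glosses[sense]))
```

With the hypothetical context "he sat on the bank of the river" and the glosses "sloping land beside a river" and "institution that accepts deposits of money", both glosses share exactly one word with the context, so the overlap count ties. Cosine similarity normalises by gloss length and therefore ranks the shorter river gloss higher.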
|
12 |
Chinese-English cross-lingual information retrieval in biomedicine using ontology-based query expansion
Wang, Xinkai January 2011
In this thesis, we propose a new approach to Chinese-English Biomedical cross-lingual information retrieval (CLIR) using query expansion based on the eCMeSH Tree, a Chinese-English ontology extended from the Chinese Medical Subject Headings (CMeSH) Tree. The CMeSH Tree is not designed for information retrieval (IR), since it only includes heading terms and has no term weighting scheme for these terms. Therefore, we design an algorithm, which employs a rule-based parsing technique combined with the C-value term extraction algorithm and a filtering technique based on mutual information, to extract Chinese synonyms for the corresponding heading terms. We also develop a term-weighting mechanism. Following the hierarchical structure of CMeSH, we extend the CMeSH Tree to the eCMeSH Tree with synonymous terms and their weights. We propose an algorithm to implement CLIR using the eCMeSH Tree terms to expand queries. In order to evaluate the retrieval improvements obtained from our approach, the results of the query expansion based on the eCMeSH Tree are individually compared with the results of the experiments of query expansion using the CMeSH Tree terms, query expansion using pseudo-relevance feedback, and document translation. We also evaluate the combinations of these three approaches. This study also investigates the factors which affect the CLIR performance, including a stemming algorithm, retrieval models, and word segmentation.
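Query expansion with weighted ontology terms might look roughly like the sketch below. It assumes the ontology is available as a plain mapping from a heading term to its synonyms and their weights. That mapping, the function name `expand_query`, the `min_weight` threshold, and the English toy entries in the usage note are hypothetical and are not taken from the eCMeSH Tree.

```python
def expand_query(query_terms, ontology, min_weight=0.0):
    """Return {term: weight}: original terms at weight 1.0 plus weighted synonyms.

    `ontology` maps a heading term to a dict of {synonym: weight}.
    Synonyms whose weight falls below `min_weight` are dropped.
    """
    expanded = {term: 1.0 for term in query_terms}
    for term in query_terms:
        for synonym, weight in ontology.get(term, {}).items():
            if weight >= min_weight:
                # Keep the highest weight seen; original terms stay at 1.0.
                expanded[synonym] = max(expanded.get(synonym, 0.0), weight)
    return expanded
```

For example, with the toy ontology `{"heart attack": {"myocardial infarction": 0.9, "MI": 0.4}}` and `min_weight=0.5`, the query `["heart attack"]` expands to the original term at weight 1.0 and "myocardial infarction" at weight 0.9, while the low-weight abbreviation is filtered out.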
|
13 |
Conversão de voz inter-linguística / Crosslingual Voice Conversion
Anderson Fraiha Machado 21 May 2013
Voice conversion is an emergent problem in voice and speech processing with increasing commercial interest, due to applications such as Speech-to-Speech Translation (SST) and personalized Text-To-Speech (TTS) systems. A voice conversion system should allow the mapping of acoustical features of sentences pronounced by a source speaker to values corresponding to the voice of a target speaker, in such a way that the processed output is perceived as a sentence uttered by the target speaker. In the last two decades the number of scientific contributions to the voice conversion problem has grown considerably, and a solid overview of the historical process as well as of the proposed techniques is indispensable for those willing to contribute to the field. The goal of this work is to provide a critical survey that combines historical presentation with technical discussion while pointing out the advantages and drawbacks of each technique and, from this study, to develop new tools. Contributions proposed in this work include a method for spectral decomposition in terms of radial basis functions, artificial phonetic maps, k-likely clusterings (agrupamentos k-verossímeis), and frequency warping functions, among others, in order to implement a high-quality, text-independent crosslingual voice conversion system.
|
14 |
Cross-lingual and Multilingual Automatic Speech Recognition for Scandinavian Languages
Černiavski, Rafal January 2022
Research into Automatic Speech Recognition (ASR), the task of transforming speech into text, remains highly relevant due to its countless applications in industry and academia. State-of-the-art ASR models are able to produce nearly perfect, sometimes referred to as human-like transcriptions; however, accurate ASR models are most often available only in high-resource languages. Furthermore, the vast majority of ASR models are monolingual, that is, only able to handle one language at a time. In this thesis, we extensively evaluate the quality of existing monolingual ASR models for Swedish, Danish, and Norwegian. In addition, we search for parallels between monolingual ASR models and the cognition of foreign languages in native speakers of these languages. Lastly, we extend the Swedish monolingual model to handle all three languages. The research conducted in this thesis project is divided into two main sections, namely monolingual and multilingual models. In the former, we analyse and compare the performance of monolingual ASR models for Scandinavian languages in monolingual and cross-lingual settings. We compare these results against the levels of mutual intelligibility of Scandinavian languages in native speakers of Swedish, Danish, and Norwegian to see whether the monolingual models favour the same languages as native speakers. We also examine the performance of the monolingual models on the regional dialects of all three languages and perform qualitative analysis of the most common errors. As for multilingual models, we expand the most accurate monolingual ASR model to handle all three languages. To do so, we explore the most suitable settings via trial models. In addition, we propose an extension to the well-established Wav2Vec 2.0-CTC architecture by incorporating a language classification component. The extension enables the usage of language models, thus boosting the overall performance of the multilingual models. 
The results reported in this thesis suggest that in a cross-lingual setting, monolingual ASR models for Scandinavian languages perform better on the languages that are easier to comprehend for native speakers. Furthermore, the addition of a statistical language model boosts the performance of ASR models in monolingual, cross-lingual, and multilingual settings. ASR models appear to favour certain regional dialects, though the gap narrows in a multilingual setting. Contrary to our expectations, our multilingual model performs comparably with the monolingual Swedish ASR models and outperforms the Danish and Norwegian models. The multilingual architecture proposed in this thesis project is fairly simple yet effective. With greater computational resources at hand, further extensions offered in the conclusions might improve the models further.
|
15 |
Minimalism Yields Maximum Results: Deep Learning with Limited Resource
Haoyu Wang (19193416) 22 July 2024
Deep learning models have demonstrated remarkable success across diverse domains, including computer vision and natural language processing. These models rely heavily on resources, encompassing annotated data, computational power, and storage. However, mobile devices often face constraints on computing power, and in scenarios such as medical or multilingual contexts, ample data annotation is prohibitively expensive. Developing deep learning models for such resource-constrained scenarios presents a formidable challenge. Our primary goal is to enhance the efficiency of state-of-the-art neural network models tailored for resource-limited scenarios. Our commitment lies in crafting algorithms that not only mitigate annotation requirements but also reduce computational complexity and alleviate storage demands. Our dissertation focuses on two key areas: Parameter-efficient Learning and Data-efficient Learning. In Part 1, we present our studies on parameter-efficient learning. This approach targets the creation of lightweight models for efficient storage or inference. The proposed solutions are tailored for diverse tasks, including text generation, text classification, and text/image retrieval. In Part 2, we showcase our proposed methods for data-efficient learning, concentrating on cross-lingual and multi-lingual text classification applications.
|
16 |
Fouille de documents et d'opinions multilingue / Mining Documents and Sentiments in Cross-lingual Context
Saad, Motaz 20 January 2015
The aim of this thesis is to study sentiments in comparable documents. First, we collect English, French and Arabic comparable corpora from Wikipedia and Euronews, and we align each corpus at the document level. We further gather English-Arabic news documents from local and foreign news agencies. The English documents are collected from the BBC website and the Arabic documents from the Al Jazeera website. Second, we present a cross-lingual document similarity measure to automatically retrieve and align comparable documents. Then, we propose a cross-lingual sentiment annotation method to label source and target documents with sentiments. Finally, we use statistical measures to compare the agreement of sentiments between the source and target documents of each comparable pair. The methods presented in this thesis are language independent and can be applied to any language pair.
|
17 |
Computational models for multilingual negation scope detection
Fancellu, Federico January 2018
Negation is a common property of languages, in that there are few languages, if any, that lack means to reverse the truth-value of a statement. A challenge to cross-lingual studies of negation lies in the fact that languages encode and use it in different ways. Although this variation has been extensively researched in linguistics, little has been done in automated language processing. In particular, we lack computational models of processing negation that can be generalized across languages. We even lack knowledge of what the development of such models would require. These models, however, exist and can be built by means of existing cross-lingual resources, even when annotated data for a language other than English is not available. This thesis shows this in the context of detecting string-level negation scope, i.e. the set of tokens in a sentence whose meaning is affected by a negation marker (e.g. 'not'). Our contribution has two parts. First, we investigate the scenario where annotated training data is available. We show that Bi-directional Long Short Term Memory (BiLSTM) networks are state-of-the-art models whose features can be generalized across languages. We also show that these models suffer from genre effects and that, for most of the corpora we have experimented with, high performance is simply an artifact of the annotation styles, where negation scope is often a span of text delimited by punctuation. Second, we investigate the scenario where annotated data is available in only one language, experimenting with model transfer. To test our approach, we first build NEGPAR, a parallel corpus annotated for negation, where pre-existing annotations on English sentences have been edited and extended to Chinese translations.
We then show that transferring a model for negation scope detection across languages is possible by means of structured neural models where negation scope is detected on top of a cross-linguistically consistent representation, Universal Dependencies. On the other hand, we found cross-lingual lexical information only to help very little with performance. Finally, error analysis shows that performance is better when a negation marker is in the same dependency substructure as its scope and that some of the phenomena related to negation scope requiring lexical knowledge are still not captured correctly. In the conclusions, we tie up the contributions of this thesis and we point future work towards representing negation scope across languages at the level of logical form as well.
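The observation that negation scope is often a punctuation-delimited span suggests a trivial baseline, sketched below. This heuristic is an illustration only and is not one of the models evaluated in the thesis. The punctuation set, the function name, and the choice to exclude the negation marker itself from its scope are assumptions.

```python
PUNCTUATION = {",", ".", ";", ":", "!", "?"}


def punctuation_scope(tokens, cue_index):
    """Indices of the tokens in the punctuation-delimited span around the cue.

    The span grows left and right from the negation marker until it hits a
    punctuation token or a sentence boundary; the marker itself is excluded.
    """
    start = cue_index
    while start > 0 and tokens[start - 1] not in PUNCTUATION:
        start -= 1
    end = cue_index
    while end + 1 < len(tokens) and tokens[end + 1] not in PUNCTUATION:
        end += 1
    return [i for i in range(start, end + 1) if i != cue_index]
```

For the tokens `["I", "called", ",", "but", "she", "did", "not", "answer", "."]` with the marker at index 6, the baseline returns the indices of "but", "she", "did" and "answer". A system that scores well only because the gold scopes look like this has learned little about negation, which is the annotation artifact the abstract describes.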
|
18 |
Generative Adversarial Networks for Cross-Lingual Voice Conversion
Ankaräng, Fredrik January 2021
Speech synthesis is a technology that increasingly influences our daily lives, in the form of smart assistants, advanced translation systems and similar applications. In this thesis, the phenomenon of making one’s voice sound like the voice of someone else is explored. This topic is called voice conversion and needs to be done without altering the linguistic content of speech. More specifically, a Cycle-Consistent Adversarial Network that has proven to work well in a monolingual setting, is evaluated in a multilingual environment. The model is trained to convert voices between native speakers from the Nordic countries. In the experiments no parallel, transcribed or aligned speech data is being used, forcing the model to focus on the raw audio signal. The goal of the thesis is to evaluate if performance is degraded in a multilingual environment, in comparison to monolingual voice conversion, and to measure the impact of the potential performance drop. In the study, performance is measured in terms of naturalness and speaker similarity between the generated speech and the target voice. For evaluation, listening tests are conducted, as well as objective comparisons of the synthesized speech. The results show that voice conversion between a Swedish and Norwegian speaker is possible and also that it can be performed without performance degradation in comparison to Swedish-to-Swedish conversion. Furthermore, conversion between Finnish and Swedish speakers, as well as Danish and Swedish speakers show a performance drop for the generated speech. However, despite the performance decrease, the model produces fluent and clearly articulated converted speech in all experiments. These results are noteworthy, especially since the network is trained on less than 15 minutes of nonparallel speaker data for each speaker. 
This thesis opens up further areas of research, for instance investigating more languages, evaluating more recent Generative Adversarial Network architectures, and devoting more resources to tuning the hyperparameters to further optimize the model for multilingual voice conversion.
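The cycle-consistency idea behind a Cycle-Consistent Adversarial Network can be sketched in a few lines. This is a toy illustration under stated assumptions, not the thesis's model: features are plain lists of floats, the two generators are arbitrary callables, and the L1 distance is the form commonly used for cycle losses rather than one the abstract specifies.

```python
def l1_distance(x, y):
    """Mean absolute difference between two equal-length feature sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def cycle_consistency_loss(x, y, g_xy, g_yx):
    """L1 cycle loss for two generators.

    Converting x to the target voice and back (x -> Y -> X) should recover x,
    and likewise y -> X -> Y should recover y. No parallel data is needed,
    because each sample is only ever compared with its own reconstruction.
    """
    forward = l1_distance(g_yx(g_xy(x)), x)
    backward = l1_distance(g_xy(g_yx(y)), y)
    return forward + backward
```

If `g_xy` adds 1 to every feature and `g_yx` subtracts 1, the two mappings are exact inverses and the loss is zero. In training, this term would be combined with the adversarial losses of the two discriminators.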
|
19 |
Similarités textuelles sémantiques translingues : vers la détection automatique du plagiat par traduction / Cross-lingual semantic textual similarity : towards automatic cross-language plagiarism detection
Ferrero, Jérémy 08 December 2017
The massive availability of documents on the Internet (e.g. web pages, data warehouses, and digital or transcribed texts) makes the recycling of ideas easier. Unfortunately, this phenomenon is accompanied by an increase in plagiarism cases. Indeed, claiming ownership of content without the consent of its author and without crediting its source, and presenting it as new and original, is considered plagiarism. In addition, the expansion of the Internet, which facilitates access to documents throughout the world (written in foreign languages), as well as increasingly efficient (and freely available) machine translation tools, contributes to the spread of a new kind of plagiarism: cross-language plagiarism. Cross-language plagiarism means plagiarism by translation, i.e. a text has been plagiarized while being translated (manually or automatically) from its original language into the language of the document in which the plagiarist wishes to include it. While prevention of plagiarism is an active field of research and development, it covers mostly monolingual comparison techniques. This thesis is a joint work between an academic laboratory (LIG) and Compilatio (a software publishing company of solutions for plagiarism detection), and proposes cross-lingual semantic textual similarity measures, which is an important sub-task of cross-language plagiarism detection.
After defining plagiarism and the different concepts discussed in this thesis, we present the state of the art in cross-language plagiarism detection approaches. We also present the preexisting corpora for cross-language plagiarism detection and show their limits. Then we describe how we have gathered and built a new dataset, which does not share most of the limits encountered by the preexisting corpora. Using this new dataset, we conduct a rigorous evaluation of several state-of-the-art methods and discover that they behave differently according to certain characteristics of the texts on which they operate. We next present new methods for measuring cross-lingual semantic textual similarities based on word embeddings. We also propose a notion of morphosyntactic and frequency weighting of words, which can be used both within a vector and within a bag-of-words, and we show that its introduction in the new methods increases their respective performance. Then we test different fusion systems (mostly based on linear regression). Our experiments show that we obtain better results than the state of the art in all the sub-corpora studied. We conclude by presenting and discussing the results of these methods obtained during our participation in the cross-lingual Semantic Textual Similarity (STS) task of SemEval-2017, where we ranked 1st on the sub-task that best corresponds to Compilatio's use-case scenario.
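A word-embedding similarity with per-word weighting can be sketched as below. This is a simplified illustration and not the method from the thesis: it assumes both languages already share one embedding space, represents the morphosyntactic and frequency weighting as a single hypothetical per-token weight, and uses toy two-dimensional vectors in the usage note.

```python
import math


def weighted_sentence_vector(tokens, embeddings, weights):
    """Weighted average of the word vectors of the tokens found in `embeddings`.

    Tokens missing from `weights` get weight 1.0; unknown tokens are skipped.
    `embeddings` must be non-empty so the vector dimension can be read off.
    """
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for token in tokens:
        if token in embeddings:
            w = weights.get(token, 1.0)
            total = [t + w * v for t, v in zip(total, embeddings[token])]
            weight_sum += w
    return [t / weight_sum for t in total] if weight_sum else total


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Given a toy shared space where "chat" and "cat" map to the same vector, as do "noir" and "black", the French tokens `["le", "chat", "noir"]` and the English tokens `["the", "black", "cat"]` receive a cosine similarity of about 1.0. Down-weighting the articles reduces their influence on the sentence vectors, which is the intuition behind weighting words by part of speech and frequency.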
|
20 |
Leveraging Degree of Isomorphism to Improve Cross-Lingual Embedding Space for Low-Resource Languages
Bhowmik, Kowshik January 2022
No description available.
|