1 |
Exploitation du contexte sémantique pour améliorer la reconnaissance des noms propres dans les documents audio diachroniques / Exploiting Semantic and Topic Context to Improve Recognition of Proper Names in Diachronic Audio Documents
Sheikh, Imran 24 November 2016
The diachronic nature of broadcast news causes frequent variations in linguistic content and vocabulary, leading to the problem of Out-Of-Vocabulary (OOV) words in automatic speech recognition. Most OOV words are proper names, which are important both for automatic indexing of audio-video content and for obtaining reliable automatic transcriptions. The goal of this thesis is to model the semantic and topical context of new proper names in order to retrieve those that are relevant to the spoken content of the audio document. Training context models is challenging in this task because many new names come with little data and the context model must be robust to errors in the automatic transcription. Probabilistic topic models and word embeddings from neural network models are explored for the task of retrieving relevant proper names. A thorough evaluation of these contextual representations is performed. It is argued that these representations, which are learned in an unsupervised manner, are not the best for the given retrieval task. Neural network context models trained with the objective of maximising retrieval performance are proposed. The proposed Neural Bag-of-Weighted-Words (NBOW2) model learns to assign a degree of importance to input words and can capture task-specific keywords. Experiments on automatic speech recognition of French broadcast news videos demonstrate the effectiveness of the proposed models. Evaluation of the NBOW2 model on standard text classification tasks shows that it learns interesting information and gives the best classification accuracies among the BOW models.
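The word-weighting idea in this abstract can be illustrated with a minimal sketch. It assumes that each word's importance is a sigmoid of the dot product between its embedding and a learned vector, here given the hypothetical name `anchor`, and that the document vector is the importance-weighted average of the word embeddings. The toy embeddings are made up, training is omitted, and nothing here is taken from the thesis's actual implementation.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def nbow2_document_vector(word_vectors, anchor):
    """Return (document_vector, weights) for a list of word embeddings.

    Assumption: the importance of a word is sigmoid(v . anchor), where
    `anchor` stands in for a parameter that would be learned during training.
    """
    if not word_vectors:
        raise ValueError("document has no words")
    weights = [
        sigmoid(sum(v_i * a_i for v_i, a_i in zip(v, anchor)))
        for v in word_vectors
    ]
    dim = len(anchor)
    doc = [0.0] * dim
    for w, v in zip(weights, word_vectors):
        for i in range(dim):
            doc[i] += w * v[i]
    # Average over the number of words in the document.
    return [x / len(word_vectors) for x in doc], weights


# Toy usage: the first word aligns with the anchor, the second opposes it.
doc_vec, importances = nbow2_document_vector(
    [[2.0, 0.0], [-2.0, 0.0]], anchor=[1.0, 0.0]
)
print(importances)
```

In a trained model the document vector would feed a classification or retrieval layer, and words aligned with the anchor would contribute more to it than words that are not.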
|
2 |
Improved Cross-language Information Retrieval via Disambiguation and Vocabulary Discovery
Zhang, Ying, ying.yzhang@gmail.com January 2007
Cross-lingual information retrieval (CLIR) allows people to find documents irrespective of the language used in the query or document. This thesis is concerned with the development of techniques to improve the effectiveness of Chinese-English CLIR. In Chinese-English CLIR, the accuracy of dictionary-based query translation is limited by two major factors: translation ambiguity and the presence of out-of-vocabulary (OOV) terms. We explore alternative methods for translation disambiguation, and demonstrate new techniques based on a Markov model and the use of web documents as a corpus to provide context for disambiguation. This simple disambiguation technique has proved to be extremely robust and successful. Queries that seek topical information typically contain OOV terms that may not be found in a translation dictionary, leading to inappropriate translations and consequent poor retrieval performance. Our novel OOV term translation method is based on the Chinese authorial practice of including unfamiliar English terms in both languages. It automatically extracts correct translations from the web and can be applied to both Chinese-English and English-Chinese CLIR. Our OOV translation technique does not rely on prior segmentation and is thus free from segmentation error. It leads to a significant improvement in CLIR effectiveness and can also be used to improve Chinese segmentation accuracy. Good-quality translation resources, especially bilingual dictionaries, are valuable for effective CLIR. We developed a system to facilitate construction of a large-scale translation lexicon of Chinese-English OOV terms using the web. Experimental results show that this method is reliable and of practical use in query translation. In addition, parallel corpora provide a rich source of translation information. 
We have also developed a system that uses multiple features to identify parallel texts via a k-nearest-neighbour classifier, to automatically collect high-quality parallel Chinese-English corpora from the web. These two automatic web mining systems are highly reliable and easy to deploy. In this research, we provided new ways to acquire linguistic resources using multilingual content on the web. These linguistic resources not only improve the efficiency and effectiveness of Chinese-English cross-language web retrieval but also have wider applications than CLIR.
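The OOV translation idea in this abstract, mining text where an English term appears alongside its Chinese equivalent, can be sketched roughly as follows. This is an illustration under stated assumptions, not the thesis's method: it assumes the English term appears in parentheses directly after the Chinese text, and it scores every suffix of the preceding Chinese run by how often it recurs, so no word segmenter is needed. The names `PAIR`, `candidate_translations`, and `best_translation` and the sample snippets are hypothetical.

```python
import re
from collections import Counter

# A run of 2 to 8 CJK ideographs followed by a Latin-script term in
# ASCII or full-width parentheses.
PAIR = re.compile(
    r"([\u4e00-\u9fff]{2,8})\s*[(（]\s*([A-Za-z][A-Za-z0-9 .\-]*?)\s*[)）]"
)


def candidate_translations(snippets, min_len=2):
    """Count Chinese suffix candidates for each parenthesised English term."""
    counts = {}
    for text in snippets:
        for match in PAIR.finditer(text):
            run, english = match.group(1), match.group(2)
            counter = counts.setdefault(english, Counter())
            # The left boundary of the Chinese term is unknown, so every
            # suffix of the run (down to min_len characters) is a candidate.
            for start in range(len(run) - min_len + 1):
                counter[run[start:]] += 1
    return counts


def best_translation(counts, english):
    """Return the most frequent Chinese candidate for an English term."""
    return counts[english].most_common(1)[0][0]


snippets = ["他感染了非典(SARS)病毒", "专家讨论非典(SARS)的传播"]
counts = candidate_translations(snippets)
print(best_translation(counts, "SARS"))
```

Counting suffixes across many snippets lets the shared string surface without segmentation, since the surrounding context varies while the translated term stays the same.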
|
3 |
Analyzing Large Language Models For Classifying Sexual Harassment Stories With Out-of-Vocabulary Word Substitution
Seung Yeon Paik (18419409) 25 April 2024
Sexual harassment is regarded as a serious issue in society, with a particularly negative impact on young children and adolescents. Online sexual harassment has recently gained prominence as a significant share of communication has moved online. It can happen anywhere in the world because of the global nature of the internet, which transcends geographical barriers and allows people to communicate electronically. Online sexual harassment can occur in a wide variety of environments, such as through work mail or chat apps in the workplace, on social media, in online communities, and in games (Chawki & El Shazly, 2013).
However, people, especially non-native English speakers, may vary in their understanding or interpretation of text-based sexual harassment because of cultural differences and language barriers (Welsh, Carr, MacQuarrie, & Huntley, 2006). To bridge this gap, previous studies have proposed large language models to detect and classify online sexual harassment, prompting a need to explore how language models comprehend the nuanced aspects of sexual harassment data. Prior to exploring the role of language models, it is critical to recognize the current gaps in knowledge that these models could potentially address in order to comprehend and interpret the complex nature of sexual harassment.

Large Language Models (LLMs) have attracted significant attention recently due to their exceptional performance on a broad spectrum of tasks. However, these models are known to be very sensitive to input data (Fujita et al., 2022; Wei, Wang, et al., 2022). Thus, the purpose of this study is to examine how various LLMs interpret data that falls under the domain of sexual harassment and how they comprehend it after Out-of-Vocabulary words are replaced.

This research examines the impact of Out-of-Vocabulary words on the performance of LLMs in classifying sexual harassment behaviors in text. The study compares the story classification abilities of cutting-edge LLMs before and after the replacement of Out-of-Vocabulary words. Through this investigation, the study provides insights into the flexibility and contextual awareness of LLMs when managing delicate narratives in the context of sexual harassment stories, and raises awareness of sensitive social issues.
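The abstract does not say how Out-of-Vocabulary words are detected or replaced, so the following is only a sketch of what such a preprocessing step could look like. It assumes a known vocabulary set and a hand-built substitution table; both, along with the function name `substitute_oov`, are hypothetical. The classifier itself is out of scope here.

```python
import re


def substitute_oov(text, vocabulary, substitutions):
    """Replace out-of-vocabulary tokens that have a known substitute.

    Returns the rewritten text and the list of tokens that were replaced.
    Tokens missing from both the vocabulary and the substitution table
    are left unchanged.
    """
    tokens = re.findall(r"\w+|[^\w\s]", text)
    rewritten, replaced = [], []
    for token in tokens:
        key = token.lower()
        if key.isalpha() and key not in vocabulary and key in substitutions:
            rewritten.append(substitutions[key])
            replaced.append(token)
        else:
            rewritten.append(token)
    return " ".join(rewritten), replaced


# Hypothetical vocabulary and substitution table.
vocabulary = {"he", "kept", "me", "messaging"}
substitutions = {"dming": "messaging"}
print(substitute_oov("he kept dming me", vocabulary, substitutions))
```

Running a classifier on both the original and the rewritten text would then allow the kind of before-and-after comparison the abstract describes.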
|
4 |
Mining of Textual Data from the Web for Speech Recognition
Kubalík, Jakub January 2010
The initial goal of this project was to study language modelling for speech recognition and techniques for obtaining textual data from the Web. The text introduces the basic techniques of speech recognition and describes in more detail language models based on statistical methods. In particular, the work deals with criteria for evaluating the quality of language models and speech recognition systems. The text further describes models and techniques of data mining, especially information retrieval. It then presents the problems associated with obtaining data from the web and, in contrast, introduces the Google search engine. Part of the project was the design and implementation of a system for obtaining text from the web, which is described in detail. However, the main goal of the work was to verify whether data obtained from the Web can bring any benefit to speech recognition. The described techniques therefore try to find the optimal way to use data obtained from the Web to improve sample language models as well as models deployed in real recognition systems.
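The abstract mentions criteria for evaluating language models without naming them. Perplexity is one commonly used criterion, and the sketch below shows how it could be computed for a unigram model with add-one smoothing. The choice of perplexity, the smoothing scheme, and the function names `train_unigram` and `perplexity` are assumptions made for illustration, not details taken from the thesis.

```python
import math
from collections import Counter


def train_unigram(tokens):
    """Return a probability function for an add-one smoothed unigram model."""
    counts = Counter(tokens)
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot shared by all unseen words

    def prob(word):
        return (counts[word] + 1) / (total + vocab_size)

    return prob


def perplexity(prob, tokens):
    """Perplexity of a model on held-out tokens (lower is better)."""
    log_sum = sum(math.log(prob(t)) for t in tokens)
    return math.exp(-log_sum / len(tokens))


model = train_unigram("the cat sat on the mat".split())
print(perplexity(model, "the dog sat on the mat".split()))
```

To compare a baseline model with one trained on additional web text, both would be scored on the same held-out data; whether the web data helps depends on how well it matches the target domain.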
|