11

Joint Evaluation Of Multiple Speech Patterns For Speech Recognition And Training

Nair, Nishanth Ulhas 01 1900
Improving speech recognition performance in the presence of noise and interference continues to be a challenging problem. Automatic Speech Recognition (ASR) systems work well when the test and training conditions match, but in real-world environments there is often a mismatch between the two. Various factors such as additive noise, acoustic echo, and speaker accent affect speech recognition performance. Since ASR is a statistical pattern recognition problem, if the test patterns are unlike anything used to train the models, errors are bound to occur due to feature-vector mismatch. Various approaches to robustness have been proposed in the ASR literature, contributing mainly to two themes: (i) reducing the variability in the feature vectors, or (ii) modifying the statistical model parameters to suit the noisy condition. While some of these techniques are quite effective, we examine robustness from a different perspective. Consider the analogy of human communication over the telephone: it is quite common to ask the person speaking to us to repeat certain portions of their speech because we did not understand them. This happens more often in the presence of background noise, where speech intelligibility is significantly degraded. Although the exact way humans decode multiple repetitions of speech is not known, it is quite possible that we combine the knowledge of the multiple utterances to decode the unclear parts. The majority of ASR algorithms do not address this possibility, except in very specific contexts such as pronunciation modeling. We recognize that under very high noise or bursty error channels, such as packet communication where packets get dropped, it would be beneficial to exploit repeated utterances for robust ASR. In this thesis, we formulate a set of algorithms for joint evaluation/decoding of noisy test utterances and use the same formulation for selective training of Hidden Markov Models (HMMs), again for robust performance. We first address joint recognition of multiple speech patterns given that they belong to the same class, formulating the problem for isolated words. If there are K test patterns (K ≥ 2) of a word by a speaker, we show that it is possible to improve speech recognition accuracy over independent single-pattern evaluation, for both clean and noisy speech. We also find the state sequence which best represents the K patterns. The formulation can be extended to connected-word or continuous speech recognition as well. Next, we consider the benefits of joint multi-pattern likelihood for HMM training. In standard HMM training, all the training data is used to arrive at the best possible parametric model. However, the training data may not all be genuine: it may contain labeling errors, noise corruption, or plain outlier exemplars. Such outliers result in poorer models and degrade speech recognition performance, so it is important to train selectively, giving the outliers less weight. Down-weighting an entire outlier pattern has been addressed before in the speech recognition literature. However, it is possible that only some portions of a training pattern are corrupted, so only those corrupted portions, and not the entire pattern, should receive less weight during HMM training. Since HMM training uses multiple patterns of speech from each class, we show that joint evaluation methods can be used to selectively train HMMs in exactly this way. Thus, we address all three main tasks of an HMM so as to jointly exploit the availability of multiple patterns belonging to the same class. We evaluated the new algorithms on Isolated Word Recognition for both clean and noisy speech, obtaining significant improvements in recognition performance, especially for speech affected by transient/burst noise.
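The joint-decoding idea can be illustrated with a small sketch. Assume, as a simplification of the thesis' formulation, that the K repetitions of a word have been pre-aligned to a common length T (the thesis instead handles alignment inside the dynamic program); a joint Viterbi pass then finds the single HMM state sequence that maximizes the summed log-likelihood of all K patterns. All function and variable names here are our own illustration, not the thesis' notation.

```python
import numpy as np

def joint_viterbi(log_A, log_pi, emission_logp):
    """Decode one state sequence that jointly explains K patterns.

    log_A: (S, S) log transition matrix, log_A[i, j] = log P(j | i)
    log_pi: (S,) initial state log-probabilities
    emission_logp: (K, T, S) per-pattern emission log-likelihoods,
        assumed time-aligned to a common length T (e.g., via DTW)
    """
    K, T, S = emission_logp.shape
    joint_emis = emission_logp.sum(axis=0)   # combine evidence of K patterns
    delta = log_pi + joint_emis[0]
    psi = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A      # scores[prev_state, next_state]
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + joint_emis[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):            # backtrace
        path.append(int(psi[t, path[-1]]))
    return list(reversed(path)), float(delta.max())

# Toy usage: 3-state left-to-right HMM, K = 2 repetitions of length 10
log_A = np.log(np.array([[.7, .3, 0], [0, .7, .3], [0, 0, 1.]]) + 1e-12)
log_pi = np.log(np.array([1., 0, 0]) + 1e-12)
emission_logp = np.log(np.random.rand(2, 10, 3))
states, score = joint_viterbi(log_A, log_pi, emission_logp)
```

With K = 1 this reduces to ordinary Viterbi decoding; summing the per-pattern emission scores is what lets clean frames in one repetition compensate for noise-corrupted frames in another.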
12

Mispronunciation Detection with SpeechBlender Data Augmentation Pipeline / Uttalsfelsdetektering med SpeechBlender data-förstärkning

Elkheir, Yassine January 2023
The rise of multilingualism has fueled the demand for computer-assisted pronunciation training (CAPT) systems for language learning. CAPT systems make use of advances in speech technology and offer features such as learner assessment and curriculum management. Mispronunciation detection (MD) is a crucial aspect of CAPT, aimed at identifying and correcting mispronunciations in second-language learners' speech. One of the significant challenges in developing MD models is the limited availability of labeled second-language speech data. To overcome this, the thesis introduces SpeechBlender, a fine-grained data augmentation pipeline designed to generate mispronunciations. SpeechBlender targets different regions of a phonetic unit and blends raw speech signals through linear interpolation, resulting in erroneous pronunciation instances. This method yields more effective sample generation than traditional cut/paste methods. The thesis also explores the use of pre-trained automatic speech recognition (ASR) systems for MD, and examines various phone-level features that can be extracted from pre-trained ASR models and utilized for MD tasks. A deep neural model is proposed that enhances the representations of the extracted acoustic features combined with positional phoneme embeddings. The efficacy of the augmentation technique is demonstrated through a phone-level pronunciation quality assessment task using only non-native speech data with good pronunciation. Our proposed technique achieves state-of-the-art results on the Speechocean762 dataset [54] for ASR-dependent MD models at the phoneme level, with a 2.0% gain in Pearson Correlation Coefficient (PCC) compared to the previous state of the art [17]. Additionally, we demonstrate a 5.0% improvement at the phoneme level compared to our baseline. We also developed AraVoiceL2, the first Arabic pronunciation-learning corpus, to demonstrate the generality of our proposed model and augmentation technique, and used it to evaluate the effectiveness of our approach in improving mispronunciation detection for non-native learners of Arabic. Our experiments showed promising results, with a 4.6% increase in F1-score on the AraVoiceL2 test set, demonstrating the effectiveness of our model and augmentation technique in improving pronunciation learning for non-native speakers of Arabic.
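The core blending operation can be sketched in a few lines. The following is our own minimal illustration, not the thesis' implementation: given a well-pronounced token and a mispronounced (or different-phone) token of the same phonetic unit, samples inside a chosen region are linearly interpolated, producing a partially erroneous pronunciation instance. The region boundaries and weight are hypothetical parameters.

```python
import numpy as np

def blend_phone_segments(good, bad, region=(0.25, 0.75), alpha=0.5):
    """Linearly interpolate two raw-audio tokens of one phonetic unit.

    good, bad: 1-D arrays of raw samples at the same sampling rate
    region: fraction of the token over which blending is applied
    alpha: weight of the second signal inside the blended region
    """
    n = min(len(good), len(bad))
    good = good[:n].astype(float)
    bad = bad[:n].astype(float)
    start, end = int(region[0] * n), int(region[1] * n)
    out = good.copy()
    out[start:end] = (1 - alpha) * good[start:end] + alpha * bad[start:end]
    return out

# Example: blend the middle half of two 100 ms tokens at 16 kHz
rng = np.random.default_rng(0)
token_a, token_b = rng.standard_normal(1600), rng.standard_normal(1600)
augmented = blend_phone_segments(token_a, token_b, alpha=0.6)
```

Varying `region` and `alpha` controls where in the phone the error appears and how severe it is, which is what makes interpolation finer-grained than cutting and pasting whole units.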
13

Transformer Models for Machine Translation and Streaming Automatic Speech Recognition

Baquero Arnal, Pau 29 May 2023
Natural language processing (NLP) is a set of fundamental computing problems with immense applicability, as language is the natural communication vehicle for people. NLP, along with many other computer technologies, has been revolutionized in recent years by the impact of deep learning. This thesis is centered around two keystone problems for NLP: machine translation (MT) and automatic speech recognition (ASR); and a common deep neural architecture, the Transformer, which is leveraged to improve the technical solutions for some MT and ASR applications. ASR and MT can be utilized to produce cost-effective, high-quality multilingual texts for a wide array of media. Particular applications pursued in this thesis are news translation and automatic live captioning of television broadcasts. ASR and MT can also be combined with each other, for instance generating automatic translated subtitles from audio, or augmented with other NLP solutions: text summarization to produce a summary of a speech, or speech synthesis to create an automatic translated dubbing. These other applications fall outside the scope of this thesis, but can profit from the contributions it contains, as they help to improve the performance of the automatic systems on which they depend. This thesis contains an application of the Transformer architecture to MT as it was originally conceived, achieving state-of-the-art results in similar-language translation. In successive chapters, the thesis covers the adaptation of the Transformer as a language model for streaming hybrid ASR systems. Afterwards, it describes how we applied the developed technology to a specific use case in television captioning by participating in a competitive challenge and achieving first position by a large margin. We also show that the gains came mostly from the improvement in technology capabilities over two years, including the Transformer language model adapted for streaming, while the data component was minor. / Baquero Arnal, P. (2023). Transformer Models for Machine Translation and Streaming Automatic Speech Recognition [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/193680
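A language model for streaming ASR has to score hypotheses incrementally, with bounded context, rather than re-reading an ever-growing transcript. The sketch below is our own toy illustration of that constraint, not the thesis' architecture: an untrained causal Transformer LM that scores the next token of a partial hypothesis over a sliding window of at most `window` previous tokens; all sizes and names are made up.

```python
import torch
import torch.nn as nn

class StreamingTransformerLM(nn.Module):
    """Toy causal Transformer LM with a bounded sliding context."""

    def __init__(self, vocab=1000, d=128, heads=4, layers=2, window=64):
        super().__init__()
        self.window = window
        self.emb = nn.Embedding(vocab, d)
        self.pos = nn.Embedding(window, d)
        layer = nn.TransformerEncoderLayer(d, heads, 4 * d, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, layers)
        self.out = nn.Linear(d, vocab)

    def score_next(self, prefix_ids):
        """Log-probabilities of the next token given a truncated prefix."""
        ctx = prefix_ids[-self.window:]          # keep latency/memory bounded
        x = torch.tensor(ctx).unsqueeze(0)       # (1, T)
        T = x.size(1)
        h = self.emb(x) + self.pos(torch.arange(T)).unsqueeze(0)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.enc(h, mask=causal)             # no peeking at future tokens
        return torch.log_softmax(self.out(h[0, -1]), dim=-1)

lm = StreamingTransformerLM()
logp = lm.score_next([5, 42, 7])   # rescore a partial ASR hypothesis
```

In a hybrid system such scores would be combined with the decoder's existing n-gram LM scores during or after beam search; the window truncation here is the simplest possible stand-in for the history-management strategies a real streaming LM needs.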
14

El aporte del rehablado off-line a la transcripción asistida de corpus orales

Rufino Morales, Marimar 04 1900
This research addresses one of the major challenges associated with the empirical study of linguistic phenomena: the optimization of material and human transcription resources. To do so, it highlights the value of off-line respeaking, a method of voice-assisted transcription using automatic speech recognition (ASR) software modelled after voice subtitling for television broadcasts. The task of transcribing spontaneous speech is arduous and complex; one must account for all the components of communication: linguistic, extralinguistic and paralinguistic, notwithstanding the difficulties posed by spontaneous speech, self-corrections, hesitations, repetitions, variations and contact phenomena. To evaluate the work required to generate a quality product, a selection of interviews from the Spoken Corpus of the Spanish Language in Montreal (COLEM), which reflects all the varieties of Spanish spoken in Montreal (i.e., in contact with French and English), was transcribed through respeaking. The quality of the transcriptions was evaluated for accuracy, since the more accurate they were, the less time was needed for correction. To obtain accuracy percentages closer to reality (albeit lower than those obtained in other research), we considered not only words incorrectly added, deleted, or substituted, but also issues related to punctuation marks, descriptive labels, and typographical markers specific to COLEM transcription conventions. We also considered the time required to produce and correct the transcriptions. The results were compared with manual (typed) and automatic transcriptions. Manual input offers the flexibility needed to achieve the level of accuracy required for transcription, but it is neither the fastest nor the most rigorous method. As for automatic transcriptions, none fully meets the conditions required to save time or reduce editing effort. The performance of automatic speech recognition also fluctuates with the speakers and the characteristics of the recordings, causing considerable variation in the time needed to correct transcriptions. The most stable results were obtained with respoken transcriptions made in real time, and those made with software installed on the computer were better than the others. Since it minimizes the variability of acoustic signals, provides indicators for the representation of dialogical construction, and promotes automatic recognition of vocabulary derived from varieties of Spanish as well as other languages, respeaking requires an average of only 9.2 minutes for each minute of COLEM recording, including real-time respeaking and two revisions made from the audio by two different individuals. In addition, the ASR errors have been categorized according to whether they concern misspelling or non-compliance with data protection, inaccuracies in the segmentation of linguistic units, in the written representation of speech-interruption mechanisms, in dialogical construction, or in the lexicon.
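The accuracy measure described above is essentially word-level edit distance with punctuation and transcription-convention marks kept as scoreable tokens. Below is a minimal sketch of that kind of scoring, with our own simplified whitespace tokenization (the COLEM conventions are richer than this):

```python
def transcription_accuracy(reference, hypothesis):
    """Accuracy = 1 - (S + D + I) / N over whitespace tokens.

    Punctuation, descriptive labels (e.g., [risa]) and typographic
    marks are kept as tokens, so errors on them are counted just
    like inserted, deleted or substituted words.
    """
    ref, hyp = reference.split(), hypothesis.split()
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                                # deletions
    for j in range(n + 1):
        d[0][j] = j                                # insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + sub)   # substitution / match
    return 1.0 - d[m][n] / m

# The dropped comma, ellipsis and label each count as one error:
acc = transcription_accuracy("bueno , este ... sí [risa]", "bueno este sí")
```

Counting convention marks this way is what pushes the reported accuracy percentages below those of studies that score plain words only.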
15

Speaker adaptation of deep neural network acoustic models using Gaussian mixture model framework in automatic speech recognition systems / Utilisation de modèles gaussiens pour l'adaptation au locuteur de réseaux de neurones profonds dans un contexte de modélisation acoustique pour la reconnaissance de la parole

Tomashenko, Natalia 01 December 2017
Differences between training and testing conditions may significantly degrade recognition accuracy in automatic speech recognition (ASR) systems. Adaptation is an efficient way to reduce the mismatch between models and data from a particular speaker or channel. There are two dominant types of acoustic models (AMs) used in ASR: Gaussian mixture models (GMMs) and deep neural networks (DNNs). The GMM hidden Markov model (GMM-HMM) approach has been one of the most common techniques in ASR systems for many decades. Speaker adaptation is very effective for these AMs, and various adaptation techniques have been developed for them. On the other hand, DNN-HMM AMs have recently achieved significant advances and outperformed GMM-HMM models on various ASR tasks. However, speaker adaptation is still very challenging for these AMs: many adaptation algorithms that work well for GMM systems cannot be easily applied to DNNs because of the different nature of these models. The main purpose of this thesis is to develop a method for efficient transfer of adaptation algorithms from the GMM framework to DNN models. A novel approach for speaker adaptation of DNN AMs is proposed and investigated, based on using so-called GMM-derived features as input to a DNN. The proposed technique provides a general framework for transferring adaptation algorithms developed for GMMs to DNN adaptation. It is explored for various state-of-the-art ASR systems and is shown to be effective in comparison with other speaker adaptation techniques, and complementary to them.
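The transfer mechanism can be sketched compactly. The following is our simplified reading, with invented names and sizes: an auxiliary GMM is trained on acoustic frames, each frame is represented by its vector of GMM component posteriors (spliced with neighbouring frames), and those vectors feed the DNN. Classical GMM adaptation, such as MAP re-estimation of the component means on a speaker's data, then changes the posteriors, so the adaptation reaches the DNN through its input features.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
frames = rng.standard_normal((5000, 39))      # stand-in for MFCC features

# Auxiliary GMM trained on (speaker-independent) acoustic frames
gmm = GaussianMixture(n_components=64, covariance_type="diag",
                      random_state=0).fit(frames)

def gmm_derived_features(x, gmm, context=2):
    """Per-frame GMM component posteriors, spliced with +/- context frames."""
    post = gmm.predict_proba(x)                          # (T, 64)
    padded = np.pad(post, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(x)]               # stack neighbours
                      for i in range(2 * context + 1)])

dnn_input = gmm_derived_features(frames, gmm)            # (T, 64 * 5)
```

Because any GMM adaptation method (MAP, feature-space transforms, etc.) can be applied to the auxiliary model before the posteriors are computed, the existing catalogue of GMM adaptation techniques becomes available to the DNN front end, which is the point of the approach.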
