491 |
A Comparative Analysis of Whisper and VoxRex on Swedish Speech Data. Fredriksson, Max; Ramsay Veljanovska, Elise. January 2024.
With the constant development of more advanced speech recognition models, the need to determine which models are better in specific areas and for specific purposes becomes increasingly crucial. This is even more true for low-resource languages such as Swedish, which depend on the progress of models for the large international languages. Lagerlöf (2022) conducted a comparative analysis between Google's speech-to-text model and NLoS's VoxRex B, concluding that VoxRex was the best choice for Swedish audio. Since then, OpenAI has released its Automatic Speech Recognition model Whisper, prompting a reassessment of the preferred choice for transcribing Swedish. In this comparative analysis using data from Swedish radio news segments, Whisper performs better than VoxRex in tests on the raw output, a result strongly influenced by its more proficient sentence construction. It is not possible to conclude which model is better at pure word prediction; however, the results favor VoxRex, which displays lower variability. Thus, even though Whisper can predict full text better, the decision of which model to use should be determined by the user's needs.
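A minimal sketch of how such a WER comparison between the two models can be run, assuming the Hugging Face checkpoints named below stand in for the Whisper and VoxRex variants used in the thesis and that the `jiwer` package is available; file paths and model names are illustrative, not the study's actual setup.

```python
# Sketch: score two ASR systems on the same Swedish test set with WER.
# Assumptions: the checkpoints below are suitable Whisper/VoxRex variants on
# Hugging Face; "clips/*.wav" and "refs.txt" (one reference per clip, sorted
# by filename) are placeholder test data.
import glob
import re
import jiwer
from transformers import pipeline

whisper = pipeline("automatic-speech-recognition", model="openai/whisper-small")
voxrex = pipeline("automatic-speech-recognition",
                  model="KBLab/wav2vec2-large-voxrex-swedish")

def normalise(text):
    # Strip punctuation and casing so they do not dominate the word-level score
    return re.sub(r"[^\w\såäö]", "", text.lower()).strip()

refs = [normalise(line) for line in open("refs.txt", encoding="utf-8")]
wavs = sorted(glob.glob("clips/*.wav"))

def transcribe(asr):
    return [normalise(asr(w)["text"]) for w in wavs]

print("Whisper WER:", jiwer.wer(refs, transcribe(whisper)))
print("VoxRex  WER:", jiwer.wer(refs, transcribe(voxrex)))
```

Because Whisper emits punctuation and casing while wav2vec2-style models usually do not, the normalisation step above matters when comparing pure word prediction rather than raw output.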
492 |
Röststyrning i industriella miljöer : En undersökning av ordfelsfrekvens för olika kombinationer mellan modellarkitekturer, kommandon och brusreduceringstekniker / Voice command in industrial environments : An investigation of Word Error Rate for different combinations of model architectures, commands and noise reduction techniques. Eriksson, Ulrika; Hultström, Vilma. January 2024.
Röststyrning som användargränssnitt kan erbjuda flera fördelar jämfört med mer traditionella styrmetoder. Det saknas dock färdiga lösningar för specifika industriella miljöer, vilka ställer särskilda krav på att korta kommandon tolkas korrekt i olika grad av buller och med begränsad eller ingen internetuppkoppling. Detta arbete ämnade undersöka potentialen för röststyrning i industriella miljöer. Ett koncepttest genomfördes där ordfelsfrekvens (på engelska Word Error Rate eller kortare WER) användes för att utvärdera träffsäkerheten för olika kombinationer av taligenkänningsarkitekturer, brusreduceringstekniker samt kommandolängder i verkliga bullriga miljöer. Undersökningen tog dessutom hänsyn till Lombard-effekten. Resultaten visar att det för samtliga testade miljöer finns god potential för röststyrning med avseende på träffsäkerheten. Framför allt visade DeepSpeech, en djupinlärd taligenkänningsmodell med rekurrent lagerstruktur, kompletterad med domänspecifika språkmodeller och en riktad kardioid-mikrofon en ordfelsfrekvens på noll procent i vissa scenarier och sällan över fem procent. Resultaten visar även att utformningen av kommandon påverkar ordfelsfrekvensen. För en verklig implementation i industriell miljö behövs ytterligare studier om säkerhetslösningar, inkluderande autentisering och hantering av risker med falskt positivt tolkade kommandon. / Voice command as a user interface can offer several advantages over more traditional control methods. However, there is a lack of ready-made solutions for specific industrial environments, which place particular demands on short commands being interpreted correctly in varying degrees of noise and with limited or no internet connection. This work aimed to investigate the potential for voice command in industrial environments. A proof of concept was conducted where Word Error Rate (WER) was used to evaluate the accuracy of various combinations of speech recognition architectures, noise reduction techniques, and command lengths in authentic noisy environments. The investigation also took into account the Lombard effect. The results indicate that for all tested environments there is good potential for voice command with regard to accuracy. In particular, DeepSpeech, a deep-learned speech recognition model with recurrent layer structure, complemented with domain-specific language models and a directional cardioid microphone, showed WER values of zero percent in certain scenarios and rarely above five percent. The results also demonstrate that the design of commands influences WER. For a real implementation in an industrial environment, further studies are needed on security solutions, including authentication and management of risks with false positive interpreted commands.
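A sketch of the kind of evaluation loop described above: short commands are recognised with and without a simple noise-reduction step and scored with WER. The recognizer checkpoint, the `noisereduce` spectral-gating step and the command list are illustrative stand-ins, not the DeepSpeech-based setup of the thesis.

```python
# Sketch: WER for short commands, with and without simple noise reduction.
# The ASR checkpoint and file names are placeholders, not the thesis setup.
import jiwer
import librosa
import noisereduce as nr
from transformers import pipeline

asr = pipeline("automatic-speech-recognition",
               model="facebook/wav2vec2-base-960h")  # stand-in recognizer

commands = {                      # reference text -> recording (illustrative)
    "start the conveyor": "cmd_start.wav",
    "stop": "cmd_stop.wav",
}

def recognize(path, denoise):
    audio, sr = librosa.load(path, sr=16000)
    if denoise:
        audio = nr.reduce_noise(y=audio, sr=sr)      # spectral gating
    return asr({"raw": audio, "sampling_rate": sr})["text"].lower()

for denoise in (False, True):
    hyps = [recognize(p, denoise) for p in commands.values()]
    label = "denoised" if denoise else "raw audio"
    print(label, "WER:", jiwer.wer(list(commands.keys()), hyps))
```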
493 |
Multimodal interactive structured prediction. Alabau Gonzalvo, Vicente. 27 January 2014.
This thesis presents scientific contributions to the field of multimodal interactive structured prediction (MISP). The aim of MISP is to reduce, in an efficient and ergonomic way, the human effort required to supervise an automatic output. Hence, this thesis focuses on the two aspects of MISP systems. The first aspect, the interactive part of MISP, is the study of strategies for efficient human–computer collaboration to produce error-free outputs. The second aspect, multimodality, deals with modalities of communication with the computer that are more ergonomic than keyboard and mouse.
To begin with, in sequential interaction the user is assumed to supervise the output from left to right, so that errors are corrected in sequential order. We study the problem under the decision-theory framework and define an optimum decoding algorithm, which is compared to the usually applied, standard approach. Experimental results on several tasks suggest that the optimum algorithm is slightly better than the standard algorithm.
In contrast to sequential interaction, in active interaction it is the system that decides what should be given to the user for supervision. On the one hand, user supervision can be reduced if the user is required to supervise only the outputs that the system expects to be erroneous. In this respect, we define a strategy that retrieves first the outputs with the highest expected error. Moreover, we prove that this strategy is optimum under certain conditions, which is validated by experimental results. On the other hand, if the goal is to reduce the number of corrections, active interaction works by selecting elements one by one, e.g., words of a given output, to be supervised by the user. Several strategies are compared for this case. Unlike the previous case, the strategy that performs best is to choose the element with the highest confidence, which coincides with the findings of the optimum algorithm for sequential interaction. However, this also suggests that minimizing effort and minimizing supervision are contradictory goals.
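A toy illustration of the two active-interaction selection strategies just described, using made-up per-word confidence scores in place of the model posteriors an actual MISP system would provide.

```python
# Toy data: each automatic output is a list of (word, confidence) pairs.
# Real systems would derive these confidences from model posteriors.
outputs = {
    "sent_1": [("the", 0.98), ("cat", 0.40), ("sat", 0.95)],
    "sent_2": [("a", 0.99), ("dog", 0.97), ("barked", 0.90)],
}

# Strategy A (reduce supervision): present whole outputs to the user in
# decreasing order of expected error, approximated here as the sum of (1 - conf).
def expected_error(words):
    return sum(1.0 - conf for _, conf in words)

order = sorted(outputs, key=lambda k: expected_error(outputs[k]), reverse=True)
print("supervise outputs in this order:", order)

# Strategy B (reduce corrections): present individual elements one by one;
# per the abstract, choosing the highest-confidence element worked best.
elements = [(sent, w, c) for sent, ws in outputs.items() for w, c in ws]
sent, word, conf = max(elements, key=lambda e: e[2])
print(f"next element to supervise: '{word}' from {sent} (confidence {conf})")
```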
With respect to the multimodality aspect, this thesis delves into techniques to make multimodal systems more robust. To achieve that, multimodal systems are improved by providing contextual information about the application at hand. First, we study how to integrate e-pen interaction in a machine translation task. We contribute to the state of the art by leveraging the information from the source sentence. Several strategies are compared, grouped into two main approaches: one inspired by word-based translation models and one based on n-grams generated from a phrase-based system. The experiments show that the former outperforms the latter for this task. Furthermore, the results show remarkable improvements over not using contextual information. Second, similar experiments are conducted on a speech-enabled interface for interactive machine translation. The improvements over the baseline are also noticeable; however, in this case, phrase-based models perform much better than word-based models. We attribute this to the fact that acoustic models are poorer estimations than morphologic models and thus benefit more from the language model. Finally, similar techniques are proposed for the dictation of handwritten documents. The results show that speech and handwriting recognition can be combined in an effective way.
Finally, an evaluation with real users is carried out to compare an interactive machine translation prototype with a post-editing prototype. The results of the study reveal that users are very sensitive to the usability aspects of the user interface. Usability is therefore a crucial aspect to consider in a human evaluation, since poor usability can hinder the real benefits of the technology being evaluated. Fortunately, once usability problems are fixed, the evaluation indicates that users favor working with the interactive machine translation system over the post-editing system. / Alabau Gonzalvo, V. (2014). Multimodal interactive structured prediction [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/35135 / Premios Extraordinarios de tesis doctorales
494 |
Different Contributions to Cost-Effective Transcription and Translation of Video Lectures. Silvestre Cerdà, Joan Albert. 05 April 2016.
[EN] In recent years, on-line multimedia repositories have experienced strong growth that has consolidated them as essential knowledge assets, especially
in the area of education, where large repositories of video lectures have been
built in order to complement or even replace traditional teaching methods.
However, most of these video lectures are neither transcribed nor translated
due to a lack of cost-effective solutions to do so in a way that gives accurate
enough results. Solutions of this kind are clearly necessary in order to make
these lectures accessible to speakers of different languages and to people with
hearing disabilities. They would also facilitate lecture searchability and
analysis functions, such as classification, recommendation or plagiarism
detection, as well as the development of advanced educational functionalities
like content summarisation to assist student note-taking.
For this reason, the main aim of this thesis is to develop a cost-effective
solution capable of transcribing and translating video lectures to a reasonable
degree of accuracy. More specifically, we address the integration of
state-of-the-art techniques in Automatic Speech Recognition and Machine
Translation into large video lecture repositories to generate high-quality
multilingual video subtitles without human intervention and at a reduced
computational cost. Also, we explore the potential benefits of the exploitation
of the information that we know a priori about these repositories, that is,
lecture-specific knowledge such as speaker, topic or slides, to create
specialised, in-domain transcription and translation systems by means of
massive adaptation techniques.
The proposed solutions have been tested in real-life scenarios by carrying out
several objective and subjective evaluations, obtaining very positive results.
The main outcome derived from this thesis, The transLectures-UPV
Platform, has been publicly released as open-source software and, at the
time of writing, is serving automatic transcriptions and translations for
several thousand video lectures in many Spanish and European
universities and institutions. / [ES] Durante estos últimos años, los repositorios multimedia on-line han experimentado un gran
crecimiento que les ha hecho establecerse como fuentes fundamentales de conocimiento,
especialmente en el área de la educación, donde se han creado grandes repositorios de vídeo
charlas educativas para complementar e incluso reemplazar los métodos de enseñanza tradicionales.
No obstante, la mayoría de estas charlas no están transcritas ni traducidas debido a
la ausencia de soluciones de bajo coste que sean capaces de hacerlo garantizando una calidad
mínima aceptable. Soluciones de este tipo son claramente necesarias para hacer que las vídeo
charlas sean más accesibles para hablantes de otras lenguas o para personas con discapacidades auditivas.
Además, dichas soluciones podrían facilitar la aplicación de funciones de
búsqueda y de análisis tales como clasificación, recomendación o detección de plagios, así
como el desarrollo de funcionalidades educativas avanzadas, como por ejemplo la generación
de resúmenes automáticos de contenidos para ayudar al estudiante a tomar apuntes.
Por este motivo, el principal objetivo de esta tesis es desarrollar una solución de bajo
coste capaz de transcribir y traducir vídeo charlas con un nivel de calidad razonable. Más
específicamente, abordamos la integración de técnicas estado del arte de Reconocimiento del
Habla Automático y Traducción Automática en grandes repositorios de vídeo charlas educativas
para la generación de subtítulos multilingües de alta calidad sin requerir intervención
humana y con un reducido coste computacional. Además, también exploramos los beneficios
potenciales que conllevaría la explotación de la información de la que disponemos a priori
sobre estos repositorios, es decir, conocimientos específicos sobre las charlas tales como el
locutor, la temática o las transparencias, para crear sistemas de transcripción y traducción
especializados mediante técnicas de adaptación masiva.
Las soluciones propuestas en esta tesis han sido testeadas en escenarios reales llevando
a cabo numerosas evaluaciones objetivas y subjetivas, obteniendo muy buenos resultados.
El principal legado de esta tesis, The transLectures-UPV Platform, ha sido liberado públicamente
como software de código abierto, y, en el momento de escribir estas líneas, está
sirviendo transcripciones y traducciones automáticas para diversos miles de vídeo charlas
educativas en numerosas universidades e instituciones españolas y europeas. / [CA] Durant aquests darrers anys, els repositoris multimèdia on-line han experimentat un gran
creixement que els ha fet consolidar-se com a fonts fonamentals de coneixement, especialment
a l'àrea de l'educació, on s'han creat grans repositoris de vídeo xarrades educatives per
tal de complementar o inclús reemplaçar els mètodes d'ensenyament tradicionals. No obstant
això, la majoria d'aquestes xarrades no estan transcrites ni traduïdes degut a l'absència de
solucions de baix cost capaces de fer-ho garantint una qualitat mínima acceptable. Solucions
d'aquest tipus són clarament necessàries per a fer que les vídeo xarrades siguen més accessibles
per a parlants d'altres llengües o per a persones amb discapacitats auditives. A més, aquestes
solucions podrien facilitar l'aplicació de funcions de cerca i d'anàlisi tals com classificació,
recomanació o detecció de plagis, així com el desenvolupament de funcionalitats educatives
avançades, com per exemple la generació de resums automàtics de continguts per ajudar a
l'estudiant a prendre anotacions.
Per aquest motiu, el principal objectiu d'aquesta tesi és desenvolupar una solució de baix
cost capaç de transcriure i traduir vídeo xarrades amb un nivell de qualitat raonable. Més
específicament, abordem la integració de tècniques estat de l'art de Reconeixement de la
Parla Automàtic i Traducció Automàtica en grans repositoris de vídeo xarrades educatives
per a la generació de subtítols multilingües d'alta qualitat sense requerir intervenció humana
i amb un reduït cost computacional. A més, també explorem els beneficis potencials que
comportaria l'explotació de la informació de la que disposem a priori sobre aquests repositoris,
és a dir, coneixements específics sobre les xarrades tals com el locutor, la temàtica o
les transparències, per a crear sistemes de transcripció i traducció especialitzats mitjançant
tècniques d'adaptació massiva.
Les solucions proposades en aquesta tesi han estat testejades en escenaris reals duent a
terme nombroses avaluacions objectives i subjectives, obtenint molt bons resultats. El principal
llegat d'aquesta tesi, The transLectures-UPV Platform, ha sigut alliberat públicament
com a programari de codi obert, i, en el moment d'escriure aquestes línies, està servint transcripcions
i traduccions automàtiques per a diversos milers de vídeo xarrades educatives en
nombroses universitats i institucions Espanyoles i Europees. / Silvestre Cerdà, JA. (2016). Different Contributions to Cost-Effective Transcription and Translation of Video Lectures [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/62194
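A minimal sketch of the transcription-then-translation pipeline that entry 494 describes for generating multilingual subtitles: ASR with timestamps followed by MT, written out as subtitles. The Whisper and OPUS-MT checkpoints, the file names and the SRT formatting below are generic placeholders, not the transLectures-UPV components.

```python
# Sketch: ASR with timestamps, then MT, written out as an SRT subtitle file.
# Checkpoints and file names are placeholders, not the transLectures-UPV stack.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small",
               return_timestamps=True)
mt = pipeline("translation", model="Helsinki-NLP/opus-mt-es-en")

def srt_time(t):
    h, rem = divmod(int(t), 3600)
    m, s = divmod(rem, 60)
    return f"{h:02}:{m:02}:{s:02},{int((t % 1) * 1000):03}"

result = asr("lecture_es.wav")                          # Spanish lecture (placeholder)
with open("lecture_en.srt", "w", encoding="utf-8") as srt:
    for i, chunk in enumerate(result["chunks"], start=1):
        start, end = chunk["timestamp"]
        end = end if end is not None else start + 2.0   # guard against open-ended chunks
        text_en = mt(chunk["text"])[0]["translation_text"]
        srt.write(f"{i}\n{srt_time(start)} --> {srt_time(end)}\n{text_en.strip()}\n\n")
```

Massive adaptation in the sense of the thesis would replace these generic checkpoints with systems specialised to the lecture's speaker, topic and slides.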
495 |
Transformer Models for Machine Translation and Streaming Automatic Speech Recognition. Baquero Arnal, Pau. 29 May 2023.
[ES] El procesamiento del lenguaje natural (NLP) es un conjunto de problemas
computacionales con aplicaciones de máxima relevancia, que junto con otras
tecnologías informáticas se ha beneficiado de la revolución que ha significado
el aprendizaje profundo. Esta tesis se centra en dos problemas fundamentales
para el NLP: la traducción automática (MT) y el reconocimiento automático
del habla o transcripción automática (ASR); así como en una arquitectura
neuronal profunda, el Transformer, que pondremos en práctica para mejorar
las soluciones de MT y ASR en algunas de sus aplicaciones.
El ASR y MT pueden servir para obtener textos multilingües de alta calidad a
un coste razonable para una diversidad de contenidos audiovisuales. Concre-
tamente, esta tesis aborda problemas como el de traducción de noticias o el de
subtitulación automática de televisión. El ASR y MT también se pueden com-
binar entre sí, generando automáticamente subtítulos traducidos, o con otras
soluciones de NLP: resumen de textos para producir resúmenes de discursos, o
síntesis del habla para crear doblajes automáticos. Estas aplicaciones quedan
fuera del alcance de esta tesis pero pueden aprovechar las contribuciones que
contiene, en la medida en que ayudan a mejorar el rendimiento de los sistemas
automáticos de los que dependen.
Esta tesis contiene una aplicación de la arquitectura Transformer al MT tal y
como fue concebida, mediante la que obtenemos resultados de primer nivel en
traducción de lenguas semejantes. En capítulos subsecuentes, esta tesis aborda
la adaptación del Transformer como modelo de lenguaje para sistemas híbri-
dos de ASR en vivo. Posteriormente, describe la aplicación de este tipo de
sistemas al caso de uso de subtitulación de televisión, participando en una com-
petición pública de RTVE donde obtenemos la primera posición con un margen
importante. También demostramos que la mejora se debe principalmente a la
tecnología desarrollada y no tanto a la parte de los datos. / [CA] El processament del llenguage natural (NLP) és un conjunt de problemes com-
putacionals amb aplicacions de màxima rellevància, que juntament amb al-
tres tecnologies informàtiques s'ha beneficiat de la revolució que ha significat
l'impacte de l'aprenentatge profund. Aquesta tesi se centra en dos problemes
fonamentals per al NLP: la traducció automàtica (MT) i el reconeixement
automàtic de la parla o transcripció automàtica (ASR); així com en una ar-
quitectura neuronal profunda, el Transformer, que posarem en pràctica per a
millorar les solucions de MT i ASR en algunes de les seues aplicacions.
l'ASR i MT poden servir per obtindre textos multilingües d'alta qualitat a un
cost raonable per a un gran ventall de continguts audiovisuals. Concretament,
aquesta tesi aborda problemes com el de traducció de notícies o el de subtitu-
lació automàtica de televisió. l'ASR i MT també es poden combinar entre ells,
generant automàticament subtítols traduïts, o amb altres solucions de NLP:
amb resum de textos per produir resums de discursos, o amb síntesi de la parla
per crear doblatges automàtics. Aquestes altres aplicacions es troben fora de
l'abast d'aquesta tesi però poden aprofitar les contribucions que conté, en la
mesura que ajuden a millorar els resultats dels sistemes automàtics dels quals
depenen.
Aquesta tesi conté una aplicació de l'arquitectura Transformer al MT tal com
va ser concebuda, mitjançant la qual obtenim resultats de primer nivell en
traducció de llengües semblants. En capítols subseqüents, aquesta tesi aborda
l'adaptació del Transformer com a model de llenguatge per a sistemes híbrids
d'ASR en viu. Posteriorment, descriu l'aplicació d'aquest tipus de sistemes al
cas d'ús de subtitulació de continguts televisius, participant en una competició
pública de RTVE on obtenim la primera posició amb un marge significant.
També demostrem que la millora es deu principalment a la tecnologia desen-
volupada i no tant a la part de les dades. / [EN] Natural language processing (NLP) is a set of fundamental computing problems with immense applicability, as language is the natural communication vehicle for people. NLP, along with many other computer technologies, has been revolutionized in recent years by the impact of deep learning. This thesis is centered around two keystone problems for NLP: machine translation (MT) and automatic speech recognition (ASR); and a common deep neural architecture, the Transformer, that is leveraged to improve the technical solutions for some MT and ASR applications.
ASR and MT can be utilized to produce cost-effective, high-quality multilingual texts for a wide array of media. Particular applications pursued in this thesis are news translation and the automatic live captioning of television broadcasts. ASR and MT can also be combined with each other, for instance to generate automatic translated subtitles from audio, or augmented with other NLP solutions: text summarization to produce a summary of a speech, or speech synthesis to create an automatic translated dubbing. These other applications fall outside the scope of this thesis, but can profit from the contributions that it contains, as they help to improve the performance of the automatic systems on which they depend.
This thesis contains an application of the Transformer architecture to MT as it was originally conceived, achieving state-of-the-art results in similar-language translation. In successive chapters, this thesis covers the adaptation of the Transformer as a language model for streaming hybrid ASR systems. Afterwards, it describes how we applied the developed technology to a specific use case in television captioning by participating in a competitive challenge and achieving first position by a large margin. We also show that the gains came mostly from the improvement in technology capabilities over two years, including the Transformer language model adapted for streaming, while the contribution of the data component was minor. / Baquero Arnal, P. (2023). Transformer Models for Machine Translation and Streaming Automatic Speech Recognition [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/193680
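A small sketch of one standard way a Transformer language model is plugged into a hybrid ASR system, namely rescoring an n-best list by combining acoustic and LM scores. The GPT-2 checkpoint, the toy hypotheses and the interpolation weight are stand-ins for the streaming Transformer LM, lattices and tuned weights of the thesis.

```python
# Sketch: n-best rescoring with a Transformer LM (GPT-2 here as a stand-in).
# Acoustic scores and hypotheses are toy values; the LM weight would normally
# be tuned on development data.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def lm_logprob(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss          # mean negative log-likelihood
    return -loss.item() * (ids.shape[1] - 1)     # approximate total log-prob

nbest = [                                        # (hypothesis, acoustic score)
    ("the whether is nice today", -120.3),
    ("the weather is nice today", -121.1),
]
lm_weight = 0.8
best = max(nbest, key=lambda h: h[1] + lm_weight * lm_logprob(h[0]))
print("selected hypothesis:", best[0])
```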
496 |
El aporte del rehablado off-line a la transcripción asistida de corpus orales. Rufino Morales, Marimar. 04 1900.
Cette recherche aborde un des grands défis liés à l'étude empirique des phénomènes linguistiques : l'optimisation des ressources matérielles et humaines pour la transcription. Pour ce faire, elle met en relief l’intérêt de la redite off-line, une méthode de transcription vocale à l’aide d’un logiciel de reconnaissance automatique de la parole inspirée du sous-titrage vocal pour les émissions de télé. La tâche de transcrire la parole spontanée est ardue et complexe; on doit rendre compte de tous les constituants de la communication : linguistiques, extralinguistiques et paralinguistiques, et ce, en dépit des difficultés que posent la parole spontanée, les autocorrections, les hésitations, les répétitions, les variations, les phénomènes de contact.
Afin d’évaluer le travail nécessaire pour générer un produit de qualité ont été transcrites par redite une sélection d’interviews du Corpus oral de la langue espagnole à Montréal (COLEM), qui reflète toutes les variétés d'espagnol parlées à Montréal (donc en contact avec le français et l'anglais). La qualité des transcriptions a été évaluée en fonction de leur exactitude, étant donné que plus elles sont exactes, moins le temps de correction est long. Afin d'obtenir des pourcentages d’exactitude plus fidèles à la réalité –même s’ils sont inférieurs à ceux d'autres recherches– ont été pris en compte non seulement les mots incorrectement ajoutés, supprimés ou substitués, mais aussi liées aux signes de ponctuation, aux étiquettes descriptives et aux marques typographiques propres aux conventions de transcription du COLEM. Le temps nécessaire à la production et à la correction des transcriptions a aussi été considéré. Les résultats obtenus ont été comparés à des transcriptions manuelles (dactylographiées) et à des transcriptions automatiques.
La saisie manuelle offre la flexibilité nécessaire pour obtenir le niveau d’exactitude requis pour la transcription, mais ce n'est ni la méthode la plus rapide ni la plus rigoureuse. Quant aux transcriptions automatiques, aucune ne remplit de façon satisfaisante les conditions requises pour gagner du temps ou réduire les efforts de révision. On a aussi remarqué que les performances de la reconnaissance automatique de la parole fluctuaient au gré des locuteurs et locutrices et des caractéristiques des enregistrements, causant des écarts considérables dans le temps de correction des transcriptions. Ce sont les transcriptions redites, effectuées en temps réel, qui donnent les résultats les plus stables; et celles qui ont été effectuées avec un logiciel installé sur l'ordinateur sont supérieures aux autres.
Puisqu’elle permet de minimiser la variabilité des signaux acoustiques, de fournir les indicateurs pour la représentation de la construction dialogique et de favoriser la reconnaissance automatique du vocabulaire issu de la variation de l'espagnol ainsi que d'autres langues, la méthode de redite ne demande en moyenne que 9,2 minutes par minute d'enregistrement du COLEM, incluant la redite en temps réel et deux révisions effectuées par deux personnes différentes à partir de l’audio.
En complément, les erreurs qui peuvent se manifester dans les transcriptions obtenues à l’aide de la technologie intelligente ont été catégorisées, selon qu’il s’agisse de non-respect de l'orthographe ou de la protection des données, d’imprécisions dans la segmentation des unités linguistiques, dans la représentation écrite des mécanismes d'interruption de la séquence de parole, dans la construction dialogique ou dans le lexique. / This research addresses one of the major challenges associated with the empirical study of linguistic phenomena: the optimization of material and human transcription resources. To do so, it highlights the value of off-line respeaking, a method of voice-assisted transcription using automatic speech recognition (ASR) software modelled after voice subtitling for television broadcasts. The task of transcribing spontaneous speech is an arduous and complex one; we must account for all the components of communication: linguistic, extralinguistic and paralinguistic, notwithstanding the difficulties posed by spontaneous speech, self-corrections, hesitations, repetitions, variations and contact phenomena.
To evaluate the work required to generate a quality product, a selection of interviews from the Spoken Corpus of the Spanish Language in Montreal (COLEM), which reflects all the varieties of Spanish spoken in Montreal (i.e., in contact with French and English), were transcribed through respeaking. The quality of the transcriptions was evaluated for accuracy, since the more accurate they were, the less time was needed for correction. To obtain accuracy percentages that are closer to reality –albeit lower than those obtained in other research– we considered not only words incorrectly added, deleted, or substituted, but also issues related to punctuation marks, descriptive labels, and typographical markers specific to COLEM transcription conventions. We also considered the time required to produce and correct the transcriptions. The results obtained were compared with manual (typed) and automatic transcriptions.
Manual input offers the flexibility needed to achieve the level of accuracy required for transcription, but it is neither the fastest nor the most rigorous method. As for automatic transcriptions, none fully meets the conditions required to save time or reduce editing effort. It has also been noted that the performance of automatic speech recognition fluctuates according to the speakers and the characteristics of the recordings, causing considerable variations in the time needed to correct transcriptions. The most stable results were obtained with respoken transcriptions made in real time, and those made with software installed on the computer were better than others.
Since it minimizes the variability of acoustic signals, provides indicators for the representation of dialogical construction, and promotes automatic recognition of vocabulary derived from variations in Spanish as well as other languages, respeaking requires an average of only 9.2 minutes for each minute of COLEM recording, including real-time respeaking and two revisions made from the audio by two different individuals.
In addition, the ASR errors have been categorized, depending on whether they concern misspelling or non-compliance with data protection, inaccuracies in the segmentation of linguistic units, in the written representation of speech interruption mechanisms, in dialogical construction or in the lexicon. / Esta investigación se centra en uno de los grandes retos que acompañan al estudio empírico de los fenómenos lingüísticos: la optimización de recursos materiales y humanos para transcribir. Para ello, propone el rehablado off-line, un método de transcripción vocal asistido por una herramienta de reconocimiento automático del habla (RAH) inspirado del subtitulado vocal para programas audiovisuales. La transcripción del habla espontánea es un trabajo intenso y difícil, que requiere plasmar todos los niveles de la comunicación lingüística, extralingüística y paralingüística, con sus dificultades exacerbadas por los retos propios del habla espontánea, como la autocorrección, la vacilación, la repetición, la variación o los fenómenos de contacto.
Para medir el esfuerzo que conlleva lograr un producto de calidad, primero se rehablaron una serie de grabaciones del Corpus oral de la lengua española en Montreal (COLEM), que refleja todas las variedades del español en contacto con el francés y el inglés. La calidad de las transcripciones se midió en relación con la exactitud: a mayor exactitud, menor tiempo necesario para la corrección. Se contabilizaron las palabras eliminadas, insertadas y sustituidas incorrectamente; pero también computaron los signos de puntuación, las etiquetas descriptivas y demás marcas tipográficas de las convenciones de transcripción del COLEM; los resultados serían inferiores a los de otros trabajos, pero también más realistas. Asimismo, se consideró el tiempo necesario para producir y corregir las transcripciones. Los resultados se compararon con transcripciones mecanografiadas (manuales) y automáticas.
La mecanografía brinda flexibilidad para producir el nivel de detalle de transcripción requerido, pero no es el método más rápido, ni el más exacto. Ninguna de las transcripciones automáticas reúne las condiciones satisfactorias para ganar tiempo ni disminuir esfuerzo. Además, el rendimiento de la tecnología de RAH es muy diferente para determinados hablantes y grabaciones, haciendo fluctuar excesivamente el tiempo de corrección entre una entrevista y otra. Todas las transcripciones rehabladas se hacen en tiempo real y brindan resultados más estables. Las realizadas con un programa instalado en la computadora, que puede editarse, son superiores a las demás.
Gracias a las acciones para minimizar la variación en las señales acústicas, suministrar claves de representación de la mecánica conversacional y complementar el reconocimiento automático del léxico en cualquier variedad del español, y en otras lenguas, las transcripciones de las entrevistas del COLEM se rehablaron y se revisaron dos veces con el audio por dos personas en un promedio de 9,2 minutos por minuto de grabación.
Adicionalmente, se han categorizado los errores que pueden aparecer en las transcripciones realizadas con la tecnología de RAH según sean infracciones a la ortografía o a la protección de datos, errores de segmentación de las unidades del habla, de representación gráfica de los recursos de interrupción de la cadena hablada, del andamiaje conversacional o de cualquier elemento léxico.
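A sketch of the accuracy measure the abstract describes, in which punctuation marks and descriptive labels are counted alongside words before computing an edit-distance-based error rate; the tokenisation rule and the example strings are illustrative, not the COLEM conventions themselves.

```python
# Sketch: token-level error rate where punctuation and bracketed descriptive
# labels (e.g. "(risas)") count as tokens, as in the evaluation described above.
import re
import jiwer

def tokenize(text):
    # keep bracketed labels, words and punctuation marks as separate tokens
    return re.findall(r"\([^)]*\)|[\w']+|[.,;:?!¿¡]", text)

reference  = "Sí, claro. (risas) Yo llegué en 2010."
hypothesis = "Sí claro (risas) yo llegué en 2010,"

ref_toks, hyp_toks = tokenize(reference), tokenize(hypothesis)
error_rate = jiwer.wer(" ".join(ref_toks), " ".join(hyp_toks))
print(f"token error rate: {error_rate:.2%}  accuracy: {1 - error_rate:.2%}")
```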
497 |
臺灣大學生透過電腦輔助軟體學習英語發音的研究 / A Passage to being understood and understanding others. 蔡碧華 (Tsai, Pi Hua). Unknown Date.
本研究旨在調查電腦輔助英語發音學習軟體 「MyET」,對學習者在學習英語發音方面的影響。 利用電腦輔助英語發音學習軟體(CAPT),練習英語的類化效果,也列為調查重點之一。 此外,學生使用CAPT過程中遭遇的困難和挑戰,以及互動過程中發展出來的對策也一一加以探討。 本研究的目的是要把CAPT在英語聲韻教學的領域中做正確的定位,並且探討如何使用其他的中介工具(例如人類)來強化此類軟體的輔助學習效果。
參與本次研究的大學生一共有九十名,分為三組:兩組CAPT組(亦即實驗組,使用CAPT獨自或與同儕一起使用CAPT學習英語發音)、非CAPT組(控制組)一 組。每組三十名。實驗開始,所有學生以十週的時間練習朗讀 從「灰姑娘」(Cinderella) 摘錄的文字,此段文字由發行 MyET 的公司線上免費提供。 實驗前與實驗後,兩組的學生各接受一次測驗。 每週練習結束後,學生必須將學習心得記載於學習日誌上;教師也針對每個學生的學習心得給予指導回饋。
研究結果顯示,兩個CAPT組別(亦即使用CAPT發音學習軟體的組別)的學生在學習英語聲韻的過程中,都有明顯及正面的進步與改變。尤其是語調與速度快慢方面的進步遠勝於發音的進步。再者,實驗組學生以十週的時間利用CAPT學習英語後,在朗讀新的文字時,無論是發音或語調都有類化的效應,但是在速度快慢方面則無顯著進步。然而,實驗結果三組的發音表現,在量化統計上並未達到明顯的差異。
雖然如此,在質化的探究上,經過分析學生的學習心得後得知:所有組別當中,獨自使用CAPT學習英語發音的組別,最能夠自我審視語言學習歷程 (包括模仿和學習樂趣)。至於共同使用CAPT學習的學生自述在英語流暢度、語調及發音方面獲致最大的改善。控制組的學生因為沒有同儕的鷹架教學及回饋,也沒有 MyET提供的練習回饋,練習過程中,學生自述學習困難的頻率最高,學生也認為學習收穫很少。 參與本次研究實驗組的學生認為, CAPT提供練習回饋的機制設計有改進的空間。 有關本研究結果在理論及英語教學上的意涵以及研究限制,於結論當中一一提出加以討論。
關鍵字:電腦輔助語言教學,語音辨識軟體,超音段,語調,時長,學習策略,
中介 / The present study investigated the impact of computer-assisted pronunciation training (CAPT) software, i.e., MyET, on students’ learning of English pronunciation. The investigation foci included the generalization of the effect of practice with the CAPT system. Also examined were the difficulties and challenges reported by the students who employed the CAPT system and the strategy scheme they developed from their interaction with the system. This study aimed to position the role of the CAPT system in the arena of instruction on English pronunciation and to investigate how other kinds of mediation, such as peer support, could reinforce its efficacy.
This study involved 90 Taiwanese college students, divided into two experimental groups and one control group. The two experimental groups practiced English pronunciation by using a computer-assisted pronunciation training (CAPT) program either independently or with peers while the control group only had access to MP3 files in their practice. All the groups practiced for ten weeks texts adopted from a play, Cinderella, provided by MyET free of charge on line. They all received a pretest and a posttest on the texts they had practiced and a novel text. Each week after their practice with the texts, the participants were asked to write down in their learning logs their reflections on the learning process in Chinese. In the same way, the instructor would provide her feedback on the students’ reflections in the logs every week.
The results showed that the ten-week practice with the CAPT system resulted in significant and positive changes in the learning of English pronunciation of CAPT groups (i.e., the Self-Access CAPT Group and the Collaborative CAPT Group). The progress of the participants in intonation and timing was always higher than in segmental pronunciation. Moreover, the ten-week practice with the CAPT system was found to be generalized (though the generalization is less than mediocre) to the participants’ performance in the production of segmental pronunciation and intonation but not in the timing component in reading the novel text. However, the improvement of the CAPT groups was not great enough to differentiate themselves from the MP3 Group.
Though the quantitative investigation did not reveal significant group differences, the qualitative analysis of the students’ reflections showed that the learning processes all the three groups went through differed. The Self-Access CAPT Group outperformed the other two groups in developing self-monitoring of language learning and production, and in enjoying working with the CAPT system/texts. Among the three groups, the Collaborative CAPT Group outscored the other two groups in reporting their gains and improvement in fluency, intonation and segmental pronunciation, as well as developing strategies to deal with their learning difficulty. Though the students in the MP3 group also made significant progress after the practice, without peers’ scaffolding and the feedback provided by MyET, they reported the highest frequency of difficulties and the least frequency of gains and strategies during the practice. The participants of this study also considered necessary the improvement of the CAPT system’s feedback design. At the end of the study theoretical and pedagogical implications as well as research limitations are presented.
Key words: Computer-Assisted Language Learning (CALL), Automatic Speech Recognition System (ASRS), segmental pronunciation, prosody, intonation, timing, learning strategies, mediation
498 |
Réseaux de neurones profonds appliqués à la compréhension de la parole / Deep learning applied to spoken language understanding. Simonnet, Edwin. 12 February 2019.
Cette thèse s'inscrit dans le cadre de l'émergence de l'apprentissage profond et aborde la compréhension de la parole assimilée à l'extraction et à la représentation automatique du sens contenu dans les mots d'une phrase parlée. Nous étudions une tâche d'étiquetage en concepts sémantiques dans un contexte de dialogue oral évaluée sur le corpus français MEDIA. Depuis une dizaine d'années, les modèles neuronaux prennent l'ascendant dans de nombreuses tâches de traitement du langage naturel grâce à des avancées algorithmiques ou à la mise à disposition d'outils de calcul puissants comme les processeurs graphiques. De nombreux obstacles rendent la compréhension complexe, comme l'interprétation difficile des transcriptions automatiques de la parole étant donné que de nombreuses erreurs sont introduites par le processus de reconnaissance automatique en amont du module de compréhension. Nous présentons un état de l'art décrivant la compréhension de la parole puis les méthodes d'apprentissage automatique supervisé pour la résoudre en commençant par des systèmes classiques pour finir avec des techniques d'apprentissage profond. Les contributions sont ensuite exposées suivant trois axes. Premièrement, nous développons une architecture neuronale efficace consistant en un réseau récurent bidirectionnel encodeur-décodeur avec mécanisme d’attention. Puis nous abordons la gestion des erreurs de reconnaissance automatique et des solutions pour limiter leur impact sur nos performances. Enfin, nous envisageons une désambiguïsation de la tâche de compréhension permettant de rendre notre système plus performant. / This thesis is a part of the emergence of deep learning and focuses on spoken language understanding assimilated to the automatic extraction and representation of the meaning supported by the words in a spoken utterance. We study a semantic concept tagging task used in a spoken dialogue system and evaluated with the French corpus MEDIA. For the past decade, neural models have emerged in many natural language processing tasks through algorithmic advances or powerful computing tools such as graphics processors. Many obstacles make the understanding task complex, such as the difficult interpretation of automatic speech transcriptions, as many errors are introduced by the automatic recognition process upstream of the comprehension module. We present a state of the art describing spoken language understanding and then supervised automatic learning methods to solve it, starting with classical systems and finishing with deep learning techniques. The contributions are then presented along three axes. First, we develop an efficient neural architecture consisting of a bidirectional recurrent network encoder-decoder with attention mechanism. Then we study the management of automatic recognition errors and solutions to limit their impact on our performances. Finally, we envisage a disambiguation of the comprehension task making the systems more efficient.
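A minimal PyTorch sketch of the architecture family the abstract names: a bidirectional recurrent encoder with an attention-based decoder emitting one semantic-concept tag per input word. Layer sizes, vocabulary and tag-set sizes are illustrative, not the actual MEDIA system configuration.

```python
# Sketch: bidirectional recurrent encoder-decoder with attention for concept tagging.
# All dimensions and the dummy input are illustrative placeholders.
import torch
import torch.nn as nn

class AttnConceptTagger(nn.Module):
    def __init__(self, vocab, tags, emb=64, hid=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.enc = nn.GRU(emb, hid, bidirectional=True, batch_first=True)
        self.dec = nn.GRUCell(2 * hid + emb, 2 * hid)
        self.attn = nn.Linear(2 * hid, 2 * hid, bias=False)
        self.out = nn.Linear(2 * hid, tags)

    def forward(self, words):                      # words: (B, T)
        x = self.emb(words)
        enc, _ = self.enc(x)                       # (B, T, 2*hid)
        h = enc.mean(dim=1)                        # simple decoder initialisation
        logits = []
        for t in range(words.size(1)):
            # dot-product attention over encoder states
            scores = torch.bmm(enc, self.attn(h).unsqueeze(2)).squeeze(2)
            ctx = torch.bmm(torch.softmax(scores, 1).unsqueeze(1), enc).squeeze(1)
            h = self.dec(torch.cat([ctx, x[:, t]], dim=1), h)
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)          # (B, T, tags)

model = AttnConceptTagger(vocab=5000, tags=64)     # sizes are illustrative
dummy = torch.randint(0, 5000, (2, 7))
print(model(dummy).shape)                          # torch.Size([2, 7, 64])
```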
499 |
Speaker adaptation of deep neural network acoustic models using Gaussian mixture model framework in automatic speech recognition systems / Utilisation de modèles gaussiens pour l'adaptation au locuteur de réseaux de neurones profonds dans un contexte de modélisation acoustique pour la reconnaissance de la parole. Tomashenko, Natalia. 01 December 2017.
Les différences entre conditions d'apprentissage et conditions de test peuvent considérablement dégrader la qualité des transcriptions produites par un système de reconnaissance automatique de la parole (RAP). L'adaptation est un moyen efficace pour réduire l'inadéquation entre les modèles du système et les données liées à un locuteur ou un canal acoustique particulier. Il existe deux types dominants de modèles acoustiques utilisés en RAP : les modèles de mélanges gaussiens (GMM) et les réseaux de neurones profonds (DNN). L'approche par modèles de Markov cachés (HMM) combinés à des GMM (GMM-HMM) a été l'une des techniques les plus utilisées dans les systèmes de RAP pendant de nombreuses décennies. Plusieurs techniques d'adaptation ont été développées pour ce type de modèles. Les modèles acoustiques combinant HMM et DNN (DNN-HMM) ont récemment permis de grandes avancées et surpassé les modèles GMM-HMM pour diverses tâches de RAP, mais l'adaptation au locuteur reste très difficile pour les modèles DNN-HMM. L'objectif principal de cette thèse est de développer une méthode de transfert efficace des algorithmes d'adaptation des modèles GMM aux modèles DNN. Une nouvelle approche pour l'adaptation au locuteur des modèles acoustiques de type DNN est proposée et étudiée : elle s'appuie sur l'utilisation de fonctions dérivées de GMM comme entrée d'un DNN. La technique proposée fournit un cadre général pour le transfert des algorithmes d'adaptation développés pour les GMM à l'adaptation des DNN. Elle est étudiée pour différents systèmes de RAP à l'état de l'art et s'avère efficace par rapport à d'autres techniques d'adaptation au locuteur, ainsi que complémentaire. / Differences between training and testing conditions may significantly degrade recognition accuracy in automatic speech recognition (ASR) systems. Adaptation is an efficient way to reduce the mismatch between models and data from a particular speaker or channel. There are two dominant types of acoustic models (AMs) used in ASR: Gaussian mixture models (GMMs) and deep neural networks (DNNs). The GMM hidden Markov model (GMM-HMM) approach has been one of the most common technique in ASR systems for many decades. Speaker adaptation is very effective for these AMs and various adaptation techniques have been developed for them. On the other hand, DNN-HMM AMs have recently achieved big advances and outperformed GMM-HMM models for various ASR tasks. However, speaker adaptation is still very challenging for these AMs. Many adaptation algorithms that work well for GMMs systems cannot be easily applied to DNNs because of the different nature of these models. The main purpose of this thesis is to develop a method for efficient transfer of adaptation algorithms from the GMM framework to DNN models. A novel approach for speaker adaptation of DNN AMs is proposed and investigated. The idea of this approach is based on using so-called GMM-derived features as input to a DNN. The proposed technique provides a general framework for transferring adaptation algorithms, developed for GMMs, to DNN adaptation. It is explored for various state-of-the-art ASR systems and is shown to be effective in comparison with other speaker adaptation techniques and complementary to them.
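A sketch of the GMM-derived-feature idea described above: frame-level acoustic features are re-expressed through a trained GMM (here as log component posteriors) and that representation feeds the DNN acoustic model, so GMM-style adaptation can be applied in front of the network. The feature type, GMM size and tiny network below are illustrative choices, not the thesis configuration.

```python
# Sketch: GMM-derived features as input to a DNN acoustic model.
# All sizes, the MFCC front-end and the file name are illustrative.
import librosa
import numpy as np
import torch
import torch.nn as nn
from sklearn.mixture import GaussianMixture

audio, sr = librosa.load("utt.wav", sr=16000)              # placeholder utterance
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13).T   # (frames, 13)

# In the thesis framework the GMM would be trained beforehand and adapted to
# the speaker; here it is simply fitted to the utterance for illustration.
gmm = GaussianMixture(n_components=32, covariance_type="diag", max_iter=50)
gmm.fit(mfcc)

gmm_feats = np.log(gmm.predict_proba(mfcc) + 1e-10)        # (frames, 32)

dnn = nn.Sequential(nn.Linear(32, 256), nn.ReLU(),
                    nn.Linear(256, 500))                   # 500 = illustrative output size
out = dnn(torch.from_numpy(gmm_feats).float())
print(out.shape)                                           # (frames, 500)
```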
500 |
Reconhecimento de padrões aplicados à identificação de patologias de laringe / Pattern recognition applied to the identification of laryngeal pathologies. Sodré, Bruno Ribeiro. 23 February 2016.
As patologias que afetam a laringe estão aumentando consideravelmente nos últimos anos devido à condição da sociedade atual onde há hábitos não saudáveis como fumo, álcool e tabaco e um abuso vocal cada vez maior, talvez por conta do aumento da poluição sonora, principalmente nos grandes centros urbanos. Atualmente o exame utilizado pela endoscopia per-oral, direcionado a identificar patologias de laringe, são a videolaringoscopia e videoestroboscopia, ambos invasivos e por muitas vezes desconfortável ao paciente. Buscando melhorar o bem estar e minimizar o desconforto dos pacientes que necessitam submeter-se a estes procedimentos, este estudo tem como objetivo reconhecer padrões que possam ser aplicados à identificação de patologias de laringe de modo a auxiliar na criação de um novo método não invasivo em substituição ao método atual. Este trabalho utilizará várias configurações diferentes de redes neurais. A primeira rede neural foi gerada a partir de 524.287 resultados obtidos através das configurações k-k das 19 medidas acústicas disponíveis neste trabalho. Esta configuração atingiu uma acurácia de 99,5% (média de 96,99±2,08%) ao utilizar uma configuração com 11 e com 12 medidas acústicas dentre as 19 disponíveis. Utilizando-se 3 medidas rotacionadas (obtidas através do método de componentes principais), foi obtido uma acurácia de 93,98±0,24%. Com 6 medidas rotacionadas, o resultado obtido foi de acurácia foi de 94,07±0,29%. Para 6 medidas rotacionadas com entrada normalizada, a acurácia encontrada foi de 97,88±1,53%. A rede neural que fez 23 diferentes classificações, voz normal mais 22 patologias, mostrou que as melhores classificações, de acordo com a acurácia, são a da patologia hiperfunção com 58,23±18,98% e a voz normal com 52,15±18,31%. Já para a pior patologia a ser classificada, encontrou-se a fadiga vocal com 0,57±1,99%. Excluindo-se a voz normal, ou seja, utilizando uma rede neural composta somente por vozes patológicas, a hiperfunção continua sendo a mais facilmente identificável com uma acurácia de 57,3±19,55%, a segunda patologia mais facilmente identificável é a constrição ântero-posterior com 18,14±11,45%. Nesta configuração, a patologia mais difícil de se classificar continua sendo a fadiga vocal com 0,7±2,14%. A rede com re-amostragem obteve uma acurácia de 25,88±10,15% enquanto que a rede com re-amostragem e alteração de neurônios na camada intermediária obteve uma acurácia de 21,47±7,58% para 30 neurônios e uma acurácia de 18,44±6,57% para 40 neurônios. Por fim foi feita uma máquina de vetores suporte que encontrou um resultado de 67±6,2%. Assim, mostrou-se que as medidas acústicas precisam ser aprimoradas para a obtenção de melhores resultados de classificação dentre as patologias de laringe estudadas. Ainda assim, verificou-se que é possível discriminar locutores normais daqueles pacientes disfônicos. / Diseases that affect the larynx have been considerably increased in recent years due to the condition of nowadays society where there have been unhealthy habits like smoking, alcohol and tobacco and an increased vocal abuse, perhaps due to the increase in noise pollution, especially in large urban cities. Currently the exam performed by per-oral endoscopy (aimed to identify laryngeal pathologies) have been videolaryngoscopy and videostroboscopy, both invasive and often uncomfortable to the patient. 
Seeking to improve the comfort of patients who need to undergo these procedures, this study aims to identify acoustic patterns that can be applied to the identification of laryngeal pathologies, with a view to creating a new non-invasive larynx assessment method. Here two different configurations of neural networks were used. The first one was generated from 524,287 combinations of 19 acoustic measurements to classify voices as normal or as coming from a diseased larynx, and achieved a maximum accuracy of 99.5% (96.99±2.08%). Using 3 and 6 rotated measurements (obtained from the principal components analysis method), the accuracy was 93.98±0.24% and 94.07±0.29%, respectively. With 6 rotated measurements derived from a prior standardization of the 19 acoustic measurements, the accuracy was 97.88±1.53%. The second configuration, classifying 23 different voice types (including normal voices), showed the best accuracy in identifying hyperfunctioned larynxes and normal voices, with 58.23±18.98% and 52.15±18.31%, respectively. The worst accuracy was obtained for vocal fatigue, with 0.57±1.99%. Excluding normal voices from the analysis, hyperfunctioned voices remained the most easily identifiable (with an accuracy of 57.3±19.55%), followed by anterior-posterior constriction (with 18.14±11.45%), and the most difficult condition to identify remained vocal fatigue (with 0.7±2.14%). Re-sampling the neural networks' input vectors yielded accuracies of 25.88±10.15%, 21.47±7.58%, and 18.44±6.57% for networks with 20, 30, and 40 hidden-layer neurons, respectively. For comparison, classification using a support vector machine produced an accuracy of 67±6.2%. Thus, it was shown that the acoustic measurements need to be improved to achieve better classification results among the studied laryngeal pathologies. Even so, it was found that it is possible to discriminate normal from dysphonic speakers.
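A sketch of the classification setup the abstract describes: 19 acoustic measurements per sample, optional standardisation and PCA, and a neural network compared against an SVM. The random arrays below merely stand in for the real normal and dysphonic voice data.

```python
# Sketch: PCA + MLP versus SVM on 19 acoustic measures per voice sample.
# The random data is a placeholder for the real normal/pathological samples.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 19))        # 19 acoustic measures (placeholder data)
y = rng.integers(0, 2, size=200)      # 0 = normal, 1 = pathological

mlp = make_pipeline(StandardScaler(), PCA(n_components=6),
                    MLPClassifier(hidden_layer_sizes=(30,), max_iter=2000))
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

for name, clf in [("MLP+PCA", mlp), ("SVM", svm)]:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {acc:.2%}")
```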