11 |
A Silent-Speech Interface using Electro-Optical Stomatography
Stone, Simon, 21 June 2022
Speech technology is a major and growing industry that enriches the lives of technologically-minded people in a number of ways. Many potential users are, however, excluded: namely, all speakers who cannot produce speech easily, or at all. Silent-Speech Interfaces offer a way to communicate with a machine through a convenient speech-driven interface without the need for acoustic speech. They can also, in principle, provide a full replacement voice by synthesizing the intended utterances that the user only silently articulates. To that end, the speech movements need to be captured and mapped to either text or acoustic speech. This dissertation proposes a new Silent-Speech Interface based on a newly developed measurement technology called Electro-Optical Stomatography and a novel parametric vocal tract model that enables real-time speech synthesis from the measured data. The hardware was used to conduct command-word recognition studies that reached and surpassed the state of the art in intra- and inter-individual accuracy. Furthermore, a study was completed in which the hardware controlled the vocal tract model in a direct articulation-to-speech synthesis loop. While the intelligibility of synthesized vowels was rated high, the intelligibility of consonants and connected speech was quite poor. Promising ways to improve the system are discussed in the outlook.
Statement of authorship iii
Abstract v
List of Figures vii
List of Tables xi
Acronyms xiii
1. Introduction 1
1.1. The concept of a Silent-Speech Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2. Structure of this work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2. Fundamentals of phonetics 7
2.1. Components of the human speech production system . . . . . . . . . . . . . . . . . . . 7
2.2. Vowel sounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3. Consonantal sounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4. Acoustic properties of speech sounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.5. Coarticulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.6. Phonotactics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.7. Summary and implications for the design of a Silent-Speech Interface (SSI) . . . . . . . 21
3. Articulatory data acquisition techniques in Silent-Speech Interfaces 25
3.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2. Scope of the literature review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3. Video Recordings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4. Ultrasonography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.5. Electromyography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.6. Permanent-Magnetic Articulography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.7. Electromagnetic Articulography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.8. Radio waves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.9. Palatography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.10. Conclusion and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4. Electro-Optical Stomatography 55
4.1. Contact sensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2. Optical distance sensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3. Lip sensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.4. Sensor Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.5. Control Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.6. Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5. Articulation-to-Text 99
5.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.2. Command word recognition pilot study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.3. Command word recognition small-scale study . . . . . . . . . . . . . . . . . . . . . . . . 102
6. Articulation-to-Speech 109
6.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.2. Articulatory synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.3. The six point vocal tract model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.4. Objective evaluation of the vocal tract model . . . . . . . . . . . . . . . . . . . . . . . . 116
6.5. Perceptual evaluation of the vocal tract model . . . . . . . . . . . . . . . . . . . . . . . . 120
6.6. Direct synthesis using EOS to control the vocal tract model . . . . . . . . . . . . . . . . 125
6.7. Pitch and voicing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
7. Summary and outlook 145
7.1. Summary of the contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.2. Outlook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
A. Overview of the International Phonetic Alphabet 151
B. Mathematical proofs and derivations 153
B.1. Combinatoric calculations illustrating the reduction of possible syllables using phonotactics . . . 153
B.2. Signal Averaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
B.3. Effect of the contact sensor area on the conductance . . . . . . . . . . . . . . . . . . . . 155
B.4. Calculation of the forward current for the OP280V diode . . . . . . . . . . . . . . . . . . 155
C. Schematics and layouts 157
C.1. Schematics of the control unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
C.2. Layout of the control unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
C.3. Bill of materials of the control unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
C.4. Schematics of the sensor unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
C.5. Layout of the sensor unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
C.6. Bill of materials of the sensor unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
D. Sensor unit assembly 169
E. Firmware flow and data protocol 177
F. Palate file format 181
G. Supplemental material regarding the vocal tract model 183
H. Articulation-to-Speech: Optimal hyperparameters 189
Bibliography 191
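As a rough illustration of the articulation-to-text task summarized in the abstract above, the sketch below classifies an unknown command word by comparing its sequence of articulatory sensor frames against stored templates with dynamic time warping. This is not the classifier or feature set used in the dissertation (the abstract does not specify them); the frame dimensionality and the template-matching strategy are illustrative assumptions.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic-time-warping distance between two utterances, each a
    (num_frames, num_features) array of articulatory sensor readings."""
    ta, tb = len(a), len(b)
    cost = np.full((ta + 1, tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, ta + 1):
        for j in range(1, tb + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])   # frame-wise Euclidean distance
            cost[i, j] = d + min(cost[i - 1, j],      # skip a frame in a
                                 cost[i, j - 1],      # skip a frame in b
                                 cost[i - 1, j - 1])  # align the two frames
    return cost[ta, tb]

def recognize(test_utterance, templates):
    """Return the command-word label of the nearest stored template.
    `templates` is a list of (label, reference_utterance) pairs."""
    return min(templates, key=lambda t: dtw_distance(test_utterance, t[1]))[0]

# Toy example with made-up 4-dimensional sensor frames
rng = np.random.default_rng(0)
templates = [("open", rng.random((20, 4))), ("close", rng.random((25, 4)))]
print(recognize(rng.random((22, 4)), templates))
```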
|
12 |
Italianising English words with G2P techniques in TTS voices. An evaluation of different models
Grassini, Francesco, January 2024
Text-to-speech voices have come a long way in terms of naturalness and now sound closer to human speech than ever. Among the problems that persist, however, is the pronunciation of foreign words. The experiments conducted in this thesis use grapheme-to-phoneme (G2P) models to tackle this issue and, more specifically, to adapt the erroneous pronunciation of English words in Italian-speaking voices to an Italian English accent. We curated a dataset of words collected during recording sessions with an Italian voice actor reading general conversational sentences, and manually transcribed their pronunciation in Italian English. In a second stage, we augmented the dataset by collecting the most common surnames in Great Britain and the United States, phonetically transcribing them with a rule-based phoneme mapping algorithm previously deployed by the company, and then manually adjusting the pronunciations to Italian English. Thirdly, using the massively multilingual ByT5 model, a byte-level Transformer pre-trained on 100 languages, its tokenizer-dependent counterparts T5_base and T5_small, and an LSTM with attention based on OpenNMT, we performed 10-fold cross-validation on the curated dataset. The results show that augmenting the data benefited every model. In terms of PER, WER and accuracy, the transformer-based ByT5_small strongly outperformed its T5_small and T5_base counterparts even with a third or two-thirds of the training data. The second-best performing model, the attention-based LSTM built with the OpenNMT framework, also outperformed the T5 models and was the lightest in terms of trainable parameters (2M) compared to ByT5 (299M) and the T5 models (60M and 200M).
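Since the abstract reports results in terms of PER (phoneme error rate) and WER, a minimal sketch of how PER can be computed from reference and predicted phoneme sequences is shown below; the edit-distance formulation is standard, and the example pronunciation is invented rather than taken from the thesis dataset.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    dp = list(range(len(hyp) + 1))           # previous row of the DP table
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(dp[j] + 1,             # deletion
                      dp[j - 1] + 1,         # insertion
                      prev + (r != h))       # substitution (free if phonemes match)
            prev, dp[j] = dp[j], cur
    return dp[-1]

def phoneme_error_rate(references, hypotheses):
    """PER = total phoneme-level edits / total number of reference phonemes."""
    edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    return edits / total

# Invented example: an Italian English rendering of "Smith"
reference = [["s", "m", "i", "t"]]   # manually transcribed target pronunciation
hypothesis = [["s", "m", "i", "f"]]  # model prediction with one substitution
print(phoneme_error_rate(reference, hypothesis))  # 0.25
```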
|
13 |
Computer-based speech therapy using visual feedback with focus on children with profound hearing impairments
Öster, Anne-Marie, January 2006
This thesis presents work in the area of computer-based speech therapy using different types of visual feedback to replace the auditory feedback channel. The study covers diagnostic assessment methods prior to therapy, the design of the therapy, and the type of visual feedback suited to different users during different stages of therapy, with the aim of increasing its efficiency. The thesis focuses on individual computer-based speech therapy (CBST) for profoundly hearing-impaired children as well as on computer-assisted pronunciation training (CAPT) for teaching and training the prosody of a second language. Children who are born with a profound hearing loss have no acoustic speech target to imitate and compare their own production with. They therefore develop no spontaneous speech but have to learn speech through vision, tactile sensation and, if possible, residual hearing. They have to rely on the limited visibility of phonetic features in learning oral speech and on orosensory-motor control in maintaining speech movements. These children constitute a heterogeneous group needing individualized speech therapy, because their ability to communicate with speech depends not only on the amount of hearing, as measured by pure-tone audiometry, but also on the quality of the hearing sensation and on the use the children, through training, are able to make of their functional hearing for speech. Adult second-language learners, on the other hand, have difficulties in perceiving the phonetics and prosody of a second language through audition, not because of a hearing loss but because interference from their native language prevents them from hearing new sound contrasts. The thesis presents an overview of work on speech communication and profound hearing impairment, including studies of residual hearing for speech processing, the effects of speech input limitations on speech production, the interaction between individual deviations and speech intelligibility, and methods for assessing the phonetic realization of phonological systems. Finally, through several clinical evaluation studies of three Swedish computer-based therapy systems, concerning functionality, efficiency, types of visual feedback, therapy design, and practical usability for different users, important recommendations are specified for future developments.
|
14 |
Automatic speaker verification on site and by telephone: methods, applications and assessment
Melin, Håkan, January 2006
Speaker verification is the biometric task of authenticating a claimed identity by means of analyzing a spoken sample of the claimant's voice. The present thesis deals with various topics related to automatic speaker verification (ASV) in the context of its commercial applications, characterized by co-operative users, user-friendly interfaces, and requirements for small amounts of enrollment and test data. A text-dependent system based on hidden Markov models (HMM) was developed and used to conduct experiments, including a comparison between visual and aural strategies for prompting claimants for randomized digit strings. It was found that aural prompts lead to more errors in spoken responses and that visually prompted utterances performed marginally better in ASV, given that enrollment data were visually prompted. High-resolution flooring techniques were proposed for variance estimation in the HMMs, but results showed no improvement over the standard method of using target-independent variances copied from a background model. These experiments were performed on Gandalf, a Swedish speaker verification telephone corpus with 86 client speakers. A complete on-site application (PER), a physical access control system securing a gate in a reverberant stairway, was implemented based on a combination of the HMM system and a system based on Gaussian mixture models. Users were authenticated by saying their proper name and a visually prompted, random sequence of digits after having enrolled by speaking ten utterances of the same type. An evaluation was conducted with 54 out of 56 clients who succeeded in enrolling. Semi-dedicated impostor attempts were also collected. An equal error rate (EER) of 2.4% was found for this system based on a single attempt per session and after retraining the system on PER-specific development data. On parallel telephone data collected using a telephone version of PER, an EER of 3.5% was found with landline and around 5% with mobile telephones. Impostor attempts in this case were same-handset attempts. Results also indicate that the distributions of false-reject and false-accept rates over target speakers are well described by beta distributions. A state-of-the-art commercial system was also tested on PER data, with performance similar to the baseline research system.
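The equal error rate (EER) reported throughout this abstract is the operating point at which the false-reject and false-accept rates coincide. A minimal sketch of how it can be estimated from genuine and impostor verification scores follows; the score distributions are synthetic and are not drawn from the Gandalf or PER corpora.

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """Sweep the decision threshold over all observed scores and return the
    error rate where false-reject and false-accept rates are closest.
    Higher scores are assumed to support the claimed identity."""
    genuine, impostor = np.asarray(genuine), np.asarray(impostor)
    best_gap, eer = np.inf, None
    for t in np.sort(np.concatenate([genuine, impostor])):
        frr = np.mean(genuine < t)    # true clients rejected at this threshold
        far = np.mean(impostor >= t)  # impostors accepted at this threshold
        if abs(frr - far) < best_gap:
            best_gap, eer = abs(frr - far), (frr + far) / 2
    return eer

# Synthetic verification scores for illustration only
rng = np.random.default_rng(1)
genuine_scores = rng.normal(2.0, 1.0, 500)    # target-speaker trials
impostor_scores = rng.normal(0.0, 1.0, 5000)  # impostor trials
print(f"EER = {equal_error_rate(genuine_scores, impostor_scores):.1%}")
```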
|
16 |
The Virtual Language Teacher : Models and applications for language learning using embodied conversational agents
Wik, Preben, January 2011
This thesis presents a framework for computer-assisted language learning using a virtual language teacher. It is an attempt at creating not only a new type of language learning software, but also a server-based application that collects large amounts of speech material for future research purposes. The motivation for the framework is to create a research platform for computer-assisted language learning and computer-assisted pronunciation training. Within the thesis, different feedback strategies and pronunciation error detectors are explored. This is a broad, interdisciplinary approach, combining research from a number of scientific disciplines, such as speech technology, game studies, cognitive science, phonetics, phonology, and second-language acquisition and teaching methodologies. The thesis discusses the paradigm both from a top-down point of view, where a number of functionally separate but interacting units are presented as part of a proposed architecture, and bottom-up, by demonstrating and testing an implementation of the framework.
|
17 |
Retour articulatoire visuel par échographie linguale augmentée : développements et application clinique / Augmented tongue ultrasound-based visual articulatory biofeedback: developments and clinical application
Fabre, Diandra, 16 December 2016
In the framework of speech therapy for articulatory troubles associated with tongue misplacement, providing visual feedback can be very useful for both the therapist and the patient, as the tongue is not a naturally visible articulator. In recent years, ultrasound imaging has been successfully applied to speech therapy in English-speaking countries, as reported in several case studies. The assumption that visual articulatory biofeedback may facilitate the rehabilitation of the patient is supported by studies on the links between speech production and perception. During speech therapy sessions, the patient seems to better understand his or her tongue movements, despite the poor quality of the image due to inherent noise and the lack of information about other speech articulators. We develop in this thesis the concept of augmented lingual ultrasound. We propose two approaches to improve the raw ultrasound image, and describe a first clinical application of this device. The first approach focuses on tongue tracking in ultrasound images. We propose a method based on supervised machine learning, in which we model the relationship between the intensity of all the pixels of the image and the contour coordinates. The dimensionality of the images and of the contours is reduced using principal component analysis, and a neural network models their relationship. We developed speaker-dependent and speaker-independent implementations and evaluated the performance as a function of the amount of manually annotated contours used as training data. We obtained an error of 1.29 mm for the speaker-dependent model with only 80 annotated images, which is better than the performance of the EdgeTrak reference method based on active contours. The second approach aims to automatically animate an articulatory talking head from the ultrasound images. This talking head is the avatar of a reference speaker that reveals the external and internal structures of the vocal tract (palate, pharynx, teeth, etc.). First, we build a mapping model between ultrasound images and tongue control parameters acquired on the reference speaker. We then adapt this model to new speakers, referred to as source speakers. This adaptation is performed with the Cascaded Gaussian Mixture Regression (C-GMR) technique, based on a joint model of the ultrasound data of the reference speaker, the control parameters of the talking head, and adaptation ultrasound data of the source speaker. This approach is compared with a direct GMR regression between the source speaker data and the control parameters of the talking head. We show that the C-GMR approach achieves the best compromise between the amount of adaptation data and prediction quality. We also evaluate the generalization capability of the C-GMR approach and show that prior information about the reference speaker helps the model generalize to articulatory configurations of the source speaker unseen during the adaptation phase. Finally, we present preliminary results of a clinical application of augmented ultrasound imaging to a population of patients after partial glossectomy. We evaluate the use of real-time visual feedback of the patient's tongue and of sequences recorded beforehand with a speech therapist to illustrate the target articulations, using standard speech therapy assessments carried out between each series of sessions. The first results show an improvement of the patients' performance, especially for tongue placement.
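The first approach (mapping raw pixel intensities to tongue contour coordinates via principal component analysis followed by a neural network) can be sketched roughly as follows. The arrays, component counts, and network size below are illustrative placeholders, not the configuration or data used in the thesis.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor

# Synthetic stand-ins: 200 flattened 64x64 ultrasound frames and their
# manually annotated tongue contours (30 (x, y) points each).
rng = np.random.default_rng(0)
images = rng.random((200, 64 * 64))
contours = rng.random((200, 30 * 2))

# Reduce the dimensionality of both spaces with PCA
pca_img = PCA(n_components=30).fit(images)
pca_cnt = PCA(n_components=8).fit(contours)
X = pca_img.transform(images)
Y = pca_cnt.transform(contours)

# A small neural network maps image components to contour components
net = MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000, random_state=0)
net.fit(X, Y)

# Inference on a new frame: project, predict, reconstruct the contour
new_frame = rng.random((1, 64 * 64))
pred = pca_cnt.inverse_transform(net.predict(pca_img.transform(new_frame)))
tongue_contour = pred.reshape(30, 2)   # predicted (x, y) contour points
```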
|
18 |
Inference of string mappings for speech technology
Jansche, Martin, 15 October 2003
No description available.
|
19 |
Jazyková analýza vybraných mluvených projevů / Linguistic analysis of choice spoken orations
Langová, Veronika, January 2016
This thesis presents a linguistic analysis of selected spoken speeches. It focuses on the basic structure of these speeches and characterizes their linguistic means in terms of phonology, morphology, syntax and lexis. The speeches are analyzed linguistically and stylistically and then compared.
|
20 |
Automatic Podcast Chapter Segmentation : A Framework for Implementing and Evaluating Chapter Boundary Models for Transcribed Audio Documents / Automatisk kapitelindelning för podcasts : Ett ramverk för att implementera och utvärdera segmenteringsmodeller för ljuddokument
Feldstein Jacobs, Adam, January 2022
Podcasts are an exponentially growing audio medium in which useful and relevant content needs to be surfaced, which requires new methods of information sorting. This thesis is the first to look into the problem of segmenting podcasts into chapters (structurally and topically coherent sections). Podcast segmentation is a more difficult problem than segmenting structured text due to spontaneous speech and transcription errors from automatic speech recognition systems. This thesis used author-provided timestamps from podcast descriptions as labels to perform supervised learning, with binary classification carried out on sentences from podcast transcripts. A general framework is delivered for creating a dataset of 21,436 podcast episodes, training a supervised model, and evaluating it. The framework addresses technical challenges such as a high data imbalance (there are few chapter transitions per episode) and finding an appropriate context size (how many sentences are shown to the model during inference). The proposed model outperformed a baseline model both in quantitative metrics and in a human evaluation of 100 transitions. The solution provided in this thesis can be used to chapterize podcasts, which has many downstream applications, such as segment sorting, summarization, and information retrieval.
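The supervised setup described above (author-provided chapter timestamps turned into sentence-level binary labels, with a configurable context window around each candidate sentence) might be organized along these lines. The data structures, tolerance value, and the tiny example episode are illustrative assumptions rather than the thesis's actual code or dataset.

```python
from dataclasses import dataclass

@dataclass
class Sentence:
    text: str
    start: float  # seconds from the beginning of the episode

def label_sentences(sentences, chapter_starts, tolerance=2.0):
    """Mark a sentence as a chapter boundary if its start time falls within
    `tolerance` seconds of an author-provided chapter timestamp."""
    return [any(abs(s.start - t) <= tolerance for t in chapter_starts)
            for s in sentences]

def context_windows(sentences, labels, size=2):
    """Pair each sentence with `size` neighbours on each side, so a classifier
    sees local context when judging whether a chapter transition occurs here."""
    examples = []
    for i, (s, y) in enumerate(zip(sentences, labels)):
        left = [x.text for x in sentences[max(0, i - size):i]]
        right = [x.text for x in sentences[i + 1:i + 1 + size]]
        examples.append((" ".join(left + [s.text] + right), int(y)))
    return examples

# Tiny invented episode with two chapters starting at 0 s and 60 s
episode = [Sentence("Welcome to the show.", 0.0),
           Sentence("Today we talk about training.", 6.0),
           Sentence("Let's move on to recovery.", 61.0),
           Sentence("Sleep matters a lot.", 70.0)]
labels = label_sentences(episode, chapter_starts=[0.0, 60.0])
print(context_windows(episode, labels, size=1))
```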
|