Spelling suggestions: "subject:"epeech synthesis"" "subject:"cpeech synthesis""
111 |
Building a prosodically sensitive diphone database for a Korean text-to-speech synthesis systemYoon, Kyuchul, January 2005 (has links)
Thesis (Ph. D.)--Ohio State University, 2005. / Title from first page of PDF file. Document formatted into pages; contains xxii, 291 p.; also includes graphics (some col.) Includes bibliographical references (p. 210-216). Available online via OhioLINK's ETD Center
|
112 |
Statistical parametric speech synthesis based on sinusoidal modelsHu, Qiong January 2017 (has links)
This study focuses on improving the quality of statistical speech synthesis based on sinusoidal models. Vocoders play a crucial role during the parametrisation and reconstruction process, so we first lead an experimental comparison of a broad range of the leading vocoder types. Although our study shows that for analysis / synthesis, sinusoidal models with complex amplitudes can generate high quality of speech compared with source-filter ones, component sinusoids are correlated with each other, and the number of parameters is also high and varies in each frame, which constrains its application for statistical speech synthesis. Therefore, we first propose a perceptually based dynamic sinusoidal model (PDM) to decrease and fix the number of components typically used in the standard sinusoidal model. Then, in order to apply the proposed vocoder with an HMM-based speech synthesis system (HTS), two strategies for modelling sinusoidal parameters have been compared. In the first method (DIR parameterisation), features extracted from the fixed- and low-dimensional PDM are statistically modelled directly. In the second method (INT parameterisation), we convert both static amplitude and dynamic slope from all the harmonics of a signal, which we term the Harmonic Dynamic Model (HDM), to intermediate parameters (regularised cepstral coefficients (RDC)) for modelling. Our results show that HDM with intermediate parameters can generate comparable quality to STRAIGHT. As correlations between features in the dynamic model cannot be modelled satisfactorily by a typical HMM-based system with diagonal covariance, we have applied and tested a deep neural network (DNN) for modelling features from these two methods. To fully exploit DNN capabilities, we investigate ways to combine INT and DIR at the level of both DNN modelling and waveform generation. For DNN training, we propose to use multi-task learning to model cepstra (from INT) and log amplitudes (from DIR) as primary and secondary tasks. We conclude from our results that sinusoidal models are indeed highly suited for statistical parametric synthesis. The proposed method outperforms the state-of-the-art STRAIGHT-based equivalent when used in conjunction with DNNs. To further improve the voice quality, phase features generated from the proposed vocoder also need to be parameterised and integrated into statistical modelling. Here, an alternative statistical model referred to as the complex-valued neural network (CVNN), which treats complex coefficients as a whole, is proposed to model complex amplitude explicitly. A complex-valued back-propagation algorithm using a logarithmic minimisation criterion which includes both amplitude and phase errors is used as a learning rule. Three parameterisation methods are studied for mapping text to acoustic features: RDC / real-valued log amplitude, complex-valued amplitude with minimum phase and complex-valued amplitude with mixed phase. Our results show the potential of using CVNNs for modelling both real and complex-valued acoustic features. Overall, this thesis has established competitive alternative vocoders for speech parametrisation and reconstruction. The utilisation of proposed vocoders on various acoustic models (HMM / DNN / CVNN) clearly demonstrates that it is compelling to apply them for the parametric statistical speech synthesis.
|
113 |
Statistical parametric speech synthesis using conversational data and phenomenaDall, Rasmus January 2017 (has links)
Statistical parametric text-to-speech synthesis currently relies on predefined and highly controlled prompts read in a “neutral” voice. This thesis presents work on utilising recordings of free conversation for the purpose of filled pause synthesis and as an inspiration for improved general modelling of speech for text-to-speech synthesis purposes. A corpus of both standard prompts and free conversation is presented and the potential usefulness of conversational speech as the basis for text-to-speech voices is validated. Additionally, through psycholinguistic experimentation it is shown that filled pauses can have potential subconscious benefits to the listener but that current text-to-speech voices cannot replicate these effects. A method for pronunciation variant forced alignment is presented in order to obtain a more accurate automatic speech segmentation something which is particularly bad for spontaneously produced speech. This pronunciation variant alignment is utilised not only to create a more accurate underlying acoustic model, but also as the driving force behind creating more natural pronunciation prediction at synthesis time. While this improves both the standard and spontaneous voices the naturalness of spontaneous speech based voices still lags behind the quality of voices based on standard read prompts. Thus, the synthesis of filled pauses is investigated in relation to specific phonetic modelling of filled pauses and through techniques for the mixing of standard prompts with spontaneous utterances in order to retain the higher quality of standard speech based voices while still utilising the spontaneous speech for filled pause modelling. A method for predicting where to insert filled pauses in the speech stream is also developed and presented, relying on an analysis of human filled pause usage and a mix of language modelling methods. The method achieves an insertion accuracy in close agreement with human usage. The various approaches are evaluated and their improvements documented throughout the thesis, however, at the end the resulting filled pause quality is assessed through a repetition of the psycholinguistic experiments and an evaluation of the compilation of all developed methods.
|
114 |
Pronunciation and disfluency modeling for expressive speech synthesis / Modélisation de la prononciation et des disfluences pour la synthèse de la parole expressiveQader, Raheel 31 March 2017 (has links)
Dans la première partie de cette thèse, nous présentons une nouvelle méthode de production de variantes de prononciations qui adapte des prononciations standards, c'est-à-dire issues d'un dictionnaire, à un style spontané. Cette méthode utilise une vaste gamme d'informations linguistiques, articulatoires et acoustiques, ainsi qu'un cadre probabiliste d'apprentissage automatique, à savoir les champs aléatoires conditionnels (CAC) et les modèles de langage. Nos expériences poussées sur le corpus Buckeye démontrent l'efficacité de l'approche à travers des évaluations objectives et perceptives. Des tests d'écoutes sur de la parole synthétisée montrent que les prononciations adaptées sont jugées plus spontanées que les prononciations standards, et même que celle réalisées par les locuteurs du corpus étudié. Par ailleurs, nous montrons que notre méthode peut être étendue à d'autres tâches d'adaptation, par exemple pour résoudre des problèmes d'incohérences entre les différentes séquences de phonèmes manipulées par un système de synthèse. La seconde partie de la thèse explore une nouvelle approche de production automatique de disfluences dans les énoncés en entrée d'un système de synthèse de la parole. L'approche proposée offre l'avantage de considérer plusieurs types de disfluences, à savoir des pauses, des répétitions et des révisions. Pour cela, nous présentons une formalisation novatrice du processus de production de disfluences à travers un mécanisme de composition de ces disfluences. Nous présentons une première implémentation de notre processus, elle aussi fondée sur des CAC et des modèles de langage, puis conduisons des évaluations objectives et perceptives. Celles-ci nous permettent de conclure à la bonne fonctionnalité de notre proposition et d'en discuter les pistes principales d'amélioration. / In numerous domains, the usage of synthetic speech is conditioned upon the ability of speech synthesis systems to generate natural and expressive speech. In this frame, we address the problem of expressivity in TTS by incorporating two phenomena with a high impact on speech: pronunciation variants and speech disfluencies. In the first part of this thesis, we present a new pronunciation variant generation method which works by adapting standard i.e., dictionary-based, pronunciations to a spontaneous style. Its strength and originality lie in exploiting a wide range of linguistic, articulatory and acoustic features and to use a probabilistic machine learning framework, namely conditional random fields (CRFs) and language models. Extensive experiments on the Buckeye corpus demonstrate the effectiveness of this approach through objective and subjective evaluations. Listening tests on synthetic speech show that adapted pronunciations are judged as more spontaneous than standard ones, as well as those realized by real speakers. Furthermore, we show that the method can be extended to other adaptation tasks, for instance, to solve the problem of inconsistency between phoneme sequences handled in TTS systems. The second part of this thesis explores a novel approach to automatic generation of speech disfluencies for TTS. Speech disfluencies are one of the most pervasive phenomena in spontaneous speech, therefore being able to automatically generate them is crucial to have more expressive synthetic speech. The proposed approach provides the advantage of generating several types of disfluencies: pauses, repetitions and revisions. To achieve this task, we formalize the problem as a theoretical process, where transformation functions are iteratively composed. We present a first implementation of the proposed process using CRFs and language models, before conducting objective and perceptual evaluations. These experiments lead to the conclusion that our proposition is effective to generate disfluencies, and highlights perspectives for future improvements.
|
115 |
Pronunciation support for Arabic learnersAlsabaan, Majed Soliman K. January 2015 (has links)
The aim of the thesis is to find out whether providing feedback to Arabic language learners will help them improve their pronunciation, particularly of words involving sounds that are not distinguished in their native languages. In addition, it aims to find out, if possible, what type of feedback will be most helpful. In order to achieve this aim, we developed a computational tool with a number of component sub tools. These tools involve the implementation of several substantial pieces of software. The first task was to ensure the system we were building could distinguish between the more challenging sounds when they were produced by a native speaker, since without that it will not be possible to classify learners’ attempts at these sounds. To this end, a number of experiments were carried out with the hidden Markov model toolkit (the HTK), a well known speech recognition toolkit, in order to ensure that it can distinguish between the confusable sounds, i.e. the ones that people have difficulty with. The developed computational tool analyses the differences between the user’s pronunciation and that of a native speaker by using grammar of minimal pairs, where each utterance is treated as coming from a family of similar words. This provides the ability to categorise learners’ errors - if someone is trying to say cat and the recogniser thinks they have said cad then it is likely that they are voicing the final consonant when it should be unvoiced. Extensive testing shows that the system can reliably distinguish such minimal pairs when they are produced by a native speaker, and that this approach does provide effective diagnostic information about errors. The tool provides feedback in three different sub-tools: as an animation of the vocal tract, as a synthesised version of the target utterance, and as a set of written instructions. The tool was evaluated by placing it in a classroom setting and asking 50 Arabic students to use the different versions of the tool. Each student had a thirty minute session with the tool, working their way through a set of pronunciation exercises at their own pace. The results of this group showed that their pronunciation does improve over the course of the session, though it was not possible to determine whether the improvement is sustained over an extended period. The evaluation was done from three points of view: quantitative analysis, qualitative analysis, and using a questionnaire. Firstly, the quantitative analysis gives raw numbers telling whether a learner had improved their pronunciation or not. Secondly, the qualitative analysis shows a behaviour pattern of what a learner did and how they used the tool. Thirdly, the questionnaire gives feedback from learners and their comments about the tool. We found that providing feedback does appear to help Arabic language learners, but we did not have enough data to see which form of feedback is most helpful. However, we provided an informative analysis of behaviour patterns to see how Arabic students used the tool and interacted with it, which could be useful for more data analysis.
|
116 |
Computer-generated speech training versus natural speech training at various task difficulty levelsFillpot, James Michael 01 January 1991 (has links)
Performance degradation -- Training from natural vs. automated voice.
|
117 |
Discrimination of “Hot Potato Voice” Caused by Upper Airway Obstruction Utilizing a Support Vector Machine / サポートベクトルマシンを用いた上気道狭窄により生ずる「含み声」の判別Fujimura, Shintaro 23 March 2020 (has links)
京都大学 / 0048 / 新制・論文博士 / 博士(医学) / 乙第13325号 / 論医博第2193号 / 新制||医||1043(附属図書館) / (主査)教授 黒田 知宏, 教授 藤渕 航, 教授 別所 和久 / 学位規則第4条第2項該当 / Doctor of Medical Science / Kyoto University / DFAM
|
118 |
Razvoj matematičkog modela trajanja glasova u automatskoj sintezi govora na srpskom jeziku / The Development of Phone Duration Model in Speech Synthesis in theSerbian LanguageSovilj-Nikić Sandra 10 July 2014 (has links)
<p>U okviru ove disertacije razvijeno je više različitih modela trajanja glasova u srpskom jeziku primenom odgovarajućih metoda automatskog učenja. Izvršena je objektivna evaluacija razvijenih modela i njihovo međusobno poređenje na osnovu kvantitativnih pokazatelja kao što su RMSE(engl. root-mean-squared error), MAE (engl. mean absolute error) i CC (engl. correlation coefficient). Takođe je izvršeno poređenje modela za srpski jezik sa performansama modela razvijenih za druge jezike, pri čemu je uočeno da su performanse modela razvijenih u ovoj disertaciji uporedljive ili čak prevazilaze performanse modela koji su razvijeni za druge jezike.</p> / <p>In this dissertation several different phone duration models of the Serbain<br />language using appropriate machine learning algorithms were developed.<br />The objective evaluation of the models obtained and their mutual comparison<br />based on quantitative measures such as RMSE (root-mean-squared error),<br />MAE (mean absolute error) and CC (correlation coefficient) were performed.<br />The comparison of the models developed for the Serbian language with the<br />performances of the models developed for other languages is also carried<br />out. It was observed that the performances of the models developed in this<br />dissertation are comparable or even outperform the performances of the<br />models that have been developed for other languages.</p>
|
119 |
Use of synthetic speech in tests of speech discriminationGordon, Jane S. 01 January 1985 (has links)
The purpose of this study was to develop two tape-recorded synthetic speech discrimination test tapes and assess their intelligibility in order to determine whether or not synthetic speech was intelligible and if it would prove useful in speech discrimination testing. Four scramblings of the second MU-6 monosyllable word list were generated by the ECHO l C speech synthesizer using two methods of generating synthetic speech called TEXTALKER and SPEAKEASY. These stimuli were presented in one ear to forty normal-hearing adult subjects, 36 females and 4 males, at 60 dB HL under headphone&. Each subject listened to two different scramblings of the 50 monosyllable word list, one scrambling generated by TEXTALKER and the other scrambling generated by SPEAKEASY. The order in which the TEXTALKER and SPEAKEASY mode of presentation occurred as well as which ear to test per subject was randomly determined.
|
120 |
The design and construction of a special purpose computer for speech synthesis-by-rule.Steingart, Robert Jay January 1976 (has links)
Thesis. 1976. M.S.--Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science. / Microfiche copy available in Archives and Engineering. / Bibliography: leaves 165-167. / M.S.
|
Page generated in 0.0555 seconds