Spelling suggestions: "subject:"epeech synthesis"" "subject:"cpeech synthesis""
41 |
Connectionist multivariate density-estimation and its application to speech synthesisUria, Benigno January 2016 (has links)
Autoregressive models factorize a multivariate joint probability distribution into a product of one-dimensional conditional distributions. The variables are assigned an ordering, and the conditional distribution of each variable modelled using all variables preceding it in that ordering as predictors. Calculating normalized probabilities and sampling has polynomial computational complexity under autoregressive models. Moreover, binary autoregressive models based on neural networks obtain statistical performances similar to that of some intractable models, like restricted Boltzmann machines, on several datasets. The use of autoregressive probability density estimators based on neural networks to model real-valued data, while proposed before, has never been properly investigated and reported. In this thesis we extend the formulation of neural autoregressive distribution estimators (NADE) to real-valued data; a model we call the real-valued neural autoregressive density estimator (RNADE). Its statistical performance on several datasets, including visual and auditory data, is reported and compared to that of other models. RNADE obtained higher test likelihoods than other tractable models, while retaining all the attractive computational properties of autoregressive models. However, autoregressive models are limited by the ordering of the variables inherent to their formulation. Marginalization and imputation tasks can only be solved analytically if the missing variables are at the end of the ordering. We present a new training technique that obtains a set of parameters that can be used for any ordering of the variables. By choosing a model with a convenient ordering of the dimensions at test time, it is possible to solve any marginalization and imputation tasks analytically. The same training procedure also makes it practical to train NADEs and RNADEs with several hidden layers. The resulting deep and tractable models display higher test likelihoods than the equivalent one-hidden-layer models for all the datasets tested. Ensembles of NADEs or RNADEs can be created inexpensively by combining models that share their parameters but differ in the ordering of the variables. These ensembles of autoregressive models obtain state-of-the-art statistical performances for several datasets. Finally, we demonstrate the application of RNADE to speech synthesis, and confirm that capturing the phone-conditional dependencies of acoustic features improves the quality of synthetic speech. Our model generates synthetic speech that was judged by naive listeners as being of higher quality than that generated by mixture density networks, which are considered a state-of-the-art synthesis technique.
|
42 |
The contribution of phonotactic and lexical information in the segmentation of multi-word utterances /Shoaf, Lisa Contos January 2002 (has links)
No description available.
|
43 |
Speech Synthesis Utilizing Microcomputer ControlUzel, Joseph N. 01 October 1978 (has links) (PDF)
This report explores the subject of speech synthesis. Information given includes a brief explanation of speech production in man, an historical view of speech synthesis, and four types of electronic synthesizers in use today. Also included is a brief presentation on phonetics, the study of speech sounds. An understanding of this subject is necessary to see how a synthesizer must produce certain sounds, and how these sounds are put together to create words. Finally a description of a limited text speech synthesizer is presented. This system allows the user to enter English text via a keyboard and have it output in spoken form. The future of speech synthesis appears to be very bright. This report also gives some possible applications of verbal computer communication.
|
44 |
A text-to-speech synthesis system for Xitsonga using hidden Markov modelsBaloyi, Ntsako January 2012 (has links)
Thesis (M.Sc. (Computer Science) --University of Limpopo, 2013 / This research study focuses on building a general-purpose working Xitsonga speech synthesis system that is as far as can be possible reasonably intelligible, natural sounding, and flexible. The system built has to be able to model some of the desirable speaker characteristics and speaking styles. This research project forms part of the broader national speech technology project that aims at developing spoken language systems for human-machine interaction using the eleven official languages of South Africa (SA). Speech synthesis is the reverse of automatic speech recognition (which receives speech as input and converts it to text) in that it receives text as input and produces synthesized speech as output. It is generally accepted that most people find listening to spoken utterances better that reading the equivalent of such utterances.
The Xitsonga speech synthesis system has been developed using a hidden Markov model (HMM) speech synthesis method. The HMM-based speech synthesis (HTS) system synthesizes speech that is intelligible, and natural sounding. This method can synthesize speech on a footprint of only a few megabytes of training speech data. The HTS toolkit is applied as a patch to the HTK toolkit which is a hidden Markov model toolkit primarily designed for use in speech recognition to build and manipulate hidden Markov models.
|
45 |
Overcoming the limitations of statistical parametric speech synthesisMerritt, Thomas January 2017 (has links)
At the time of beginning this thesis, statistical parametric speech synthesis (SPSS) using hidden Markov models (HMMs) was the dominant synthesis paradigm within the research community. SPSS systems are effective at generalising across the linguistic contexts present in training data to account for inevitable unseen linguistic contexts at synthesis-time, making these systems flexible and their performance stable. However HMM synthesis suffers from a ‘ceiling effect’ in the naturalness achieved, meaning that, despite great progress, the speech output is rarely confused for natural speech. There are many hypotheses for the causes of reduced synthesis quality, and subsequent required improvements, for HMM speech synthesis in literature. However, until this thesis, these hypothesised causes were rarely tested. This thesis makes two types of contributions to the field of speech synthesis; each of these appears in a separate part of the thesis. Part I introduces a methodology for testing hypothesised causes of limited quality within HMM speech synthesis systems. This investigation aims to identify what causes these systems to fall short of natural speech. Part II uses the findings from Part I of the thesis to make informed improvements to speech synthesis. The usual approach taken to improve synthesis systems is to attribute reduced synthesis quality to a hypothesised cause. A new system is then constructed with the aim of removing that hypothesised cause. However this is typically done without prior testing to verify the hypothesised cause of reduced quality. As such, even if improvements in synthesis quality are observed, there is no knowledge of whether a real underlying issue has been fixed or if a more minor issue has been fixed. In contrast, I perform a wide range of perceptual tests in Part I of the thesis to discover what the real underlying causes of reduced quality in HMM synthesis are and the level to which they contribute. Using the knowledge gained in Part I of the thesis, Part II then looks to make improvements to synthesis quality. Two well-motivated improvements to standard HMM synthesis are investigated. The first of these improvements follows on from averaging across differing linguistic contexts being identified as a major contributing factor to reduced synthesis quality. This is a practice typically performed during decision tree regression in HMM synthesis. Therefore a system which removes averaging across differing linguistic contexts and instead performs averaging only across matching linguistic contexts (called rich-context synthesis) is investigated. The second of the motivated improvements follows the finding that the parametrisation (i.e., vocoding) of speech, standard practice in SPSS, introduces a noticeable drop in quality before any modelling is even performed. Therefore the hybrid synthesis paradigm is investigated. These systems aim to remove the effect of vocoding by using SPSS to inform the selection of units in a unit selection system. Both of the motivated improvements applied in Part II are found to make significant gains in synthesis quality, demonstrating the benefit of performing the style of perceptual testing conducted in the thesis.
|
46 |
A Research Bed For Unit Selection Based Text To Speech Synthesis SystemKonakanchi, Parthasarathy 02 1900 (has links) (PDF)
After trying Festival Speech Synthesis System, we decided to develop our own TTS framework, conducive to perform the necessary research experiments for developing good quality TTS for Indian languages. In most of the attempts on Indian language TTS, there is no prosody model, provision for handling foreign language words and no phrase break prediction leading to the possibility of introducing appropriate pauses in the synthesized speech. Further, in the Indian context, there is a real felt need for a bilingual TTS, involving English, along with the Indian language. In fact, it may be desirable to also have a trilingual TTS, which can also take care of the language of the neighboring state or Hindi, in addition. Thus, there is a felt need for a full-fledged TTS development framework, which lends itself for experimentation involving all the above issues and more.
This thesis work is therefore such a serious attempt to develop a modular, unit selection based TTS framework. The developed system has been tested for its effectiveness to create intelligible speech in Tamil and Kannada. The created system has also been used to carry out two research experiments on TTS.
The first part of the work is the design and development of corpus-based concatenative Tamil speech synthesizer in Matlab and C. A synthesis database has been created with 1027 phonetically rich, pre-recorded sentences, segmented at the phone level. From the sentence to be synthesized, specifications of the required target units are predicted. During synthesis, database units are selected that best match the target specification according to a distance metric and a concatenation quality metric. To accelerate matching, the features of the end frames of the database units have been precomputed and stored. The selected units are concatenated to produce synthetic speech. The high values of the obtained mean opinion scores for the TTS output reveal that speech synthesized using our TTS is intelligible and acceptably natural and can possibly be put to commercial use with some additional features. Experiments carried out by others using my TTS framework have shown that, whenever the required phonetic context is not available in the synthesis database., similar phones that are perceptually indistinguishable may be substituted.
The second part of the work deals with the design and modification of the developed TTS framework to be embedded in mobile phones. Commercial GSM FR, EFR and AMR speech codecs are used for compressing our synthesis database. Perception experiments reveal that speech synthesized using a highly compressed database is reasonably natural. This holds promise in the future to read SMSs and emails on mobile phones in Indian languages. Finally, we observe that incorporating prosody and pause models for Indian language TTS would further enhance the quality of the synthetic speech. These are some of the potential, unexplored areas ahead, for research in speech synthesis in Indian languages.
|
47 |
Modelling and extraction of fundamental frequency in speech signalsPawi, Alipah January 2014 (has links)
One of the most important parameters of speech is the fundamental frequency of vibration of voiced sounds. The audio sensation of the fundamental frequency is known as the pitch. Depending on the tonal/non-tonal category of language, the fundamental frequency conveys intonation, pragmatics and meaning. In addition the fundamental frequency and intonation carry speaker gender, age, identity, speaking style and emotional state. Accurate estimation of the fundamental frequency is critically important for functioning of speech processing applications such as speech coding, speech recognition, speech synthesis and voice morphing. This thesis makes contributions to the development of accurate pitch estimation research in three distinct ways: (1) an investigation of the impact of the window length on pitch estimation error, (2) an investigation of the use of the higher order moments and (3) an investigation of an analysis-synthesis method for selection of the best pitch value among N proposed candidates. Experimental evaluations show that the length of the speech window has a major impact on the accuracy of pitch estimation. Depending on the similarity criteria and the order of the statistical moment a window length of 37 to 80 ms gives the least error. In order to avoid excessive delay as a consequence of using a longer window, a method is proposed ii where the current short window is concatenated with the previous frames to form a longer signal window for pitch extraction. The use of second order and higher order moments, and the magnitude difference function, as the similarity criteria were explored and compared. A novel method of calculation of moments is introduced where the signal is split, i.e. rectified, into positive and negative valued samples. The moments for the positive and negative parts of the signal are computed separately and combined. The new method of calculation of moments from positive and negative parts and the higher order criteria provide competitive results. A challenging issue in pitch estimation is the determination of the best candidate from N extrema of the similarity criteria. The analysis-synthesis method proposed in this thesis selects the pitch candidate that provides the best reproduction (synthesis) of the harmonic spectrum of the original speech. The synthesis method must be such that the distortion increases with the increasing error in the estimate of the fundamental frequency. To this end a new method of spectral synthesis is proposed using an estimate of the spectral envelop and harmonically spaced asymmetric Gaussian pulses as excitation. The N-best method provides consistent reduction in pitch estimation error. The methods described in this thesis result in a significant improvement in the pitch accuracy and outperform the benchmark YIN method.
|
48 |
Intelligibility enhancement of synthetic speech in noiseValentini Botinhão, Cássia January 2013 (has links)
Speech technology can facilitate human-machine interaction and create new communication interfaces. Text-To-Speech (TTS) systems provide speech output for dialogue, notification and reading applications as well as personalized voices for people that have lost the use of their own. TTS systems are built to produce synthetic voices that should sound as natural, expressive and intelligible as possible and if necessary be similar to a particular speaker. Although naturalness is an important requirement, providing the correct information in adverse conditions can be crucial to certain applications. Speech that adapts or reacts to different listening conditions can in turn be more expressive and natural. In this work we focus on enhancing the intelligibility of TTS voices in additive noise. For that we adopt the statistical parametric paradigm for TTS in the shape of a hidden Markov model (HMM-) based speech synthesis system that allows for flexible enhancement strategies. Little is known about which human speech production mechanisms actually increase intelligibility in noise and how the choice of mechanism relates to noise type, so we approached the problem from another perspective: using mathematical models for hearing speech in noise. To find which models are better at predicting intelligibility of TTS in noise we performed listening evaluations to collect subjective intelligibility scores which we then compared to the models’ predictions. In these evaluations we observed that modifications performed on the spectral envelope of speech can increase intelligibility significantly, particularly if the strength of the modification depends on the noise and its level. We used these findings to inform the decision of which of the models to use when automatically modifying the spectral envelope of the speech according to the noise. We devised two methods, both involving cepstral coefficient modifications. The first was applied during extraction while training the acoustic models and the other when generating a voice using pre-trained TTS models. The latter has the advantage of being able to address fluctuating noise. To increase intelligibility of synthetic speech at generation time we proposed a method for Mel cepstral coefficient modification based on the glimpse proportion measure, the most promising of the models of speech intelligibility that we evaluated. An extensive series of listening experiments demonstrated that this method brings significant intelligibility gains to TTS voices while not requiring additional recordings of clear or Lombard speech. To further improve intelligibility we combined our method with noise-independent enhancement approaches based on the acoustics of highly intelligible speech. This combined solution was as effective for stationary noise as for the challenging competing speaker scenario, obtaining up to 4dB of equivalent intensity gain. Finally, we proposed an extension to the speech enhancement paradigm to account for not only energetic masking of signals but also for linguistic confusability of words in sentences. We found that word level confusability, a challenging value to predict, can be used as an additional prior to increase intelligibility even for simple enhancement methods like energy reallocation between words. These findings motivate further research into solutions that can tackle the effect of energetic masking on the auditory system as well as on higher levels of processing.
|
49 |
HMM-based speech synthesis using an acoustic glottal source modelCabral, Joao P. January 2011 (has links)
Parametric speech synthesis has received increased attention in recent years following the development of statistical HMM-based speech synthesis. However, the speech produced using this method still does not sound as natural as human speech and there is limited parametric flexibility to replicate voice quality aspects, such as breathiness. The hypothesis of this thesis is that speech naturalness and voice quality can be more accurately replicated by a HMM-based speech synthesiser using an acoustic glottal source model, the Liljencrants-Fant (LF) model, to represent the source component of speech instead of the traditional impulse train. Two different analysis-synthesis methods were developed during this thesis, in order to integrate the LF-model into a baseline HMM-based speech synthesiser, which is based on the popular HTS system and uses the STRAIGHT vocoder. The first method, which is called Glottal Post-Filtering (GPF), consists of passing a chosen LF-model signal through a glottal post-filter to obtain the source signal and then generating speech, by passing this source signal through the spectral envelope filter. The system which uses the GPF method (HTS-GPF system) is similar to the baseline system, but it uses a different source signal instead of the impulse train used by STRAIGHT. The second method, called Glottal Spectral Separation (GSS), generates speech by passing the LF-model signal through the vocal tract filter. The major advantage of the synthesiser which incorporates the GSS method, named HTS-LF, is that the acoustic properties of the LF-model parameters are automatically learnt by the HMMs. In this thesis, an initial perceptual experiment was conducted to compare the LFmodel to the impulse train. The results showed that the LF-model was significantly better, both in terms of speech naturalness and replication of two basic voice qualities (breathy and tense). In a second perceptual evaluation, the HTS-LF system was better than the baseline system, although the difference between the two had been expected to be more significant. A third experiment was conducted to evaluate the HTS-GPF system and an improved HTS-LF system, in terms of speech naturalness, voice similarity and intelligibility. The results showed that the HTS-GPF system performed similarly to the baseline. However, the HTS-LF system was significantly outperformed by the baseline. Finally, acoustic measurements were performed on the synthetic speech to investigate the speech distortion in the HTS-LF system. The results indicated that a problem in replicating the rapid variations of the vocal tract filter parameters at transitions between voiced and unvoiced sounds is the most significant cause of speech distortion. This problem encourages future work to further improve the system.
|
50 |
An acoustically-driven vocal tract model for stop consonant productionStory, Brad H., Bunton, Kate 03 1900 (has links)
The purpose of this study was to further develop a multi-tier model of the vocal tract area function in which the modulations of shape to produce speech are generated by the product of a vowel substrate and a consonant superposition function. The new approach consists of specifying input parameters for a target consonant as a set of directional changes in the resonance frequencies of the vowel substrate. Using calculations of acoustic sensitivity functions, these "resonance deflection patterns" are transformed into time-varying deformations of the vocal tract shape without any direct specification of location or extent of the consonant constriction along the vocal tract. The configuration of the constrictions and expansions that are generated by this process were shown to be physiologically-realistic and produce speech sounds that are easily identifiable as the target consonants. This model is a useful enhancement for area function-based synthesis and can serve as a tool for understanding how the vocal tract is shaped by a talker during speech production. (C) 2016 Elsevier B.V. All rights reserved.
|
Page generated in 0.0852 seconds