81 |
A parametric monophone speech synthesis system / Klompje, Gideon
Thesis (MScEng (Electrical and Electronic Engineering))--University of Stellenbosch, 2006. / Speech is the primary and most natural means of communication between human beings.
With the rapid spread of technology across the globe and the increased number of personal
and public applications for digital equipment in recent years, the need for human/machine
interaction has increased dramatically. Synthetic speech is audible speech produced automatically by a machine. A text-to-speech (TTS) system is one that converts bodies of text into digital speech signals that can be heard and understood by a person.
Current TTS systems generally require large annotated speech corpora in the languages
for which they are developed. For many languages these resources are not available. In their
absence, a TTS system generates synthetic speech by means of mathematical algorithms
constrained by certain rules.
This thesis describes the design and implementation of a rule-based speech generation
algorithm for use in a TTS system. The system allows the type, emphasis, pitch and other
parameters associated with a sound and its particular mode of articulation to be specified.
However, no attempt is made to model prosodic and other higher-level information; this is assumed to be known. The algorithm uses linear predictive (LP) models of monophone
speech units, which greatly reduces the amount of data required for development in a new
language. A novel approach to the interpolation of monophone speech units is presented
to allow realistic transitions between monophone units. Additionally, novel algorithms for
estimation and modelling of the harmonic and stochastic content of an excitation signal are
presented. This is used to determine the amount of voiced and unvoiced energy present in
individual speech sounds.
Promising results were obtained when evaluating the developed system’s South African
English speech output using two widely used speech intelligibility tests, namely the modified
rhyme test (MRT) and semantically unpredictable sentences (SUS).
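The LP modelling and monophone interpolation described above can be sketched as follows. This is an illustrative reconstruction, not the thesis's actual algorithm: it estimates an all-pole LP model per monophone via the Levinson-Durbin recursion, and blends two monophones by interpolating their reflection coefficients, one plausible route to realistic transitions, because any convex combination of reflection coefficients with magnitude below one yields a stable intermediate filter.

```python
import numpy as np

def lpc(frame, order=10):
    """Levinson-Durbin recursion: autocorrelation -> LP and reflection coefficients."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = r[0]
    k = np.zeros(order)
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        ki = -acc / e
        k[i - 1] = ki
        a_new = a.copy()
        for j in range(1, i):
            a_new[j] = a[j] + ki * a[i - j]
        a_new[i] = ki
        a = a_new
        e *= 1.0 - ki * ki          # residual prediction error shrinks each order
    return a, k

def reflection_to_lpc(k):
    """Step-up recursion: reflection coefficients -> LP polynomial coefficients."""
    a = np.zeros(len(k) + 1)
    a[0] = 1.0
    for i, ki in enumerate(k, start=1):
        a_new = a.copy()
        for j in range(1, i):
            a_new[j] = a[j] + ki * a[i - j]
        a_new[i] = ki
        a = a_new
    return a

def interpolate_monophones(k_a, k_b, t):
    """Blend two monophone models at mixing fraction t in [0, 1].
    If |k| < 1 for both inputs, the blend also satisfies |k| < 1,
    so every intermediate filter along the transition is stable."""
    return reflection_to_lpc((1.0 - t) * k_a + t * k_b)
```

Interpolating in the reflection-coefficient (or equivalently line spectral frequency) domain, rather than on the LP polynomial directly, is the standard way to guarantee stability of the interpolated filters.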
|
82 |
Developing an enriched natural language grammar for prosodically-improved content-to-speech synthesis / Marais, Laurette
The need for interacting with machines using spoken natural language is growing,
along with the expectation that synthetic speech in this context sound
natural. Such interaction includes answering questions, where prosody plays an
important role in producing natural English synthetic speech by communicating
the information structure of utterances.
Combinatory categorial grammar (CCG) is a theoretical framework that exploits the notion that, in English, information structure, prosodic structure and syntactic structure are isomorphic.
This provides a way to convert a semantic representation of an utterance into
a prosodically natural spoken utterance. The Grammatical Framework (GF) is a framework for writing grammars in which abstract tree structures capture semantic structure and concrete grammars render these structures as linearised strings. This research combines
these frameworks to develop a system that converts semantic representations
of utterances into linearised strings of natural language that are marked up to
inform the prosody-generating component of a speech synthesis system. / Computing / M. Sc. (Computing)
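The idea of marking up a linearised string so that information structure drives the prosody component can be illustrated with a toy sketch. This is not GF or CCG code: the function, the markup format, and the segment representation are invented for illustration. The tune inventory (an L+H* accent with an LH% boundary for the theme, an H* accent with an LL% boundary for the rheme) follows Steedman-style CCG analyses of English.

```python
# Toy mapping from information-structure roles to pitch-accent markup.
# Accents and boundary tones follow Steedman-style CCG analyses of
# English; everything else here is invented for illustration only.
ACCENT = {"theme": "L+H*", "rheme": "H*"}
BOUNDARY = {"theme": "LH%", "rheme": "LL%"}

def mark_up(segments):
    """segments: list of (role, words, focused_word) tuples, where role is
    'theme' or 'rheme'. Returns a linearised string in which the focused
    word of each segment carries a pitch accent and the segment ends with
    the corresponding boundary tone."""
    out = []
    for role, words, focus in segments:
        chunk = [f"{w}[{ACCENT[role]}]" if w == focus else w for w in words]
        out.append(" ".join(chunk) + f" {BOUNDARY[role]}")
    return " ".join(out)
```

For the classic question-answer pair "Who married Manny?" / "ANNA married Manny", the answer's theme is "married Manny" and its rheme is "Anna", and the markup places the accents accordingly; a speech synthesiser's prosody component would then realise the accents and boundary tones as f0 events.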
|
83 |
Synthèse de parole expressive à partir du texte : des phonostyles au contrôle gestuel pour la synthèse paramétrique statistique / Expressive Text-to-Speech Synthesis: From Phonostyles to Gestural Control for Statistical Parametric Synthesis / Evrard, Marc, 30 September 2015
The subject of this thesis was the study and design of a platform for expressive speech synthesis. The LIPS3 text-to-speech system, developed in the context of this thesis, includes a linguistic module and a statistical parametric module (built upon HTS and STRAIGHT). The system was based on a new single-speaker corpus, designed, recorded and annotated for this work.
The first study analyzed the influence of the precision of the training corpus's phonetic labelling on synthesis quality. It showed that statistical parametric synthesis is robust to labelling and alignment errors, which addresses the issue of variation in phonetic realisations in expressive speech.
The second study presents an acoustic-phonetic analysis of the corpus, characterizing the expressive space used by the speaker to instantiate the instructions that described the different expressive conditions. Voice source parameters and articulatory settings were analyzed according to their phonetic classes, allowing a fine phonostylistic characterization.
The third study focused on intonation and rhythm. Calliphony 2.0 is a real-time chironomic interface that controls the f0 and rhythmic parameters of prosody using drawing/writing hand gestures with a stylus and a graphic tablet. These hand-controlled modulations are applied directly to the vocoder parameters, enhancing the TTS output without degradation. Intonation and rhythm stylization using this interface brings significant improvement to the prototypicality of expressivity, as well as to the general perceived quality, in comparison with statistical modelling of prosody.
These studies show that statistical parametric synthesis, combined with a chironomic interface, offers an efficient solution for expressive speech synthesis, as well as a powerful tool for the study of prosody.
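The chironomic control described above, hand gestures modulating the f0 and tempo of vocoder parameter tracks, can be sketched minimally as follows. This is not Calliphony's implementation: the function name, the frame-level representation, and the uniform tempo factor are assumptions made for illustration.

```python
import numpy as np

def apply_gesture(f0, pitch_gain, tempo_factor):
    """Apply gesture-derived prosody control to a frame-level f0 track.
    pitch_gain is a per-frame multiplier (drawn by the stylus) and
    tempo_factor uniformly stretches (>1, slower) or compresses (<1,
    faster) the track by resampling it to a new number of frames."""
    f0_mod = f0 * pitch_gain                          # direct pitch control
    new_len = max(1, int(round(len(f0) * tempo_factor)))
    src = np.linspace(0.0, len(f0) - 1.0, new_len)    # new time axis
    return np.interp(src, np.arange(len(f0)), f0_mod)
```

Because the modification is applied to the vocoder's parameter tracks before waveform generation, rather than to the synthesized signal, it avoids the quality loss of post-hoc signal-domain pitch shifting, which is the property the thesis exploits.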
|
84 |
Suprasegmental representations for the modeling of fundamental frequency in statistical parametric speech synthesis / Fonseca De Sam Bento Ribeiro, Manuel, January 2018
Statistical parametric speech synthesis (SPSS) has seen improvements over recent years, especially in terms of intelligibility. Synthetic speech is often clear and understandable, but it can also be bland and monotonous. Proper generation of natural speech prosody is still a largely unsolved problem. This is relevant especially in the context of expressive audiobook speech synthesis, where speech is expected to be fluid and captivating. In general, prosody can be seen as a layer that is superimposed on the segmental (phone) sequence. Listeners can perceive the same melody or rhythm in different utterances, and the same segmental sequence can be uttered with a different prosodic layer to convey a different message. For this reason, prosody is commonly accepted to be inherently suprasegmental. It is governed by longer units within the utterance (e.g. syllables, words, phrases) and beyond the utterance (e.g. discourse). However, common techniques for the modeling of speech prosody - and speech in general - operate mainly on very short intervals, either at the state or frame level, in both hidden Markov model (HMM) and deep neural network (DNN) based speech synthesis. This thesis presents contributions supporting the claim that stronger representations of suprasegmental variation are essential for the natural generation of fundamental frequency for statistical parametric speech synthesis. We conceptualize the problem by dividing it into three sub-problems: (1) representations of acoustic signals, (2) representations of linguistic contexts, and (3) the mapping of one representation to another. The contributions of this thesis provide novel methods and insights relating to these three sub-problems. In terms of sub-problem 1, we propose a multi-level representation of f0 using the continuous wavelet transform and the discrete cosine transform, as well as a wavelet-based decomposition strategy that is linguistically and perceptually motivated. 
In terms of sub-problem 2, we investigate additional linguistic features such as text-derived word embeddings and syllable bag-of-phones and we propose a novel method for learning word vector representations based on acoustic counts. Finally, considering sub-problem 3, insights are given regarding hierarchical models such as parallel and cascaded deep neural networks.
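One of the acoustic representations mentioned for sub-problem 1, a compact transform-domain description of a suprasegmental f0 contour, can be sketched with an orthonormal DCT, encoding each syllable's contour with its first few coefficients. This is a generic illustration of the idea, not the thesis's wavelet-based decomposition; the contour and coefficient count are invented.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix (rows are basis vectors)."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    m = np.cos(np.pi * k * (2 * t + 1) / (2 * n)) * np.sqrt(2.0 / n)
    m[0] /= np.sqrt(2.0)     # DC row gets the 1/sqrt(n) scaling
    return m

def encode_f0(contour, n_coef):
    """Keep only the first n_coef DCT coefficients of a syllable's f0 contour."""
    return (dct_matrix(len(contour)) @ contour)[:n_coef]

def decode_f0(coefs, length):
    """Reconstruct a smooth contour from truncated coefficients.
    The basis is orthonormal, so the inverse transform is the transpose."""
    full = np.zeros(length)
    full[:len(coefs)] = coefs
    return dct_matrix(length).T @ full
```

A handful of coefficients captures the slowly varying, suprasegmental shape of the contour while discarding frame-level jitter, which is exactly the property that makes such representations attractive as prediction targets for f0 models operating above the frame level.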
|
86 |
Study of unit selection text-to-speech synthesis algorithms / Étude des algorithmes de sélection d'unités pour la synthèse de la parole à partir du texte / Guennec, David, 22 September 2016
This PhD thesis focuses on the automatic speech synthesis field, and more specifically on unit selection. A deep analysis and a diagnosis of the unit selection algorithm (the lattice search algorithm) are provided. The importance of the optimality of the solution is discussed and a new unit selection implementation based on an A* algorithm is presented.
Three cost function enhancements are also presented. The first is a new way, within the target cost, to minimize important spectral differences by selecting sequences of candidate units that minimize a mean cost instead of an absolute one. This cost is tested on a phonemic duration distance but can be applied to others. The second is a target sub-cost addressing intonation, based on coefficients extracted through a generalized version of Fujisaki's command-response model; this model represents F0 with gamma functions called atoms. The third contribution is a penalty system that enhances the concatenation cost. It penalizes units according to classes that define the risk of a concatenation artifact occurring when concatenating on a phone of that class. This system differs from others in the literature in that it is tempered by a fuzzy function that softens the penalties for units whose concatenation costs are among the lowest in their distribution.
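The lattice search at the heart of unit selection can be sketched as follows: each target position has several candidate units, each candidate carries a target cost, each join carries a concatenation cost, and the search returns the sequence minimising the total. The sketch below uses uniform-cost search (A* with a zero heuristic, so the returned path is still optimal); the toy cost functions and unit labels are invented, and the thesis's actual system, heuristics, and cost functions are richer.

```python
import heapq

def select_units(candidates, target_cost, concat_cost):
    """candidates: one list of unit labels per target position.
    Returns (path, cost) minimising the summed target and concatenation
    costs over the lattice, via uniform-cost search with dominance
    pruning on (position, last unit) states."""
    heap = [(0.0, 0, ())]          # (accumulated cost, position, path so far)
    best = {}
    while heap:
        cost, pos, path = heapq.heappop(heap)
        if pos == len(candidates):
            return list(path), cost           # first goal popped is optimal
        for unit in candidates[pos]:
            c = cost + target_cost(pos, unit)
            if path:
                c += concat_cost(path[-1], unit)
            key = (pos + 1, unit)             # cheaper path to same state wins
            if best.get(key, float("inf")) <= c:
                continue
            best[key] = c
            heapq.heappush(heap, (c, pos + 1, path + (unit,)))
    return None, float("inf")
```

Note that a greedy per-position choice can be suboptimal: in the test below, the unit with the best target cost at position 0 would incur a large concatenation cost later, and the search correctly avoids it.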
|
87 |
Tone realisation for speech synthesis of Yorùbá / Van Niekerk, Daniel Rudolph, January 2014
Speech technologies such as text-to-speech synthesis (TTS) and automatic speech recognition (ASR) have recently generated much interest in the developed world as a user-interface medium to smartphones [1, 2]. However, it is also recognised that these technologies may potentially have a positive impact on the lives of those in the developing world, especially in Africa, by presenting an important medium for access to information where illiteracy and a lack of infrastructure play a limiting role [3, 4, 5, 6]. While these technologies continually experience important advances that keep extending their applicability to new and under-resourced languages, one particular area in need of further development is speech synthesis of African tone languages [7, 8]. The main objective of this work is acoustic modelling and synthesis of tone for an African tone language: Yorùbá. We present an empirical investigation to establish the acoustic properties of tone in Yorùbá, and to evaluate the resulting models integrated into a hidden Markov model based (HMM-based) TTS system. We show that in Yorùbá, which is considered a register tone language, the realisation of tone is not solely determined by pitch levels, but also by inter-syllable and intra-syllable pitch dynamics. Furthermore, our experimental results indicate that utterance-wide pitch patterns are not only a result of cumulative local pitch changes (terracing), but contain a significant gradual declination component. Lastly, models based on inter- and intra-syllable pitch dynamics using underlying linear pitch targets are shown to be relatively efficient and perceptually preferable to the current standard approach in statistical parametric speech synthesis, which employs HMM pitch models based on context-dependent phones. These findings support the applicability of the proposed models in under-resourced conditions. / PhD (Information Technology), North-West University, Vaal Triangle Campus, 2014
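The idea of generating an f0 contour from underlying linear pitch targets, with inter-syllable dynamics arising from the contour's state carrying over across syllable boundaries, can be sketched with a first-order target-approximation model. This is a simplified illustration in the spirit of such models, not the thesis's actual formulation; the rate constant, frame rate, and target parameters are invented.

```python
import numpy as np

def realise_targets(syllables, f0_start=100.0, frames_per_s=200, rate=40.0):
    """syllables: list of (slope_hz_per_s, intercept_hz, duration_s) tuples,
    each describing a syllable's underlying linear pitch target.
    The surface f0 exponentially approaches the current target,
    df0/dt = rate * (target - f0), and the final f0 of one syllable
    becomes the initial f0 of the next (cross-boundary carry-over)."""
    dt = 1.0 / frames_per_s
    f0, out = f0_start, []
    for slope, intercept, dur in syllables:
        for i in range(int(round(dur * frames_per_s))):
            target = intercept + slope * (i * dt)   # underlying linear target
            f0 += rate * (target - f0) * dt          # Euler step toward target
            out.append(f0)
    return np.array(out)
```

With a level target the contour rises monotonically toward the target height without reaching it in a short syllable, which is how such models produce the asymptotic, carry-over behaviour that static per-phone pitch models cannot.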
|
89 |
Wavelet-based speech enhancement: a statistical approach / Harmse, Wynand
Thesis (MScIng)--University of Stellenbosch, 2004. / Speech enhancement is the process of removing background noise from speech signals. The
equivalent process for images is known as image denoising. While the Fourier transform is
widely used for speech enhancement, image denoising typically uses the wavelet transform.
Research on wavelet-based speech enhancement has emerged only recently, yet it already shows promising results compared with Fourier-based methods. This line of research benefits from new wavelet denoising algorithms based on the statistical modelling of wavelet coefficients, such as the hidden Markov tree.
The aim of this research project is to investigate wavelet-based speech enhancement from
a statistical perspective. Current Fourier-based speech enhancement and its evaluation
process are described, and a framework is created for wavelet-based speech enhancement.
Several wavelet denoising algorithms are investigated, and it is found that the algorithms
based on the statistical properties of speech in the wavelet domain outperform the classical
and more heuristic denoising techniques. The choice of wavelet influences the quality of the
enhanced speech and the effect of this choice is therefore examined. The introduction of a
noise floor parameter also improves the perceptual quality of the wavelet-based enhanced
speech, by masking annoying residual artifacts. The performance of wavelet-based speech
enhancement is similar to that of the more widely used Fourier methods at low noise
levels, with a slight difference in the residual artifact. At high noise levels, however, the
Fourier methods are superior.
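The core mechanics described above, shrinking wavelet coefficients and keeping a noise floor to mask residual artifacts, can be sketched with a single-level Haar transform and soft thresholding. This is a minimal sketch only: the thesis's framework uses multi-level transforms, statistical coefficient models such as the hidden Markov tree, and a principled threshold, whereas the threshold and floor values here are illustrative.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def haar_forward(x):
    """One level of the Haar DWT (x must have even length)."""
    approx = (x[0::2] + x[1::2]) / SQRT2
    detail = (x[0::2] - x[1::2]) / SQRT2
    return approx, detail

def haar_inverse(approx, detail):
    """Exact inverse of haar_forward (the transform is orthonormal)."""
    out = np.empty(2 * len(approx))
    out[0::2] = (approx + detail) / SQRT2
    out[1::2] = (approx - detail) / SQRT2
    return out

def denoise(x, threshold, floor=0.1):
    """Soft-threshold the detail coefficients, but always keep at least
    `floor` of each coefficient: the retained low-level noise masks the
    'musical noise' artifacts that hard gaps would otherwise create."""
    a, d = haar_forward(x)
    shrunk = np.sign(d) * np.maximum(np.abs(d) - threshold, floor * np.abs(d))
    return haar_inverse(a, shrunk)
```

Because the transform is orthonormal, shrinking the detail coefficients strictly reduces signal energy, and with threshold zero and no floor the signal is reconstructed exactly.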
|
90 |
Speech generation in a spoken dialogue system / Visagie, Albertus Sybrand
Thesis (MScIng)--University of Stellenbosch, 2004. / Spoken dialogue systems accessed over the telephone network are rapidly becoming more
popular as a means to reduce call-centre costs and improve customer experience. It is
now technologically feasible to delegate repetitive and relatively simple tasks conducted
in most telephone calls to automatic systems. Such a system uses speech recognition to
take input from users. This work focuses on the speech generation component that a
specific prototype system uses to convey audible speech output back to the user.
Many commercial systems contain general text-to-speech synthesisers. Text-to-speech
synthesis is a very active branch of speech processing. It aims to build machines that
read text aloud. In some languages this has been a reality for almost two decades. While
these synthesisers are often very understandable, they almost never sound natural. The
output quality of synthetic speech is considered to be a very important factor in the user’s
perception of the quality and usability of spoken dialogue systems.
The static nature of the spoken dialogue system is exploited to produce a custom
speech synthesis component that provides very high quality output speech for the particular
application. To this end the current state of the art in speech synthesis is surveyed
and summarised. A unit-selection synthesiser is produced that functions in Afrikaans,
English and Xhosa.
The unit-selection synthesiser selects short waveforms from a recorded speech corpus,
and concatenates them to produce the required utterances. Techniques are developed for
designing a compact corpus and processing it to produce a unit-selection database. Speech
modification methods were researched to build a framework for natural-sounding speech
concatenation. This framework also provides pitch and duration modification capabilities
that will enable research in languages such as Afrikaans and Xhosa where text-to-speech
capabilities are relatively immature.
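The concatenation step, joining the selected waveform units into an utterance while keeping the joins inaudible, can be sketched with a short linear crossfade at each boundary. This is one standard smoothing technique, not necessarily the one this system uses, and the fade length is an illustrative assumption.

```python
import numpy as np

def concatenate_units(units, fade=64):
    """Join waveform units with a linear crossfade of `fade` samples:
    the tail of each accumulated signal fades out while the head of
    the next unit fades in, reducing discontinuities at the join."""
    out = units[0].astype(float)
    ramp = np.linspace(0.0, 1.0, fade)
    for unit in units[1:]:
        unit = unit.astype(float)
        # overlap region: weighted sum of outgoing tail and incoming head
        out[-fade:] = out[-fade:] * (1.0 - ramp) + unit[:fade] * ramp
        out = np.concatenate([out, unit[fade:]])
    return out
```

In practice the joins would also be placed at pitch-synchronous positions and combined with pitch and duration modification, as described above, but the crossfade alone already removes the click that a hard splice of mismatched samples produces.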
|