  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Statistical analysis, modelling and synthesis of voice for text to speech synthesis

Low, Phuay Hui January 2004 (has links)
No description available.
2

Databases for concatenative text-to-speech synthesis systems : unit selection and knowledge-based approach

Lambert, Tanya January 2005 (has links)
No description available.
3

Robust acoustic speech feature prediction from Mel frequency cepstral coefficients

Darch, Jonathan J. A. January 2008 (has links)
No description available.
4

Analysis and modelling of voice parameters and adaptation of a speech synthesis system to a specific speaker

Δαρσίνος, Βασίλειος 16 September 2009 (has links)
No description available.
5

Synthesis and evaluation of conversational characteristics in speech synthesis

Andersson, Johan Sebastian January 2013 (has links)
Conventional synthetic voices can synthesise neutral read-aloud speech well. But to make synthetic speech suitable for a wider range of applications, voices need to express more than just the word identity: we need voices that can take part in a conversation and express, for example, agreement, disagreement, or hesitation in a natural and believable manner. Two frameworks currently dominate speech synthesis: unit selection and HMM-based speech synthesis. Both utilise recordings of human speech to build synthetic voices. Although the content of the recordings determines which segmental and prosodic phenomena can be synthesised, surprisingly little research has been done on exploiting the corpus to extend the limited behaviour of conventional synthetic voices. In this thesis we show how natural-sounding conversational characteristics can be added to both unit selection and HMM-based synthetic voices by adding speech from a spontaneous conversation to the voices. We recorded a spontaneous conversation and, by manually transcribing and selecting utterances, obtained approximately two thousand utterances from it. These conversational utterances were rich in conversational speech phenomena, but they lacked the general coverage that allows unit selection and HMM-based synthesis techniques to synthesise high-quality speech. We therefore investigated a number of blending approaches in which the conversational utterances were augmented with conventional read-aloud speech. Synthetic voices that contained conversational speech were contrasted with conventional voices without it. Perceptual evaluations showed that listeners generally perceived the conversational voices as having a more conversational style than the conventional voices.
This conversational style was largely due to the conversational voices' ability to synthesise utterances containing conversational speech phenomena more naturally than the conventional voices could. Additionally, we conducted an experiment showing that natural-sounding conversational characteristics in synthetic speech can convey pragmatic information to a listener, in our case an impression of certainty or uncertainty about a topic. We conclude that the limited behaviour of conventional synthetic voices can be enriched by utilising conversational speech in both unit selection and HMM-based speech synthesis.
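The blending idea above can be given a minimal sketch: a unit-selection target cost with a soft style penalty prefers conversational units where they exist, while read-aloud units still fill coverage gaps. The `Unit` record, weights, and greedy search below are illustrative assumptions, not the thesis's implementation.

```python
# Toy sketch of a blended unit-selection inventory with a soft style penalty.
from dataclasses import dataclass

@dataclass
class Unit:
    phone: str
    style: str  # "conversational" or "read"

def target_cost(unit, target_phone, target_style, style_weight=0.5):
    """Hard penalty for a phone mismatch, soft penalty for a style mismatch."""
    cost = 0.0 if unit.phone == target_phone else 10.0
    if unit.style != target_style:
        cost += style_weight  # soft: read-aloud units can still be selected
    return cost

def select(inventory, target_phones, target_style):
    """Greedy selection: cheapest unit per target (join costs omitted)."""
    return [min(inventory, key=lambda u: target_cost(u, p, target_style))
            for p in target_phones]

inventory = [Unit("ah", "conversational"), Unit("ah", "read"), Unit("m", "read")]
chosen = select(inventory, ["ah", "m"], "conversational")
# "ah" comes from the conversational recordings; "m" falls back to read speech
```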
6

Identifying prosodic prominence patterns for English text-to-speech synthesis

Badino, Leonardo January 2010 (has links)
This thesis proposes to improve and enrich the expressiveness of English text-to-speech (TTS) synthesis by identifying and generating natural patterns of prosodic prominence. In most state-of-the-art TTS systems, the prediction from text of prosodic prominence relations between words in an utterance relies on features that only loosely account for the combined effects of syntax, semantics, word informativeness and salience on prosodic prominence. To improve prosodic prominence prediction, we first follow the classic approach in which prosodic prominence patterns are flattened into binary sequences of pitch-accented and unaccented words. We propose and motivate statistical and syntactic-dependency-based features that complement the most predictive features proposed in previous work on automatic pitch accent prediction, and show their utility on both read and spontaneous speech. Different accentuation patterns can be associated with the same sentence. Such variability raises the question of how to evaluate pitch accent predictors when more than one pattern is allowed. We carry out a study of prosodic symbol variability on a speech corpus in which different speakers read the same text, and propose an information-theoretic definition of the optionality of symbolic prosodic events that leads to a novel evaluation metric in which prosodic variability is incorporated as a factor affecting prediction accuracy. We additionally propose a method to take advantage of the optionality of prosodic events in unit-selection speech synthesis. To better account for the tight links between the prosodic prominence of a word and the discourse/sentence context, part of this thesis goes beyond the accent/no-accent dichotomy and is devoted to a novel task: the automatic detection of contrast, where contrast is an Information Structure relation that ties two words that explicitly contrast with each other.
This task is mainly motivated by the fact that contrastive words tend to be prosodically marked with particularly prominent pitch accents. Contrastive word pairs are identified by combining lexical information, syntactic information (which mainly aims to identify the syntactic parallelism that often activates contrast) and semantic information (mainly drawn from the WordNet semantic lexicon) within a Support Vector Machine classifier. Once patterns of prosodic prominence have been identified, we propose methods to incorporate this information into TTS synthesis and test its impact on synthetic speech naturalness through large-scale perceptual experiments. The results of these experiments cast some doubt on the utility of a simple accent/no-accent distinction in Hidden Markov Model based speech synthesis, while highlighting the importance of contrastive accents.
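The information-theoretic optionality mentioned above can be sketched minimally as the entropy of accent labels across speakers reading the same word; the thesis's actual definition may differ, so treat this as illustrative:

```python
# Sketch: optionality of a prosodic event as the entropy (in bits) of
# accent/no-accent labels assigned by different speakers to the same word.
import math
from collections import Counter

def optionality(labels):
    """labels: accent decisions (1 = pitch accented, 0 = unaccented) from
    several speakers reading the same word. 0.0 means all speakers agree
    (the event is obligatory or forbidden); 1.0 means maximal optionality."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

obligatory = optionality([1, 1, 1, 1, 1, 1])  # all six speakers accent
optional = optionality([1, 1, 1, 0, 0, 0])    # an even split
```

Under such a metric, a predictor would be penalised less for missing an accent on a high-optionality word than on an obligatory one.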
7

Visual speech synthesis by learning joint probabilistic models of audio and video

Deena, Salil Prashant January 2012 (has links)
Visual speech synthesis deals with synthesising facial animation from an audio representation of speech. In the last decade or so, data-driven approaches have gained prominence with the development of Machine Learning techniques that can learn an audio-visual mapping. Many of these Machine Learning approaches learn a generative model of speech production using the framework of probabilistic graphical models, through which efficient inference algorithms can be developed for synthesis. In this work, the audio and visual parameters are assumed to be generated from an underlying latent space that captures the shared information between the two modalities. These latent points evolve through time according to a dynamical mapping, and there are mappings from the latent points to the audio and visual spaces respectively. The mappings are modelled using Gaussian processes, which are non-parametric models that can represent a distribution over non-linear functions. The result is a non-linear state-space model. It turns out that the state-space model is not a very accurate generative model of speech production because it assumes a single dynamical model, whereas it is well known that speech involves multiple dynamics (e.g. different syllables) that are generally non-linear. To cater for this, the state-space model can be augmented with switching states to represent the multiple dynamics, giving a switching state-space model. A key problem is how to infer the switching states so as to model the multiple non-linear dynamics of speech, which we address by learning a variable-order Markov model on a discrete representation of audio speech. Various synthesis methods for predicting visual from audio speech are proposed for both the state-space and switching state-space models.
Quantitative evaluation, involving the use of error and correlation metrics between ground truth and synthetic features, is used to evaluate our proposed method in comparison to other probabilistic models previously applied to the problem. Furthermore, qualitative evaluation with human participants has been conducted to evaluate the realism, perceptual characteristics and intelligibility of the synthesised animations. The results are encouraging and demonstrate that by having a joint probabilistic model of audio and visual speech that caters for the non-linearities in audio-visual mapping, realistic visual speech can be synthesised from audio speech.
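The variable-order Markov model mentioned above for inferring switching states can be sketched as a context-count table with backoff from the longest observed context to shorter ones; the class name, toy alphabet and order are illustrative assumptions rather than the thesis's model:

```python
# Toy variable-order Markov predictor over a discrete symbol stream.
from collections import defaultdict, Counter

class VOMM:
    def __init__(self, max_order=2):
        self.max_order = max_order
        self.counts = defaultdict(Counter)  # context tuple -> next-symbol counts

    def train(self, symbols):
        for i, sym in enumerate(symbols):
            for k in range(self.max_order + 1):
                if i - k < 0:
                    break
                self.counts[tuple(symbols[i - k:i])][sym] += 1

    def predict(self, context):
        """Back off until a context seen in training is found (the empty
        context is always present after training)."""
        ctx = tuple(context[-self.max_order:])
        while ctx not in self.counts:
            ctx = ctx[1:]
        return self.counts[ctx].most_common(1)[0][0]

m = VOMM(max_order=2)
m.train(list("abcabcabd"))
m.predict(list("ab"))  # -> "c": the context ("a","b") was followed by "c" twice
```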
8

Developing an enriched natural language grammar for prosodically-improved concept-to-speech synthesis

Marais, Laurette 04 1900 (has links)
The need for interacting with machines using spoken natural language is growing, along with the expectation that synthetic speech in this context should sound natural. Such interaction includes answering questions, where prosody plays an important role in producing natural English synthetic speech by communicating the information structure of utterances. CCG (Combinatory Categorial Grammar) is a theoretical framework that exploits the observation that, in English, information structure, prosodic structure and syntactic structure are isomorphic. This provides a way to convert a semantic representation of an utterance into a prosodically natural spoken utterance. GF (Grammatical Framework) is a framework for writing grammars, in which abstract tree structures capture the semantic structure and concrete grammars render these structures as linearised strings. This research combines these frameworks to develop a system that converts semantic representations of utterances into linearised strings of natural language that are marked up to inform the prosody-generating component of a speech synthesis system. / Computing / M. Sc. (Computing)
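The final step described above, linearising a semantic form into a string marked up for the prosody component, can be sketched roughly as follows; the dict-based semantic form, the theme/rheme split and the SSML-style tag are illustrative assumptions, not the grammar developed in the thesis:

```python
# Hypothetical sketch: linearise a tiny question-answer semantic form into
# text whose focused (rheme) part carries SSML-style emphasis markup.
def linearise(answer):
    """answer: dict with 'theme' (given material) and 'rheme' (new, focused
    material). The rheme is wrapped in an emphasis tag for the synthesiser."""
    theme = " ".join(answer["theme"])
    rheme = " ".join(answer["rheme"])
    return f'{theme} <emphasis level="strong">{rheme}</emphasis>.'

# Answer to "When is the meeting?" -- the date is the information focus:
sem = {"theme": ["The", "meeting", "is", "on"], "rheme": ["Tuesday"]}
markup = linearise(sem)
# 'The meeting is on <emphasis level="strong">Tuesday</emphasis>.'
```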
9

Synthèse acoustico-visuelle de la parole par sélection d'unités bimodales / Acoustic-Visual Speech Synthesis by Bimodal Unit Selection

Musti, Utpala 21 February 2013 (has links)
This work deals with audio-visual speech synthesis. In the vast literature on this topic, many approaches divide the task into two synthesis problems: acoustic speech synthesis on one hand, and the generation of the corresponding facial animation on the other. This, however, does not guarantee perfectly synchronous and coherent audio-visual speech. To overcome this drawback implicitly, we propose a different approach to acoustic-visual speech synthesis: the selection of naturally synchronous bimodal units. The synthesis is based on the classical unit-selection paradigm. The main idea behind this synthesis technique is to keep the natural association between the acoustic and visual modalities intact. We describe the audio-visual corpus acquisition technique and the database preparation for our system. We present an overview of our system and detail the various aspects of bimodal unit selection that need to be optimised for good synthesis. The main focus of this work is to synthesise the speech dynamics well rather than a comprehensive talking head. We describe the visual target features that we designed, and subsequently present an algorithm for target feature weighting. This algorithm performs target feature weighting and redundant feature elimination iteratively, based on a comparison of the target-cost ranking with a ranking derived from a distance computed from the acoustic and visual speech signals of units in the corpus. Finally, we present the perceptual and subjective evaluation of the final synthesis system. The results show that we have achieved the goal of synthesising the speech dynamics reasonably well.
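The rank-comparison feature weighting described above can be sketched very roughly as follows; the update rule, toy features and distances are illustrative assumptions, not the algorithm developed in the thesis:

```python
# Simplified sketch of iterative target-feature weighting: weights are nudged
# until the target-cost ranking of candidate units agrees with a ranking
# derived from measured acoustic-visual distances.
def rank(scores):
    """Candidate indices sorted by ascending score."""
    return sorted(range(len(scores)), key=lambda i: scores[i])

def reweight(features, av_dist, weights, lr=0.1, steps=50):
    """features[i][j]: mismatch of candidate i on target feature j;
    av_dist[i]: acoustic-visual distance of candidate i (ground truth)."""
    weights = list(weights)  # do not mutate the caller's list
    ideal = rank(av_dist)
    for _ in range(steps):
        costs = [sum(w * f for w, f in zip(weights, feat)) for feat in features]
        if rank(costs) == ideal:
            break  # target-cost ranking matches the acoustic-visual ranking
        # Grow each weight in proportion to how well its feature tracks av_dist
        for j in range(len(weights)):
            corr = sum(features[i][j] * av_dist[i] for i in range(len(av_dist)))
            weights[j] += lr * corr
        total = sum(weights)
        weights = [w / total for w in weights]
    return weights

# Feature 0 tracks the acoustic-visual distance; feature 1 does not.
feats = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]
dists = [0.1, 0.5, 0.9]
w = reweight(feats, dists, [0.2, 0.8])  # weight shifts toward feature 0
```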
