1 |
Visual prosody in speech-driven facial animation: elicitation, prediction, and perceptual evaluation
Zavala Chmelicka, Marco Enrique, 29 August 2005
Facial animations capable of articulating accurate movements in synchrony with a
speech track have become a subject of much research during the past decade. Most of
these efforts have focused on articulation of lip and tongue movements, since these are
the primary sources of information in speech reading. However, a wealth of
paralinguistic information is implicitly conveyed through visual prosody (e.g., head and
eyebrow movements). In contrast with lip and tongue movements, for which the
articulation rules are fairly well understood (i.e., viseme-phoneme mappings and
coarticulation), little is known about the generation of visual prosody.
The objective of this thesis is to explore the perceptual contributions of visual prosody in
speech-driven facial avatars. Our main hypothesis is that visual prosody driven by
acoustics of the speech signal, as opposed to random or no visual prosody, results in
more realistic, coherent and convincing facial animations. To test this hypothesis, we
have developed an audio-visual system capable of capturing synchronized speech and
facial motion from a speaker using infrared illumination and retro-reflective markers. In
order to elicit natural visual prosody, a story-telling experiment was designed in which
the actors were shown a short cartoon video, and subsequently asked to narrate the
episode. From this audio-visual data, four different facial animations were generated,
articulating no visual prosody, Perlin-noise, speech-driven movements, and ground truth
movements. Speech-driven movements were driven by acoustic features of the speech
signal (e.g., fundamental frequency and energy) using rule-based heuristics and
autoregressive models. A pair-wise perceptual evaluation shows that subjects can clearly
discriminate among the four visual prosody animations. It also shows that speech-driven
movements and Perlin-noise, in that order, approach the performance of veridical
motion. The results are quite promising and suggest that speech-driven motion could
outperform Perlin-noise if more powerful motion prediction models are used. Our
results also show that exaggeration can bias viewers to perceive a
computer-generated character's motion as more realistic.
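The speech-driven prediction described above, mapping acoustic features such as fundamental frequency and energy to head motion through an autoregressive model, can be sketched as follows. This is a hedged illustration, not the thesis's actual model: the function names, the linear least-squares fit, and the single head-pitch output channel are all illustrative assumptions.

```python
import numpy as np

def fit_ar_prosody(f0, energy, head_pitch, order=2):
    """Fit a linear autoregressive model: head pitch at frame t is
    predicted from the previous `order` pitch values plus the current
    F0 and energy of the speech signal (ordinary least squares)."""
    T = len(head_pitch)
    rows, targets = [], []
    for t in range(order, T):
        # Features: past head-pitch values, current F0 and energy, bias term.
        rows.append(np.concatenate([head_pitch[t - order:t],
                                    [f0[t], energy[t], 1.0]]))
        targets.append(head_pitch[t])
    X, y = np.asarray(rows), np.asarray(targets)
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def predict_step(coeffs, past_pitch, f0_t, energy_t):
    """One-step head-pitch prediction from the fitted AR coefficients."""
    x = np.concatenate([past_pitch, [f0_t, energy_t, 1.0]])
    return float(x @ coeffs)
```

In practice the thesis combines such models with rule-based heuristics, and a real system would predict several motion channels (head rotations, eyebrow raises) rather than a single pitch angle.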
|
2 |
Visual speech synthesis by learning joint probabilistic models of audio and video
Deena, Salil Prashant, January 2012
Visual speech synthesis deals with synthesising facial animation from an audio representation of speech. In the last decade or so, data-driven approaches have gained prominence with the development of machine learning techniques that can learn an audio-visual mapping. Many of these approaches learn a generative model of speech production within the framework of probabilistic graphical models, through which efficient inference algorithms can be developed for synthesis. In this work, the audio and visual parameters are assumed to be generated from an underlying latent space that captures the shared information between the two modalities. These latent points evolve through time according to a dynamical mapping, and there are mappings from the latent points to the audio and visual spaces respectively. The mappings are modelled using Gaussian processes, which are non-parametric models that can represent a distribution over non-linear functions. The result is a non-linear state-space model. This state-space model is not, however, a very accurate generative model of speech production, because it assumes a single dynamical model, whereas it is well known that speech involves multiple dynamics (e.g., different syllables) that are generally non-linear. To cater for this, the state-space model can be augmented with switching states to represent the multiple dynamics, giving a switching state-space model. A key problem is how to infer the switching states so as to model the multiple non-linear dynamics of speech, which we address by learning a variable-order Markov model on a discrete representation of audio speech. Various synthesis methods for predicting visual from audio speech are proposed for both the state-space and switching state-space models.
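The back-off idea behind a variable-order Markov model on a discrete audio representation can be sketched with a minimal count-based version. This is a hedged illustration of the general technique, not the thesis's implementation; the class name and the maximum-likelihood prediction rule are illustrative assumptions.

```python
from collections import defaultdict

class VariableOrderMarkov:
    """Minimal variable-order Markov model: stores next-symbol counts
    for every context of length 0..max_order seen in training, and
    predicts by backing off to the longest matching context."""

    def __init__(self, max_order=3):
        self.max_order = max_order
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, seq):
        for t in range(len(seq)):
            for k in range(self.max_order + 1):
                if t - k < 0:
                    break
                ctx = tuple(seq[t - k:t])       # context of length k
                self.counts[ctx][seq[t]] += 1   # count the next symbol

    def predict(self, history):
        # Back off from the longest usable context to the empty context.
        for k in range(min(self.max_order, len(history)), -1, -1):
            ctx = tuple(history[len(history) - k:])
            if ctx in self.counts:
                nxt = self.counts[ctx]
                return max(nxt, key=nxt.get)
        return None
```

In the thesis's setting the discrete symbols would be quantised audio-feature labels, and the inferred context structure would drive the choice of switching state rather than a direct next-symbol prediction.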
Quantitative evaluation, involving the use of error and correlation metrics between ground truth and synthetic features, is used to evaluate our proposed method in comparison to other probabilistic models previously applied to the problem. Furthermore, qualitative evaluation with human participants has been conducted to evaluate the realism, perceptual characteristics and intelligibility of the synthesised animations. The results are encouraging and demonstrate that by having a joint probabilistic model of audio and visual speech that caters for the non-linearities in audio-visual mapping, realistic visual speech can be synthesised from audio speech.
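The error and correlation metrics mentioned here might be computed along the following lines; this is a minimal sketch over one feature trajectory, and the function names and the choice of RMSE and Pearson correlation as the specific metrics are assumptions for illustration.

```python
import numpy as np

def rmse(truth, synth):
    """Root-mean-square error between ground-truth and synthesised features."""
    truth = np.asarray(truth, dtype=float)
    synth = np.asarray(synth, dtype=float)
    return float(np.sqrt(np.mean((truth - synth) ** 2)))

def correlation(truth, synth):
    """Pearson correlation between ground-truth and synthesised features."""
    truth = np.asarray(truth, dtype=float)
    synth = np.asarray(synth, dtype=float)
    return float(np.corrcoef(truth, synth)[0, 1])
```

For multi-dimensional visual parameters, such metrics would typically be computed per dimension and averaged, complementing the qualitative judgments from human participants.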
|