11

Séparation de sources en ligne dans des environnements réverbérants en exploitant la localisation des sources / Online source separation in reverberant environments exploiting known speaker locations

Harris, Jack 12 October 2015 (has links)
This thesis addresses blind source separation techniques using second-order and higher-order statistics in reverberant environments. One goal of the thesis is algorithmic simplicity, with a view to online implementation of the algorithms. The main challenge for blind source separation applications is coping with reverberant acoustic environments; a further complication arises from changes in the acoustic environment when human sources physically move. A new time-domain method using a pair of finite impulse response filters is proposed. This method, based on principal angles, relies on a singular value decomposition. A pair of filters, acting as a beamformer, is estimated so as to cancel one of the sources. An adaptive filtering stage is then used to recover the remaining source, exploiting the output of the beamforming stage as a noise reference. A common approach to blind source separation is to use methods based on higher-order statistics, such as independent component analysis. However, for realistic convolutive audio and speech mixtures, a transformation to the frequency domain is required at each frequency bin. This introduces the permutation problem, inherent to independent component analysis, across all frequencies. Independent vector analysis resolves this issue directly by modelling the dependencies between frequency bins through a prior on the sources. A real-time natural gradient algorithm is also proposed with an alternative source prior. This method exploits the Student's t probability density function, known to be well suited to speech sources because of its heavier distribution tails. The final algorithm is implemented in real time on a Texas Instruments floating-point digital signal processor. Moving sources in reverberant environments cause significant problems for realistic source separation systems because the mixing filters become time-varying. In this context, a method is proposed that jointly uses the cancelling filter pair and independent vector analysis. This approach limits the loss of performance when the sources are moving. The results also show that the average convergence times of the various parameters are reduced. The online methods introduced in the thesis are tested using impulse responses measured in reverberant environments. The results show their robustness and excellent performance compared with other classical methods in several experimental situations. / Methods for improving the real-time performance and speed of various source enhancement and separation techniques are considered. Two research themes are covered: first, a method that relies only on second-order statistics to enhance a target source by exploiting video cues; second, a higher-order statistics method, independent vector analysis, implemented in real time on a digital signal processor, where an alternative source prior is shown to improve performance.
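The natural-gradient IVA update described in this abstract can be illustrated with a short sketch. The following is a minimal batch illustration, not the thesis's real-time algorithm: it uses the classical spherically symmetric source prior (the Student's t prior mentioned above would change the score function), and the STFT tensor layout, step size and iteration count are assumptions made for the example.

```python
import numpy as np

def iva_natural_gradient(X, n_iter=200, lr=0.1, eps=1e-8):
    """Batch natural-gradient IVA on an STFT tensor X of shape
    (n_freq, n_chan, n_frames). Returns demixed sources Y and one
    demixing matrix per frequency bin."""
    n_freq, n_chan, n_frames = X.shape
    W = np.stack([np.eye(n_chan, dtype=complex) for _ in range(n_freq)])
    I = np.eye(n_chan)

    for _ in range(n_iter):
        # Demix every frequency bin: Y[f] = W[f] @ X[f]
        Y = np.einsum('fij,fjt->fit', W, X)
        # Multivariate score function: it couples all frequency bins of a
        # source, which is what resolves the per-bin permutation ambiguity.
        r = np.sqrt(np.sum(np.abs(Y) ** 2, axis=0)) + eps   # (n_chan, n_frames)
        phi = Y / r[None, :, :]
        for f in range(n_freq):
            G = I - (phi[f] @ Y[f].conj().T) / n_frames     # natural-gradient direction
            W[f] = W[f] + lr * G @ W[f]
    return np.einsum('fij,fjt->fit', W, X), W
```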
12

Selective attention and speech processing in the cortex

Rajaram, Siddharth 24 September 2015 (has links)
In noisy and complex environments, human listeners must segregate the mixture of sound sources arriving at their ears and selectively attend to a single source, thereby solving a computationally difficult problem called the cocktail party problem. However, the neural mechanisms underlying these computations are still largely a mystery. Oscillatory synchronization of neuronal activity between cortical areas is thought to play a crucial role in facilitating information transmission between spatially separated populations of neurons, enabling the formation of functional networks. In this thesis, we seek to analyze and model the functional neuronal networks underlying attention to speech stimuli and find that the Frontal Eye Fields play a central 'hub' role in the auditory spatial attention network in a cocktail party experiment. We use magnetoencephalography (MEG) to measure neural signals with high temporal precision, while sampling from the whole cortex. However, several methodological issues arise when undertaking functional connectivity analysis with MEG data. Specifically, volume conduction of electrical and magnetic fields in the brain complicates interpretation of results. We compare several approaches through simulations, and analyze the trade-offs among various measures of neural phase-locking in the presence of volume conduction. We use these insights to study functional networks in a cocktail party experiment. We then construct a linear dynamical system model of neural responses to ongoing speech. Using this model, we are able to correctly predict which of two speakers is being attended by a listener. We then apply this model to data from a task where people were attending to stories with synchronous and scrambled videos of the speakers' faces to explore how the presence of visual information modifies the underlying neuronal mechanisms of speech perception. This model allows us to probe neural processes as subjects listen to long stimuli, without the need for a trial-based experimental design. We model the neural activity with latent states, and model the neural noise spectrum and functional connectivity with multivariate autoregressive dynamics, along with impulse responses for external stimulus processing. We also develop a new regularized Expectation-Maximization (EM) algorithm to fit this model to electroencephalography (EEG) data.
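As a rough illustration of the kind of stimulus-driven linear dynamical system described above, the sketch below runs a standard Kalman filter for a model whose latent state is driven by stimulus features and observed through EEG/MEG sensors. The matrices A, B, C, D, Q, R are hypothetical placeholders, not fitted parameters; in the thesis they would be estimated with the regularized EM algorithm, and attention could be decoded by comparing the likelihoods of models tied to each speaker.

```python
import numpy as np

def kalman_filter(y, u, A, B, C, D, Q, R, x0, P0):
    """One-pass Kalman filter for a stimulus-driven linear dynamical system:
        x[t] = A x[t-1] + B u[t] + w,   w ~ N(0, Q)   (latent neural state)
        y[t] = C x[t]   + D u[t] + v,   v ~ N(0, R)   (EEG/MEG observation)
    y: (T, n_obs) observations, u: (T, n_stim) stimulus features.
    Returns the filtered state trajectory and the data log-likelihood
    (up to an additive constant), which could be compared across models."""
    T, n = y.shape[0], A.shape[0]
    x, P = x0, P0
    xs = np.zeros((T, n))
    loglik = 0.0
    for t in range(T):
        # Predict the next latent state from dynamics and stimulus input
        x = A @ x + B @ u[t]
        P = A @ P @ A.T + Q
        # Update with the observation at time t
        innov = y[t] - (C @ x + D @ u[t])
        S = C @ P @ C.T + R                      # innovation covariance
        K = P @ C.T @ np.linalg.inv(S)           # Kalman gain
        x = x + K @ innov
        P = (np.eye(n) - K @ C) @ P
        loglik += -0.5 * (innov @ np.linalg.solve(S, innov)
                          + np.log(np.linalg.det(S)))
        xs[t] = x
    return xs, loglik
```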
13

Audiovisual Integration in Native and Non-native Speech Perception

Harrison, Margaret Elizabeth 28 April 2016 (has links)
No description available.
14

Development of auditory-visual speech perception in young children

Erdener, Vahit Dogu, University of Western Sydney, College of Arts, School of Psychology January 2007 (has links)
Unlike auditory-only speech perception, little is known about the development of auditory-visual speech perception. Recent studies show that pre-linguistic infants perceive auditory-visual speech phonetically in the absence of any phonological experience. In addition, while an increase in visual speech influence over age is observed in English speakers, particularly between six and eight years, this is not the case in Japanese speakers. This thesis aims to investigate the factors that lead to an increase in visual speech influence in English-speaking children aged between 3 and 8 years. The general hypothesis of this thesis is that age-related, language-specific factors will be related to auditory-visual speech perception. Three experiments were conducted. Results show that in linguistically challenging periods, such as school onset and reading acquisition, there is a strong link between auditory-visual and language-specific speech perception, and that this link appears to help children cope with new linguistic challenges. However, this link does not seem to be present in adults or preschool children, for whom auditory-visual speech perception is predictable from auditory speech perception ability alone. Implications of these results in relation to existing models of auditory-visual speech perception and directions for future studies are discussed. / Doctor of Philosophy (PhD)
15

Video Analysis of Mouth Movement Using Motion Templates for Computer-based Lip-Reading

Yau, Wai Chee, waichee@ieee.org January 2008 (has links)
This thesis presents a novel lip-reading approach to classifying utterances from video data, without evaluating voice signals. This work addresses two important issues: the efficient representation of mouth movement for visual speech recognition, and the temporal segmentation of utterances from video. The first part of the thesis describes a robust movement-based technique used to identify mouth movement patterns while uttering phonemes. This method temporally integrates the video data of each phoneme into a 2-D grayscale image called a motion template (MT). This is a view-based approach that implicitly encodes the temporal component of an image sequence into a scalar-valued MT. The data size was reduced by extracting image descriptors such as Zernike moments (ZM) and discrete cosine transform (DCT) coefficients from the MT. Support vector machines (SVM) and hidden Markov models (HMM) were used to classify the feature descriptors. A video speech corpus of 2800 utterances was collected for evaluating the efficacy of MT for lip-reading. The experimental results demonstrate the promising performance of MT in mouth movement representation. The advantages and limitations of MT for visual speech recognition were identified and validated through experiments. A comparison between ZM and DCT features indicates that the classification accuracy of the two methods is very comparable when there is no relative motion between the camera and the mouth. Nevertheless, ZM features are resilient to camera rotation and continue to give good results despite rotation, whereas DCT features are sensitive to rotation. DCT features are demonstrated to have better tolerance to image noise than ZM. The results also demonstrate a slight improvement of 5% using SVM as compared to HMM. The second part of this thesis describes a video-based, temporal segmentation framework to detect key frames corresponding to the start and stop of utterances from an image sequence, without using the acoustic signals. This segmentation technique integrates mouth movement and appearance information. The efficacy of this technique was tested through experimental evaluation and satisfactory performance was achieved. This segmentation method has been demonstrated to perform efficiently for utterances separated by short pauses. Potential applications for lip-reading technologies include human-computer interfaces (HCI) for mobility-impaired users, defense applications that require voiceless communication, lip-reading mobile phones, in-vehicle systems, and improvement of speech-based computer control in noisy environments.
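A minimal sketch of how a motion template might be computed from a mouth-region image sequence is shown below; the frame-differencing threshold and the normalization are illustrative assumptions rather than the exact procedure used in the thesis.

```python
import numpy as np

def motion_template(frames, threshold=25):
    """Collapse a grayscale frame sequence of the mouth region into a single
    motion-template image: pixels that moved recently are bright, earlier
    motion is darker, and static pixels stay black.
    frames: array-like of (H, W) uint8 grayscale images of one utterance."""
    frames = np.asarray(frames, dtype=np.float32)
    n = len(frames)
    mhi = np.zeros_like(frames[0])
    for t in range(1, n):
        # Simple frame differencing to detect moving pixels
        moving = np.abs(frames[t] - frames[t - 1]) > threshold
        # Stamp moving pixels with the current time index
        mhi = np.where(moving, float(t), mhi)
    # Normalise to a 0-255 grayscale image (brighter = more recent motion)
    return (255.0 * mhi / max(n - 1, 1)).astype(np.uint8)
```

Descriptors such as Zernike moments or DCT coefficients would then be extracted from this single image and passed to an SVM or HMM classifier, as described in the abstract.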
16

Visual speech synthesis by learning joint probabilistic models of audio and video

Deena, Salil Prashant January 2012 (has links)
Visual speech synthesis deals with synthesising facial animation from an audio representation of speech. In the last decade or so, data-driven approaches have gained prominence with the development of Machine Learning techniques that can learn an audio-visual mapping. Many of these Machine Learning approaches learn a generative model of speech production using the framework of probabilistic graphical models, through which efficient inference algorithms can be developed for synthesis. In this work, the audio and visual parameters are assumed to be generated from an underlying latent space that captures the shared information between the two modalities. These latent points evolve through time according to a dynamical mapping, and there are mappings from the latent points to the audio and visual spaces respectively. The mappings are modelled using Gaussian processes, which are non-parametric models that can represent a distribution over non-linear functions. The result is a non-linear state-space model. It turns out that the state-space model is not a very accurate generative model of speech production because it assumes a single dynamical model, whereas it is well known that speech involves multiple dynamics (e.g. different syllables) that are generally non-linear. In order to cater for this, the state-space model can be augmented with switching states to represent the multiple dynamics, thus giving a switching state-space model. A key problem is how to infer the switching states so as to model the multiple non-linear dynamics of speech, which we address by learning a variable-order Markov model on a discrete representation of audio speech. Various synthesis methods for predicting visual from audio speech are proposed for both the state-space and switching state-space models. Quantitative evaluation, involving the use of error and correlation metrics between ground truth and synthetic features, is used to evaluate our proposed method in comparison to other probabilistic models previously applied to the problem. Furthermore, qualitative evaluation with human participants has been conducted to evaluate the realism, perceptual characteristics and intelligibility of the synthesised animations. The results are encouraging and demonstrate that by having a joint probabilistic model of audio and visual speech that caters for the non-linearities in audio-visual mapping, realistic visual speech can be synthesised from audio speech.
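The switching state-space structure described above can be sketched generatively. The example below samples audio and visual feature streams from a shared latent state whose dynamics switch between regimes; linear output maps stand in for the Gaussian-process mappings of the thesis, and all matrices, dimensions and noise levels are placeholder assumptions chosen only to show the model structure.

```python
import numpy as np

def sample_switching_ssm(T, A_list, C_audio, C_video, pi, trans,
                         q_std=0.1, r_std=0.05, seed=0):
    """Sample from a simplified switching state-space model of audio-visual
    speech. A discrete switch s[t] picks one of several latent dynamics
    (standing in for different speech regimes), and a shared latent state
    x[t] is mapped to audio and visual feature vectors.
    A_list: list of (d, d) dynamics matrices, one per regime.
    C_audio, C_video: output matrices mapping the latent state to features.
    pi: initial regime probabilities; trans: regime transition matrix."""
    rng = np.random.default_rng(seed)
    d = A_list[0].shape[0]
    s = rng.choice(len(A_list), p=pi)
    x = np.zeros(d)
    audio, video, switches = [], [], []
    for _ in range(T):
        s = rng.choice(len(A_list), p=trans[s])               # switch dynamics
        x = A_list[s] @ x + q_std * rng.standard_normal(d)    # shared latent state
        audio.append(C_audio @ x + r_std * rng.standard_normal(C_audio.shape[0]))
        video.append(C_video @ x + r_std * rng.standard_normal(C_video.shape[0]))
        switches.append(s)
    return np.array(audio), np.array(video), np.array(switches)
```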
17

A facial animation model for expressive audio-visual speech

Somasundaram, Arunachalam 21 September 2006 (has links)
No description available.
18

A cor e o figurino na construção de personagens na narrativa televisual: um estudo de caso da minissérie Capitu / The color and the costumes in building characters in televisual narrative: a case study of the miniseries Capitu

Leite, Rafaela Bernardazzi Torrens 22 September 2015 (has links)
This research analyzes the production of meaning in the miniseries Capitu (Globo, 2008), based on the use of color and costume in the characterization of the characters Bento, Capitu and Escobar. The work is guided by the idea that both color and costume are fundamental elements for structuring the narrative and for building the visual discourse. Through the study of these elements, it explores the construction of the televisual narrative, considering the inseparable relationship between form and content (BAKHTIN, 2010), in order to understand the visual construction of the three characters mentioned above. The possible correlations between the costumes of the miniseries and the fashion of the period in which Machado de Assis's novel is set, the mid-nineteenth century, are also analyzed. The theoretical and methodological framework draws on the language studies of Bakhtin (2002, 2008, 2010) and Bakhtin/Volochinov (2009) and the studies of audiovisual language by Aumont (1993), Aumont and Marie (2003) and Brown (2012); Pallottini (2012) and Mungioli (2013) provide elements for contextualizing and discussing Brazilian miniseries. In addition, color theory is approached through Guimarães (2000), Pedrosa (2010) and Bastos, Farina and Perez (2011), and the foundations of clothing and costume through Köhler (2009), Chataignier (2010), Lipovetsky (2009) and Leite and Guerra (2002). Given the essential correlation between form and content that guided the research and analysis, the methodology was designed around the study of verbo-visual language, highlighting the visuality of clothing and color in the construction of meaning. The characteristics of the images involved in this research could not be approached within rigid limits, since the work's experimentation, both in its audiovisual language and in its narrative construction, would not fit closed patterns of observation and analysis. The study therefore explores the miniseries without restricting it to a purely technical analysis, maintaining a constant dialogue with the meaning produced by the production of the audiovisual work. The analysis allows color to be considered a diegetic element constitutive of a televisual poetics and of the compositional fabric of the characters analyzed. Throughout the miniseries there are important changes in visual composition; the nuances of the protagonists' characterization serve not only a plastic, visual function but also a narrative function.
19

Synthèse acoustico-visuelle de la parole par sélection d'unités bimodales / Acoustic-Visual Speech Synthesis by Bimodal Unit Selection

Musti, Utpala 21 February 2013 (has links)
This work deals with audio-visual speech synthesis. In the vast literature in this area, many approaches divide the task into two synthesis problems: one is acoustic speech synthesis, and the other is the generation of the corresponding facial animation. However, this does not guarantee perfectly synchronous and coherent audio-visual speech. To overcome this drawback implicitly, we propose a different approach to acoustic-visual speech synthesis based on the selection of naturally synchronous bimodal units. The synthesis is based on the classical unit selection paradigm. The main idea behind this synthesis technique is to keep the natural association between the acoustic and visual modalities intact. We describe the audio-visual corpus acquisition technique and the database preparation for our system. We present an overview of our system and detail the various aspects of bimodal unit selection that need to be optimized for good synthesis. The main focus of this work is to synthesize the speech dynamics well rather than a comprehensive talking head. We describe the visual target features that we designed and subsequently present an algorithm for target feature weighting. The algorithm iteratively performs target feature weighting and redundant feature elimination, based on comparing the target-cost ranking with a distance computed from the acoustic and visual speech signals of the units in the corpus. Finally, we present the perceptual and subjective evaluation of the final synthesis system. The results show that we have achieved the goal of synthesizing the speech dynamics reasonably well.
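The classical unit-selection search that this system builds on can be sketched as a Viterbi search over candidate bimodal units. In the sketch below, the cost functions, target features and candidate structure are placeholders; in the thesis, the target-cost weights are learned by the iterative weighting algorithm described above.

```python
import numpy as np

def select_units(targets, candidates, target_cost, concat_cost):
    """Viterbi search over candidate bimodal (audio + video) units.
    targets: list of target feature vectors, one per position.
    candidates: list (per position) of lists of candidate units.
    target_cost(t, u): mismatch between target features t and unit u.
    concat_cost(u_prev, u): audio-visual join cost between adjacent units.
    Returns the unit sequence with minimum total cost."""
    n = len(targets)
    # cost[i][j]: best cumulative cost ending with candidate j at position i
    cost = [np.array([target_cost(targets[0], u) for u in candidates[0]])]
    back = [None]
    for i in range(1, n):
        tc = np.array([target_cost(targets[i], u) for u in candidates[i]])
        cc = np.array([[concat_cost(up, u) for u in candidates[i]]
                       for up in candidates[i - 1]])
        total = cost[-1][:, None] + cc + tc[None, :]
        back.append(np.argmin(total, axis=0))
        cost.append(np.min(total, axis=0))
    # Trace back the optimal path
    j = int(np.argmin(cost[-1]))
    path = [j]
    for i in range(n - 1, 0, -1):
        j = int(back[i][j])
        path.append(j)
    path.reverse()
    return [candidates[i][path[i]] for i in range(n)]
```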
20

The Effect of Static and Dynamic Visual Gestures on Stuttering Inhibition

Guntupalli, Vijaya K., Nanjundeswaran (Guntupalli), Chaya D., Kalinowski, Joseph, Dayalu, Vikram N. 29 March 2011 (has links)
The aim of the study was to evaluate the role of steady-state and dynamic visual gestures of vowels in stuttering inhibition. Eight adults who stuttered recited sentences from memory while watching video presentations of the following visual speech gestures: (a) a steady-state /u/, (b) dynamic production of /a-i-u/, (c) steady-state /u/ with an accompanying audible 1 kHz pure tone, and (d) dynamic production of /a-i-u/ with an accompanying audible 1 kHz pure tone. A 1 kHz pure tone and a no-external-signal condition served as control conditions. Results revealed a significant main effect of auditory condition on stuttering frequency. Relative to the no-external-signal condition, the combined visual plus pure tone conditions resulted in a statistically significant reduction in stuttering frequency. In addition, a significant difference in stuttering frequency was also observed when the visual plus pure tone conditions were compared to the visual-only conditions. However, no significant differences were observed between the no-external-signal condition and the visual-only conditions, or between the no-external-signal condition and the pure tone condition. These findings contrast with previous findings in which similar vowel gestures presented via the auditory modality resulted in high levels of stuttering inhibition. The differential role of sensory modalities in speech perception and production, as well as their individual capacities to transfer gestural information for the purposes of stuttering inhibition, is discussed.
