1 |
Visual language discrimination / Weikum, Whitney Marie, 05 1900
Recognizing and learning one’s native language requires knowledge of the phonetic and rhythmical characteristics of the language. Few studies address the rich source of language information available in a speaker’s face. Visual speech alone permits language discrimination in adults (Soto-Faraco et al., 2007). This thesis tested infants and adults on their ability to use only the information available in a speaker’s face to discriminate rhythmically dissimilar languages.
Monolingual English infants discriminated French and English using only visual speech at 4 and 6 months old, but failed this task at 8 months old. To test the role of language experience, bilingual (English/French) 6- and 8-month-old infants were tested and successfully discriminated the languages. An optimal period for sensitivity to the visual language information necessary for discriminating languages may exist in early life.
To confirm an optimal period, adults who had acquired English as a second language were tested. If English was learned before age 6 years, adults discriminated English and French, but if English was learned after age 6, adults performed at chance. Experience with visual speech information in early childhood influences adult performance.
To better understand the developmental trajectory of visual language discrimination, visual correlates of phonetic segments and rhythmical information were examined. When clips were manipulated to remove rhythmical information, infants used segmental visual phonetic cues to discriminate the languages at 4, but not 8, months old. This suggests that a decline in non-native visual phonetic discrimination (similar to the decline seen for non-native auditory phonetic information; Werker & Tees, 1984) may impair language discrimination at 8 months.
Infants as young as newborns use rhythmical auditory information to discriminate languages presented forward, but not backward (Mehler et al., 1988). This thesis showed that both 4- and 8-month-old infants could discriminate French from English when shown reversed language clips. Unlike auditory speech, reversed visual speech must conserve cues that permit language discrimination.
Infants’ abilities to distinguish languages using visual speech parallel auditory speech findings, but also diverge to highlight unique characteristics of visual speech. Together, these studies further enrich our understanding of how infants come to recognize and learn their native language(s). / Medicine, Faculty of / Graduate
|
2 |
A novel lip geometry approach for audio-visual speech recognition / Ibrahim, Zamri, January 2014
By identifying lip movements and characterizing their associations with speech sounds, the performance of speech recognition systems can be improved, particularly when operating in noisy environments. Various methods have been studied by research groups around the world in recent years to incorporate lip movements into speech recognition; however, exactly how best to incorporate the additional visual information is still not known. This study aims to extend knowledge of the relationships between visual and speech information, specifically using lip geometry information because of its robustness to head rotation and the smaller number of features required to represent movement. A new method has been developed to extract lip geometry information, to perform classification and to integrate the visual and speech modalities. This thesis makes several contributions. First, this work presents a new method to extract lip geometry features using the combination of a skin colour filter, a border following algorithm and a convex hull approach. The proposed method was found to improve lip shape extraction performance compared to existing approaches. Lip geometry features including height, width, ratio, area, perimeter and various combinations of these features were evaluated to determine which perform best when representing speech in the visual domain. Second, a novel template matching technique able to adapt to dynamic differences in the way words are uttered by speakers has been developed; it determines the best fit of an unseen feature signal to those stored in a database template. Third, following an evaluation of integration strategies, a novel method has been developed based on an alternative decision fusion strategy, in which the outcome from the visual and speech modalities is chosen by measuring the quality of the audio based on kurtosis and skewness analysis and driven by white noise confusion. Finally, the performance of the new methods introduced in this work is evaluated using the CUAVE and LUNA-V data corpora under a range of different signal-to-noise ratio conditions using the NOISEX-92 dataset.
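For illustration, the sketch below strings together the ingredients named in the first contribution (a skin colour filter, border following via cv2.findContours, a convex hull, then simple geometry features); the YCrCb colour range, the morphological clean-up and the feature set are assumptions made for the sketch, not the method published in the thesis.

```python
# Illustrative lip-geometry pipeline: colour filter -> border following ->
# convex hull -> height/width/ratio/area/perimeter features. The YCrCb range
# and morphology step are guesses, not the thesis's parameters.
import cv2
import numpy as np

def lip_geometry_features(frame_bgr):
    # Rough lip-colour filter in YCrCb space (range is an illustrative guess).
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    mask = cv2.inRange(ycrcb, (0, 150, 75), (255, 200, 135))
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, np.ones((5, 5), np.uint8))

    # Border following: keep the largest external contour as the lip region.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    lip = max(contours, key=cv2.contourArea)

    # Convex hull smooths gaps left by the colour filter.
    hull = cv2.convexHull(lip)

    # Simple geometry features of the hull.
    x, y, w, h = cv2.boundingRect(hull)
    return {
        "height": h,
        "width": w,
        "ratio": h / float(w),
        "area": cv2.contourArea(hull),
        "perimeter": cv2.arcLength(hull, True),
    }
```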
|
3 |
A Multimodal Sensor Fusion Architecture for Audio-Visual Speech Recognition / Makkook, Mustapha, January 2007
A key requirement for developing any innovative system in a computing environment is to provide an interface that is sufficiently friendly to the average end user. Accurate design of such a user-centered interface, however, means more than just the ergonomics of the panels and displays. It also requires that designers precisely define what information to use and how, where, and when to use it. Recent advances in user-centered design of computing systems have suggested that multimodal integration can provide different types and levels of intelligence to the user interface. The work of this thesis aims at improving speech recognition-based interfaces by making use of the visual modality conveyed by the movements of the lips.
Designing a good visual front end is a major part of this framework. For this purpose, this work derives the optical flow fields for consecutive frames of people speaking. Independent Component Analysis (ICA) is then used to derive basis flow fields, and the coefficients of these basis fields comprise the visual features of interest. It is shown that using ICA on optical flow fields yields better classification results than traditional approaches based on Principal Component Analysis (PCA), because ICA can capture the higher-order statistics needed to understand the motion of the mouth. Lip movement is complex in nature: it involves large image velocities, self-occlusion (due to the appearance and disappearance of the teeth) and considerable non-rigidity.
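As an illustration of this feature-extraction idea, the sketch below pairs OpenCV's Farneback optical flow with scikit-learn's FastICA to obtain basis flow fields and per-frame coefficients; both library choices and the parameter values are assumptions rather than the implementation used in the thesis.

```python
# Sketch: ICA over optical flow fields of the mouth region (illustrative only).
# Farneback flow and scikit-learn's FastICA are assumed stand-ins for the
# thesis's own flow and ICA implementations.
import cv2
import numpy as np
from sklearn.decomposition import FastICA

def ica_flow_features(gray_frames, n_components=16):
    # Dense optical flow between consecutive frames, flattened to one vector
    # per frame pair (assumes all frames share the same size and dtype uint8).
    flows = []
    for prev, nxt in zip(gray_frames[:-1], gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow.reshape(-1))
    X = np.stack(flows)                          # (n_pairs, H*W*2)

    # ICA: the mixing-matrix columns are basis flow fields, and the recovered
    # sources are the per-frame coefficients used as visual features.
    # Requires n_components <= number of frame pairs.
    ica = FastICA(n_components=n_components, random_state=0)
    coefficients = ica.fit_transform(X)          # (n_pairs, n_components)
    basis_flow_fields = ica.mixing_.T            # (n_components, H*W*2)
    return coefficients, basis_flow_fields
```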
Another issue of great interest to designers of audio-visual speech recognition systems is the integration (fusion) of the audio and visual information into an automatic speech recognizer. For this purpose, a reliability-driven sensor fusion scheme is developed, with a statistical approach that accounts for dynamic changes in reliability. This is done in two steps. The first step derives suitable statistical reliability measures for the individual information streams, based on the dispersion of the N-best hypotheses of the individual stream classifiers. The second step finds an optimal mapping between the reliability measures and the stream weights that maximizes the conditional likelihood; genetic algorithms are used for this purpose.
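A toy version of this two-step fusion is sketched below: the dispersion of each stream's N-best scores acts as a reliability measure, and a simple sigmoid maps reliability to a stream weight. The sigmoid and its fixed parameters stand in for the genetic-algorithm-optimised mapping described above, so this is a hedged illustration rather than the thesis's scheme.

```python
# Toy reliability-driven fusion: N-best dispersion -> stream weight -> weighted
# combination of per-word scores. The sigmoid mapping replaces the GA-optimised
# mapping of the thesis and its parameters are arbitrary.
import numpy as np

def nbest_dispersion(nbest_scores):
    # A reliable stream separates its best hypothesis from the runners-up;
    # dispersion here is the mean gap between the top score and the rest.
    s = np.sort(np.asarray(nbest_scores, dtype=float))[::-1]
    return float(np.mean(s[0] - s[1:]))

def stream_weight(dispersion, a=1.0, b=0.0):
    # Map reliability to (0, 1); a and b would normally be optimised.
    return 1.0 / (1.0 + np.exp(-(a * dispersion + b)))

def fuse(audio_scores, visual_scores, audio_nbest, visual_nbest):
    # audio_scores / visual_scores: dicts mapping each candidate word to its
    # log-likelihood under the audio-only and visual-only classifiers.
    wa = stream_weight(nbest_dispersion(audio_nbest))
    wv = stream_weight(nbest_dispersion(visual_nbest))
    wa, wv = wa / (wa + wv), wv / (wa + wv)      # normalise the two weights
    return {w: wa * audio_scores[w] + wv * visual_scores[w] for w in audio_scores}
```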
The issues addressed are challenging problems, and they are central to developing an audio-visual speech recognition framework that can maximize the information gathered about the words uttered and minimize the impact of noise.
|
4 |
Refinement and Normalisation of the University of Canterbury Auditory-Visual Matrix Sentence Test / McClelland, Amber, January 2015
Developed by O'Beirne and Trounson (Trounson, 2012), the UC Auditory-Visual Matrix Sentence Test (UCAMST) is an auditory-visual speech test in NZ English where sentences are assembled from 50 words arranged into 5 columns (name, verb, quantity, adjective, object). Generation of sentence materials involved cutting and re-assembling 100 naturally spoken “original” sentences to create a large repertoire of 100,000 unique “synthesised” sentences.
The process of synthesising sentences from video fragments resulted in occasional artifactual image jerks (“judders”), quantified by an unusually large change in the “pixel difference value” of consecutive frames, at the edited transitions between video fragments. To preserve the naturalness of materials, Study 1 aimed to select transitions with the least “noticeable” judders.
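For concreteness, the sketch below shows how a “pixel difference value” per transition can be computed and screened for outliers; the median-plus-MAD rule is an assumed illustration, not the criterion used in the study.

```python
# Sketch of the "pixel difference value" screen for judders: the mean absolute
# difference between consecutive greyscale frames spikes at a jerky edit point.
# The median + k*MAD outlier rule is an assumption, not the study's criterion.
import numpy as np

def pixel_difference_values(gray_frames):
    frames = np.asarray(gray_frames, dtype=np.float32)        # (N, H, W)
    return np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))  # one value per transition

def flag_judders(diff_values, k=4.0):
    med = np.median(diff_values)
    mad = np.median(np.abs(diff_values - med)) + 1e-9
    return np.where(diff_values > med + k * mad)[0]            # indices of suspect transitions
```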
Normal-hearing participants (n = 18) assigned a 10-point noticeability rating score to 100 sentences comprising unedited “no judder” sentences (n = 28) and “synthesised” sentences (n = 72) that varied in the severity (i.e. pixel difference value), number, and position of judders. The judders were found to be significantly noticeable compared to the no-judder controls, and based on mean rating score, 2,494 sentences with “minimal noticeable judder” were included in the auditory-visual UCAMST. Follow-on work should establish equivalent lists using these sentences. The average pixel difference value was found to be a significant predictor of rating score, and may therefore be used as a guide in the future development of auditory-visual speech tests assembled from video fragments.
The aim of Study 2 was to normalise the auditory-alone UCAMST to make each audio fragment equally intelligible in noise. In Part I, individuals with normal hearing (n = 17) assessed 400 sentences containing each file fragment presented at four different SNRs (-18.5, -15, -11.5, and -8 dB) in both constant speech-shaped noise (n = 9) and six-talker babble (n = 8). An intelligibility function was fitted to word-specific data, and the midpoint (Lmid, intelligibility at 50%) of each function was adjusted to equal the mean pre-normalisation midpoint across fragments. In Part II, 30 lists of 20 sentences were generated with relatively homogeneous frequency of matrix word use. The predicted parameters in constant noise (Lmid = 14.0 dB SNR; slope = 13.9%/dB ± 0.0%/dB) are comparable with published equivalents. The babble noise condition was, conversely, less sensitive (Lmid = 14.9 dB SNR; slope = 10.3%/dB ± 0.1%/dB), possibly due to a smaller sample size (n = 8). Overall, this research constituted an important first step in establishing the UCAMST as a reliable measure of speech recognition; follow-on work will validate the normalisation procedure carried out in this project.
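The word-specific intelligibility fit described above can be pictured as a two-parameter logistic over SNR with a midpoint Lmid (the 50% point) and a slope in %/dB; the sketch below uses scipy's curve_fit and invented proportions purely to show the shape of the calculation, not the study's data or exact parameterisation.

```python
# Illustrative fit of an intelligibility function: a two-parameter logistic over
# SNR with midpoint l_mid (50% intelligibility) and slope s in %/dB at l_mid.
# The proportions below are invented for illustration, not data from Study 2.
import numpy as np
from scipy.optimize import curve_fit

def logistic(snr, l_mid, s):
    # d(proportion correct)/dSNR at snr = l_mid equals s/100 per dB.
    return 1.0 / (1.0 + np.exp(-4.0 * (s / 100.0) * (snr - l_mid)))

snrs = np.array([-18.5, -15.0, -11.5, -8.0])    # presentation SNRs from Part I
p_correct = np.array([0.12, 0.35, 0.68, 0.90])  # invented example proportions
(l_mid, slope), _ = curve_fit(logistic, snrs, p_correct, p0=(-13.0, 10.0))
print(f"l_mid = {l_mid:.1f} dB SNR, slope = {slope:.1f} %/dB")
```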
|
5 |
The effect of auditory, visual and orthographic information on second language acquisition / Erdener, Vahit Dogu, University of Western Sydney, College of Arts, Education and Social Sciences, School of Psychology, January 2002
The current study investigates the effect of auditory and visual speech information and orthographic information on second/foreign language (L2) acquisition. To test this, native speakers of Turkish (a language with a transparent orthography) and native speakers of Australian English (a language with an opaque orthography) were exposed to Spanish (transparent orthography) and Irish (opaque orthography) legal non-word items in four experimental conditions: auditory-only, auditory-visual, auditory-orthographic, and auditory-visual-orthographic. On each trial, Turkish and Australian English speakers were asked to produce each Spanish and Irish legal non-word. In terms of phoneme errors, it was found that Turkish participants generally made fewer errors in Spanish than their Australian counterparts, and visual speech information generally facilitated performance. Orthographic information had an overriding effect such that there was no visual advantage once it was provided. In the orthographic conditions, Turkish speakers performed better than their Australian English counterparts with Spanish items and worse with Irish items. In terms of native speakers' ratings of participants' productions, it was found that orthographic input improved accent. Overall, the results confirm findings that visual information enhances speech production in L2 and additionally show the facilitative effects of orthographic input in L2 acquisition as a function of orthographic depth. Inter-rater reliability measures revealed that the native speaker rating procedure may be prone to individual and socio-cultural influences that may stem from internal criteria for native accents. This suggests that native speaker ratings should be treated with caution. / Master of Arts (Hons)
|
6 |
Perceptual Evaluation of Video-Realistic Speech / Geiger, Gadi; Ezzat, Tony; Poggio, Tomaso, 28 February 2003
With many visual speech animation techniques now available, there is a clear need for systematic perceptual evaluation schemes. We describe here our scheme and its application to a new video-realistic (potentially indistinguishable from real recorded video) visual-speech animation system, called Mary 101. Two types of experiments were performed: a) distinguishing visually between real and synthetic image-sequences of the same utterances ("Turing tests"), and b) gauging visual speech recognition by comparing lip-reading performance on the real and synthetic image-sequences of the same utterances ("Intelligibility tests"). Subjects who were presented randomly with either real or synthetic image-sequences could not tell the synthetic from the real sequences above chance level. The same subjects, when asked to lip-read the utterances from the same image-sequences, recognized speech from real image-sequences significantly better than from synthetic ones. However, performance for both real and synthetic sequences was at levels suggested in the literature on lip-reading. We conclude from the two experiments that the animation of Mary 101 is adequate for providing the percept of a talking head. However, additional effort is required to improve the animation for lip-reading purposes such as rehabilitation and language learning. In addition, these two tasks can be considered explicit and implicit perceptual discrimination tasks. In the explicit task (a), each stimulus is classified directly as a synthetic or real image-sequence by detecting a possible difference between the synthetic and the real image-sequences. The implicit perceptual discrimination task (b) consists of a comparison between visual recognition of speech from real and synthetic image-sequences. Our results suggest that implicit perceptual discrimination is a more sensitive method for discriminating between synthetic and real image-sequences than explicit perceptual discrimination.
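The "above chance level" claim in the explicit task can be checked with a simple binomial test against p = 0.5 for a two-alternative real/synthetic judgement; the counts in the sketch below are invented for illustration and are not results from this report.

```python
# Minimal chance-level check for a two-alternative real/synthetic judgement:
# a one-sided binomial test against p = 0.5. The counts are invented.
from scipy.stats import binomtest

n_trials, n_correct = 40, 23
result = binomtest(n_correct, n_trials, p=0.5, alternative="greater")
print(f"{n_correct}/{n_trials} correct, one-sided p = {result.pvalue:.3f}")
```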
|
7 |
Kalbančiojo lūpų formos registravimas / Speaker lip shape registration / Kubickas, Egidijus, 16 August 2007
Kubickas E., Speaker lip shape registration: Master's thesis in electronics engineering / supervisor Assoc. Prof. Dr. G. Daunys; Šiauliai University, Faculty of Technology, Department of Electronics. – Šiauliai, 2007. – 67 p. The topic of this master's thesis is relevant because the lips are one of the most important parts of visual speech. When people communicate, the most important information channels are speech and visual signs; to understand a user's instructions in a noisy environment, a computer must rely on the user's visual cues. In this work, lip shape registration is performed using OpenCV and the Matlab 7.0 package. The main goal is to construct and analyse an algorithm for accurate lip shape registration, and to investigate how to detect the speaker's lip shape robustly when its location varies. The lip shapes (contours) of four people were examined. The results show that lip registration involves several problems: lip extraction is difficult because of the similar skin colour, and extracting the inner lip contours is difficult because of visible articulators such as the teeth and tongue. Lip shape was registered most accurately using the green colour component (G). The results can be used for further research, such as training neural networks to recognise lip positions or letters.
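In the spirit of the thesis's main finding, the sketch below segments the mouth region on the green (G) component and separates outer from inner lip contours via the contour hierarchy; the Otsu threshold, the assumption of a pre-cropped mouth region of interest, and the hierarchy handling are illustrative choices, not the algorithm developed in the thesis.

```python
# Sketch: lip contours from the green colour component (G), which the thesis
# found gives the most accurate registration. Otsu thresholding and a pre-cropped
# mouth region of interest are assumptions, not the thesis's algorithm.
import cv2

def lip_contours_green(mouth_roi_bgr):
    g = mouth_roi_bgr[:, :, 1]                   # green channel (OpenCV uses BGR order)
    g = cv2.GaussianBlur(g, (5, 5), 0)
    # Lips are typically darker than the surrounding skin in the green channel,
    # so take the inverted Otsu threshold to get a lip mask.
    _, mask = cv2.threshold(g, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # RETR_CCOMP yields a two-level hierarchy: top-level contours (outer lip
    # boundary) and their holes (the inner, mouth-opening boundary).
    contours, hierarchy = cv2.findContours(mask, cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE)
    if hierarchy is None:
        return [], []
    outer = [c for c, h in zip(contours, hierarchy[0]) if h[3] == -1]
    inner = [c for c, h in zip(contours, hierarchy[0]) if h[3] != -1]
    return outer, inner
```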
|