1 |
A novel lip geometry approach for audio-visual speech recognition. Ibrahim, Zamri. January 2014.
By identifying lip movements and characterizing their associations with speech sounds, the performance of speech recognition systems can be improved, particularly when operating in noisy environments. Various methods have been studied by research groups around the world in recent years to incorporate lip movements into speech recognition; however, exactly how best to incorporate the additional visual information is still not known. This study aims to extend the knowledge of the relationships between visual and speech information, specifically using lip geometry information because of its robustness to head rotation and the smaller number of features required to represent movement. A new method has been developed to extract lip geometry information, to perform classification and to integrate the visual and speech modalities.

This thesis makes several contributions. First, it presents a new method to extract lip geometry features using a combination of a skin colour filter, a border following algorithm and a convex hull approach. The proposed method was found to improve lip shape extraction performance compared to existing approaches. Lip geometry features including height, width, ratio, area, perimeter and various combinations of these features were evaluated to determine which performs best when representing speech in the visual domain. Second, a novel template matching technique has been developed that adapts to dynamic differences in the way words are uttered by speakers and determines the best fit of an unseen feature signal to those stored in a template database. Third, following an evaluation of integration strategies, a novel method has been developed based on an alternative decision fusion strategy, in which the outcome from the visual or the speech modality is chosen by measuring the quality of the audio through kurtosis and skewness analysis, driven by white noise confusion. Finally, the performance of the new methods introduced in this work is evaluated using the CUAVE and LUNA-V corpora under a range of signal-to-noise-ratio conditions using the NOISEX-92 dataset.
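A minimal sketch of the lip-geometry extraction step described above, assuming an OpenCV-based pipeline in Python; the HSV thresholds and the helper name are illustrative assumptions rather than the thesis implementation, which combines a skin colour filter, border following and a convex hull to derive height, width, ratio, area and perimeter features.

import cv2
import numpy as np

def lip_geometry_features(mouth_roi_bgr):
    # Return (height, width, ratio, area, perimeter) for the dominant lip region.
    # Skin/lip colour filter in HSV space (threshold values are assumptions).
    hsv = cv2.cvtColor(mouth_roi_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, (0, 60, 60), (20, 255, 255))
    # Border following: external contours of the filtered region.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    lip = max(contours, key=cv2.contourArea)
    # Convex hull smooths the ragged lip boundary before geometry is measured.
    hull = cv2.convexHull(lip)
    _, _, w, h = cv2.boundingRect(hull)
    area = cv2.contourArea(hull)
    perimeter = cv2.arcLength(hull, True)
    return h, w, h / float(w), area, perimeter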
2 |
A Multimodal Sensor Fusion Architecture for Audio-Visual Speech Recognition. Makkook, Mustapha. January 2007.
A key requirement for developing any innovative computing system is to provide an interface that is sufficiently friendly to the average end user. Careful design of such a user-centered interface, however, means more than just the ergonomics of the panels and displays; it also requires that designers precisely define what information to use and how, where, and when to use it. Recent advances in the user-centered design of computing systems suggest that multimodal integration can provide different types and levels of intelligence to the user interface. This thesis aims to improve speech recognition-based interfaces by making use of the visual modality conveyed by the movements of the lips.
Designing a good visual front end is a major part of this framework.
For this purpose, this work derives the optical flow fields for
consecutive frames of people speaking. Independent Component
Analysis (ICA) is then used to derive basis flow fields. The
coefficients of these basis fields comprise the visual features of
interest. It is shown that using ICA on optical flow fields yields
better classification results than the traditional approaches based
on Principal Component Analysis (PCA). In fact, ICA can capture
higher order statistics that are needed to understand the motion of
the mouth. This is because lip movement is inherently complex, involving large image velocities, self-occlusion (due to the appearance and disappearance of the teeth) and considerable non-rigidity.
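A minimal sketch of this visual front end, assuming dense Farneback optical flow and scikit-learn's FastICA; the flow parameters and number of components are assumptions, not the values used in the thesis.

import cv2
import numpy as np
from sklearn.decomposition import FastICA

def flow_matrix(gray_frames):
    # Stack dense optical flow fields for consecutive frames as row vectors.
    rows = []
    for prev, curr in zip(gray_frames[:-1], gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        rows.append(flow.reshape(-1))
    return np.vstack(rows)

def ica_visual_features(gray_frames, n_components=20):
    X = flow_matrix(gray_frames)
    ica = FastICA(n_components=n_components, random_state=0)
    coeffs = ica.fit_transform(X)   # coefficients of the basis fields: the visual features
    basis_fields = ica.mixing_.T    # each row is a basis flow field
    return coeffs, basis_fields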
Another issue of great interest to designers of audio-visual speech recognition systems is the integration (fusion) of the audio and visual information into an automatic speech recognizer.
For this purpose, a reliability-driven sensor fusion scheme is
developed. A statistical approach is developed to account for the
dynamic changes in reliability. This is done in two steps. The first
step derives suitable statistical reliability measures for the
individual information streams. These measures are based on the
dispersion of the N-best hypotheses of the individual stream
classifiers. The second step finds an optimal mapping between the
reliability measures and the stream weights that maximizes the
conditional likelihood. For this purpose, genetic algorithms are
used.
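A minimal sketch of the reliability-driven weighting idea, with the N-best dispersion measure and a simple relative-reliability mapping standing in for the GA-optimised mapping described here; the function names and the placeholder mapping are assumptions.

import numpy as np

def nbest_dispersion(nbest_scores):
    # Reliability proxy: average gap between the best hypothesis score and the rest.
    s = np.sort(np.asarray(nbest_scores, dtype=float))[::-1]
    return float(np.mean(s[0] - s[1:]))

def fuse_streams(audio_ll, visual_ll, audio_nbest, visual_nbest):
    # Weighted combination of per-class log-likelihoods from the two streams.
    ra = nbest_dispersion(audio_nbest)
    rv = nbest_dispersion(visual_nbest)
    lam = ra / (ra + rv + 1e-12)    # placeholder mapping from reliabilities to a stream weight
    return lam * np.asarray(audio_ll) + (1.0 - lam) * np.asarray(visual_ll)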
The issues addressed are challenging problems and are fundamental to developing an audio-visual speech recognition framework that can maximize the information gathered about the words uttered and minimize the impact of noise.
3 |
Video Analysis of Mouth Movement Using Motion Templates for Computer-based Lip-Reading. Yau, Wai Chee (waichee@ieee.org). January 2008.
This thesis presents a novel lip-reading approach to classifying utterances from video data, without evaluating voice signals. The work addresses two important issues: the efficient representation of mouth movement for visual speech recognition and the temporal segmentation of utterances from video.

The first part of the thesis describes a robust movement-based technique used to identify mouth movement patterns while uttering phonemes. This method temporally integrates the video data of each phoneme into a 2-D grayscale image known as a motion template (MT), a view-based approach that implicitly encodes the temporal component of an image sequence into a scalar-valued MT. The data size was reduced by extracting image descriptors such as Zernike moments (ZM) and discrete cosine transform (DCT) coefficients from the MT, and support vector machine (SVM) and hidden Markov model (HMM) classifiers were used to classify the feature descriptors. A video speech corpus of 2800 utterances was collected to evaluate the efficacy of MT for lip-reading, and the experimental results demonstrate its promising performance in representing mouth movement. The advantages and limitations of MT for visual speech recognition were identified and validated through experiments. A comparison between ZM and DCT features indicates that the classification accuracy of the two methods is very comparable when there is no relative motion between the camera and the mouth; however, ZM features are resilient to camera rotation and continue to give good results, whereas DCT features are sensitive to rotation. DCT features are demonstrated to have better tolerance to image noise than ZM. The results also demonstrate a slight improvement of 5% using SVM as compared to HMM.

The second part of the thesis describes a video-based temporal segmentation framework that detects key frames corresponding to the start and stop of utterances in an image sequence, without using the acoustic signals. This segmentation technique integrates mouth movement and appearance information. Its efficacy was tested through experimental evaluation and satisfactory performance was achieved; the method has been demonstrated to perform efficiently for utterances separated by short pauses. Potential applications for lip-reading technologies include human-computer interfaces (HCI) for mobility-impaired users, defense applications that require voiceless communication, lip-reading mobile phones, in-vehicle systems, and the improvement of speech-based computer control in noisy environments.
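A minimal sketch of the motion-template pipeline described above, assuming grayscale frame stacks per utterance; the decay scheme, thresholds and DCT block size are assumptions, and Zernike moments would replace the DCT step analogously.

import numpy as np
from scipy.fft import dctn
from sklearn.svm import SVC

def motion_template(gray_frames, decay=0.9, thresh=15):
    # Fold an utterance's frame differences into one scalar-valued grayscale image.
    mt = np.zeros_like(gray_frames[0], dtype=np.float32)
    for prev, curr in zip(gray_frames[:-1], gray_frames[1:]):
        moving = np.abs(curr.astype(np.float32) - prev.astype(np.float32)) > thresh
        mt *= decay          # older motion fades
        mt[moving] = 1.0     # the most recent motion is brightest
    return mt

def dct_descriptor(mt, keep=8):
    # Low-frequency 2-D DCT coefficients as a compact descriptor of the template.
    return dctn(mt, norm="ortho")[:keep, :keep].reshape(-1)

# Usage sketch: given per-utterance frame stacks and their labels,
#   X = [dct_descriptor(motion_template(frames)) for frames in utterances]
#   clf = SVC(kernel="rbf").fit(X, labels)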