
Decoding visemes: improving machine lip-reading

This thesis is about improving machine lip-reading, that is, the classification of speech from only the visual cues of a speaker. Machine lip-reading is a niche research problem spanning both speech processing and computer vision. Current challenges for machine lip-reading fall into two groups: the content of the video, such as the rate at which a person is speaking; and the parameters of the video recording, for example the video resolution. We begin our work with a literature review to understand the restrictions that current technology places on machine lip-reading recognition, and we conduct an experiment into the effects of resolution. We show that high-definition video is not needed to lip-read successfully with a computer.

The term 'viseme' is used in machine lip-reading to represent a visual cue or gesture which corresponds to a subgroup of phonemes that are indistinguishable in the visual speech signal. Whilst a viseme is yet to be formally defined, we use the common working definition: 'a viseme is a group of phonemes with identical appearance on the lips'. A phoneme is the smallest acoustic unit a human can utter. Because several phonemes map to each viseme, the mapping between the two units is many-to-one. Many such mappings have been presented, and we conduct an experiment to determine which mapping produces the most accurate classification. Our results show that Lee's [82] is best. Classification with Lee's map also outperforms machine lip-reading systems which use the popular Fisher [48] phoneme-to-viseme map. Further to this, we propose three methods of deriving speaker-dependent phoneme-to-viseme maps and compare our new approaches to Lee's. Our results show the sensitivity of phoneme clustering, and we use this knowledge for our first suggested augmentation to the conventional lip-reading system.

Speaker independence in machine lip-reading classification is another unsolved obstacle. It has been observed, in the visual domain, that classifiers need to be trained on the test subject to achieve the best classification; machine lip-reading is therefore highly dependent upon the speaker. Speaker independence is the opposite of this: the classification of a speaker who is not present in the classifier's training data. We investigate the dependence of phoneme-to-viseme maps between speakers. Our results show that there is not high variability between the visual cues themselves, but there is high variability in the trajectories between visual cues of an individual speaker with the same ground truth. This implies a dependency upon the number of visemes within each set for each individual. Finally, we investigate the optimal number of visemes within a set. We show that the phoneme-to-viseme maps in the literature rarely have enough visemes, and that the optimal number, which varies by speaker, ranges from 11 to 35.

The last difficulty we address is decoding from visemes back to phonemes and into words. Traditionally this is done using a language model, whose unit is either the same as the classifier's (e.g. visemes or phonemes) or the word. In a novel approach, we use these optimal-range viseme sets within hierarchical training of phoneme-labelled classifiers. This new method of classifier training demonstrates a significant increase in classification performance with a word language network.
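The many-to-one phoneme-to-viseme relationship described above can be pictured as a simple lookup table from phonemes to viseme classes, whose inverse is one-to-many and is what makes decoding back to words ambiguous without a language model. The sketch below is illustrative only: the groupings, viseme names, and functions are assumptions for exposition and are not Lee's [82] or Fisher's [48] actual maps.

```python
# Minimal sketch of a many-to-one phoneme-to-viseme map.
# The groupings are illustrative, e.g. the visually similar bilabials
# /p/, /b/, /m/ share one viseme class; they are not a published map.
PHONEME_TO_VISEME = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "f": "V_labiodental", "v": "V_labiodental",
    "t": "V_alveolar", "d": "V_alveolar", "n": "V_alveolar",
    "s": "V_alveolar", "z": "V_alveolar",
    "k": "V_velar", "g": "V_velar",
}

def phonemes_to_visemes(phonemes):
    """Relabel a phoneme transcription with viseme classes (many-to-one)."""
    return [PHONEME_TO_VISEME.get(p, "V_other") for p in phonemes]

def viseme_to_phonemes(viseme):
    """Invert the map: one viseme covers several phonemes, which is why
    decoding from visemes back to phonemes and words is ambiguous."""
    return sorted(p for p, v in PHONEME_TO_VISEME.items() if v == viseme)

if __name__ == "__main__":
    print(phonemes_to_visemes(["b", "ae", "t"]))  # ['V_bilabial', 'V_other', 'V_alveolar']
    print(viseme_to_phonemes("V_bilabial"))       # ['b', 'm', 'p']
```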

Identifier oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:687930
Date January 2016
Creators Bear, Helen L.
Publisher University of East Anglia
Source Sets Ethos UK
Detected Language English
Type Electronic Thesis or Dissertation
Source https://ueaeprints.uea.ac.uk/59384/
