
Audiovisual discrimination between laughter and speech

Laughter is clearly an audiovisual event, consisting of the laughter vocalisation and the accompanying facial activity around the mouth. Past research on automatic laughter classification has focused mainly on audio-based approaches. In this thesis we integrate information from the audio and video channels and show that this fusion can lead to improved performance over unimodal approaches. We investigated different types of audiovisual fusion, temporal modelling and feature sets in order to find the best combination. A novel prediction-based approach to combining audio and visual information is also proposed, which explicitly models the spatial and temporal relationships between audio and visual features. Experiments are presented both on matched training and test conditions, using subject-independent cross-validation on one database, and on unmatched conditions using six databases; the latter is a challenging scenario that is rarely addressed in the literature. A comparison of the different fusion approaches on these databases confirms that the proposed prediction-based method usually outperforms standard fusion methods. Since the lack of suitable data is a major obstacle to studying laughter, we introduce a new publicly available audiovisual database suitable for this purpose. It contains 22 subjects who were recorded while watching stimulus material, using two microphones, a video camera and a thermal camera. An analysis of the errors of the audio, video and audiovisual classifiers is also performed in terms of gender, language, laughter type and noise level, in order to gain insight into when visual information helps. Finally, we present results of the first attempt to discriminate between two types of laughter, voiced and unvoiced, in an audiovisual way.
Overall, it is demonstrated that in most cases the addition of visual information to audio improves performance in laughter-vs-speech discrimination, and that audiovisual fusion becomes increasingly beneficial as the audio noise level rises.
Date: January 2012
Creators: Petridis, Stavros
Publisher: Imperial College London
Source Sets: Ethos UK
Detected Language: English
Type: Electronic Thesis or Dissertation
