
Feature Fusion Deep Learning Method for Video and Audio Based Emotion Recognition

In this thesis, we propose a deep learning based emotion recognition system designed to
improve classification accuracy. We first use transfer learning to extract visual features
and Mel-frequency cepstral coefficients (MFCC) to extract audio features, and then apply
recurrent neural networks (RNN) with an attention mechanism to process the sequential
inputs. The outputs of both channels are then fused in a concatenation layer, which is
processed with batch normalization to reduce internal covariate shift. Finally, the
classification result is obtained from the softmax layer. In our experiments, the video and
audio subsystems achieve 78% and 77% accuracy respectively, while the feature fusion system
combining video and audio achieves 92% accuracy on the RAVDESS dataset for eight emotion
classes. Our proposed feature fusion system outperforms conventional methods in
classification accuracy.
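The fusion head described above — concatenating the two channel outputs, normalizing the result, and classifying with softmax — can be sketched as follows. This is a minimal NumPy illustration, not the thesis's implementation: the batch size, feature dimensions, epsilon, and random classifier weights are all assumptions for demonstration; only the number of classes (8, per RAVDESS) comes from the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)
video_out = rng.normal(size=(32, 128))  # assumed video-channel output vectors (batch of 32)
audio_out = rng.normal(size=(32, 64))   # assumed audio-channel output vectors

# 1. Concatenation layer: join the two channels feature-wise.
fused = np.concatenate([video_out, audio_out], axis=1)  # shape (32, 192)

# 2. Batch normalization (inference-style, per feature), used in the
#    thesis to reduce internal covariate shift.
mean = fused.mean(axis=0)
var = fused.var(axis=0)
normed = (fused - mean) / np.sqrt(var + 1e-5)  # epsilon is an assumed value

# 3. Softmax layer over the 8 RAVDESS emotion classes; random weights
#    stand in for the trained classifier.
W = rng.normal(size=(192, 8))
logits = normed @ W
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)

print(probs.shape)  # (32, 8): one probability distribution per sample
```

Each row of `probs` sums to 1, so the predicted emotion for a sample is simply `probs[i].argmax()`.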

  1. DOI: 10.25394/pgs.17161157.v1
Identifier: oai:union.ndltd.org:purdue.edu/oai:figshare.com:article/17161157
Date: 20 December 2021
Creators: Yanan Song (11825003)
Source Sets: Purdue University
Detected Language: English
Type: Text, Thesis
Rights: CC BY 4.0
Relation: https://figshare.com/articles/thesis/Feature_Fusion_Deep_Learning_Method_for_Video_and_Audio_Based_Emotion_Recognition/17161157
