
Blog: Multimodal for Emotion Recognition


Emotion Analysis

Photo by Vince Fleming on Unsplash

Emotion analysis is one of the challenges in this AI era. It can be applied to social media analysis and to reviewing user conversations to understand an audience. Understanding audience emotion helps to improve communication effectiveness.

Input features involve not only text but also audio and video. You have to extract features from text (e.g. text representations), audio (e.g. MFCC, spectrograms) and visuals (e.g. object detection and classification). Researchers leverage all of these features to build a comprehensive model.


This story will cover several research works on multimodal emotion recognition, with the following experiments:

  • Multimodal Speech Emotion Recognition using Audio and Text
  • Benchmarking Multimodal Sentiment Analysis
  • Multi-modal Emotion Recognition on IEMOCAP with Neural Networks

Multimodal Speech Emotion Recognition using Audio and Text

Yoon et al. propose a dual recurrent encoder model which leverages both text and audio features to obtain a better understanding of speech data.

Audio Recurrent Encoder (ARE)

Mel Frequency Cepstral Coefficient (MFCC) features are fed to the ARE. At every time step t, an MFCC feature is fed to the ARE and combined with prosodic features to generate a representation vector e. A softmax function is then applied to classify the audio emotion as A.
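To make the "feature per time step t" idea concrete, here is a minimal sketch of frame-level audio feature extraction in plain numpy. It computes a simple log-power spectrum per frame as a stand-in for real MFCCs (in practice you would use a library such as librosa); all frame sizes are illustrative assumptions, not values from the paper.

```python
import numpy as np

def frame_log_power_features(signal, frame_len=400, hop=160, n_bins=40):
    """Split a waveform into overlapping frames and compute a simple
    log-power spectrum per frame -- a simplified stand-in for MFCCs."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    feats = []
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        # keep the first n_bins frequency bins as the frame feature
        feats.append(np.log(power[:n_bins] + 1e-10))
    return np.stack(feats)  # shape: (time steps t, n_bins)

# one second of fake 16 kHz audio
sig = np.random.default_rng(0).standard_normal(16000)
X = frame_log_power_features(sig)
print(X.shape)  # one 40-d feature vector per time step t
```

Each row of X corresponds to one time step t that the recurrent encoder would consume before emitting the final representation vector e.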

Text Recurrent Encoder (TRE)

On the other hand, the text transcript is used to generate text features. Text is tokenized and converted into 300-dimensional vectors. At every time step t, a text representation is fed to the TRE, and a softmax function is applied to classify the text emotion as T.

Multimodal Dual Recurrent Encoder (MDRE)

The third model combines both the ARE and TRE outputs and applies a final softmax function to obtain the emotion category.
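The fusion step above can be sketched in a few lines: concatenate the two encoder representations and apply a linear layer plus softmax. This is a toy, untrained sketch; the dimensions, class count and random weights are my assumptions, not the paper's.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(42)
e_audio = rng.standard_normal(128)  # ARE representation vector
e_text = rng.standard_normal(128)   # TRE representation vector
n_classes = 4                        # e.g. angry, happy, sad, neutral

# MDRE-style late fusion: concatenate both encoder outputs,
# then a (randomly initialised, untrained) linear layer + softmax
W = rng.standard_normal((n_classes, 256)) * 0.01
b = np.zeros(n_classes)
probs = softmax(W @ np.concatenate([e_audio, e_text]) + b)
print(probs)  # a probability per emotion class
```

In the real model, W and b would of course be learned jointly with both encoders rather than sampled at random.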

Dual Recurrent Encoder Architecture (Yoon et al., 2018)


You can see that MDRE (combining ARE and TRE) achieves the best result. It shows that combining text and audio features into a multimodal model is better than a monomodal one.

ARE performed poorly on the happy category, while TRE performed poorly on the sad category. MDRE overcomes the limitations of both ARE and TRE.

Comparison of results among ARE, TRE and MDRE (Yoon et al., 2018)

Benchmarking Multimodal Sentiment Analysis

Cambria et al. propose a method that includes text, audio and visual features to build their multimodal model for emotion recognition. Given a video, there are 3 pipelines to extract features, via a convolutional neural network (CNN) and openSMILE.

Text Features

Rather than using a bag-of-words (BoW) approach, Cambria et al. use word2vec to get a meaningful text representation. In short, it is pre-trained on Google News. For the out-of-vocabulary (OOV) scenario, unknown words are initialized randomly.

Word vectors are concatenated per sentence, with a window of 50 words. These features are fed into a CNN to generate the text feature for the multimodal model.
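A minimal sketch of that lookup step, assuming a tiny toy vocabulary in place of the real pre-trained Google News word2vec table: stack each word's 300-d vector into a fixed 50 x 300 matrix, give OOV words a random vector, and zero-pad short sentences.

```python
import numpy as np

DIM, WINDOW = 300, 50
rng = np.random.default_rng(0)

# toy stand-in for the pre-trained Google News word2vec table
word2vec = {w: rng.standard_normal(DIM) for w in ["the", "movie", "was", "great"]}

def sentence_matrix(tokens):
    """Stack word vectors into a fixed 50 x 300 matrix for the CNN.
    OOV words get a random vector; short sentences are zero-padded."""
    rows = [word2vec.get(t, rng.standard_normal(DIM)) for t in tokens[:WINDOW]]
    rows += [np.zeros(DIM)] * (WINDOW - len(rows))
    return np.stack(rows)

m = sentence_matrix("the movie was really great".split())
print(m.shape)  # (50, 300), ready to feed into the CNN
```

Here "really" is out-of-vocabulary, so it receives a random vector, exactly the OOV handling described above.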

Audio Features

Audio features are extracted by a well-known library, openSMILE. Features are extracted at a 10 Hz rate with a 100 ms sliding window.

Visual Features

Unlike text and audio features, visual features are very large. Cambria et al. use every tenth frame and further reduce the resolution to save computing resources. After obtaining the visual frames, a Constrained Local Model (CLM) is used to extract the face outline, and the visual feature is then generated via a CNN.
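The cost-saving subsampling step is easy to picture with numpy slicing. This sketch assumes a fake grayscale video and halves the resolution by striding; a real pipeline would use proper image resampling (e.g. OpenCV) rather than strided slicing.

```python
import numpy as np

# fake video: 300 frames of 240 x 320 grayscale
video = np.random.default_rng(1).random((300, 240, 320))

# keep every tenth frame, then halve the resolution,
# mirroring the cost-saving step before CLM + CNN
frames = video[::10]          # (30, 240, 320)
small = frames[:, ::2, ::2]   # (30, 120, 160)
print(small.shape)
```

From 300 full-resolution frames we are down to 30 quarter-size frames, a roughly 40x reduction in pixels to process.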


After getting features from text, audio and video, those vectors are concatenated and an SVM is used to classify the emotion category.

Multimodal Sentiment Analysis Architecture (Cambria et al., 2017)


As in the previous result, more features lead to a better result. Leveraging all of text, audio and video, the multimodal model achieves the best result on IEMOCAP, MOUD and MOSI.

Comparison result among unimodal, bimodal and multimodal (Cambria et al., 2017)

Cambria et al. observe other patterns. First of all, they compare speaker-dependent and speaker-independent learning. Experiments show that speaker-independent learning performs worse than speaker-dependent learning. This may be due to a lack of training data to generalize across speaker utterances.

Multi-modal Emotion Recognition on IEMOCAP with Neural Networks

Tripathi and Beigi propose a multi-modal model on IEMOCAP that combines speech, text and motion-capture features.

Speech Based Emotion Detection

As in other classic audio models, MFCC, chromagram-based and time spectral features are leveraged. The authors also evaluate mel spectrograms and different window setups to see how those features affect model performance.
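One concrete effect of "different window setups" is simply how many feature frames the model sees. A quick sketch of that arithmetic, assuming a 3-second, 16 kHz signal and window/hop sizes chosen for illustration:

```python
def n_frames(n_samples, win, hop):
    """Number of analysis frames for a given window and hop size."""
    return 1 + (n_samples - win) // hop

sr, seconds = 16000, 3
for win_ms, hop_ms in [(25, 10), (50, 25), (100, 50)]:
    win = sr * win_ms // 1000
    hop = sr * hop_ms // 1000
    print(f"{win_ms} ms window / {hop_ms} ms hop ->",
          n_frames(sr * seconds, win, hop), "frames")
```

Longer windows mean fewer, smoother frames with more spectral detail per frame; shorter windows give the recurrent model a longer, finer-grained sequence.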

Text based Emotion Recognition

The text model leverages GloVe to convert text into vectors, which are passed to multiple CNN/LSTM layers to learn a feature.

MoCap based Emotion Detection

Motion Capture (MoCap) records the facial expression, head and hand movements of the actor. As with text, these are passed to a CNN/LSTM model to learn a feature.

Combined Multi-modal Emotion Detection

As in the aforementioned multimodal models, the authors concatenate those vectors and use a softmax function to classify the emotion.

Model Architecture (Tripathi and Beigi, 2018)


The following figure shows that combining text, audio and visual features achieves the best results. The model 6 architecture is the one shown in the model architecture figure above.

Performance comparison among models (Tripathi and Beigi, 2018)

Take Away

  • In general, text features contribute more than audio and visual features across datasets and models. Adding audio/visual features helps to further boost the model.
  • Although audio and visual features provide an improvement, they increase training complexity and introduce additional sources of error.

Like to learn?

I am a Data Scientist in the Bay Area, focusing on the state of the art in Data Science and Artificial Intelligence, especially NLP and platform-related topics. Feel free to connect with me on LinkedIn or follow me on Medium or Github.



Source: Artificial Intelligence on Medium
