Blog: Multimodal Emotion Recognition
Emotion analysis is one of the challenges of this AI era. It can be applied to social media analysis and to reviewing user conversations to understand an audience. Understanding audience emotion helps improve communication effectiveness.
Input features involve not only text but also audio and video. You have to extract textual features (e.g. text representations), audio features (e.g. MFCCs, spectrograms) and visual features (e.g. object detection and classification outputs). Researchers leverage all of these features to build a comprehensive model.
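As a concrete example of one of the audio features mentioned above, a magnitude spectrogram can be computed with a short-time Fourier transform. This is a minimal numpy sketch; the frame length and hop size (25 ms / 10 ms at 16 kHz) are illustrative choices, not values from any of the papers covered here.

```python
import numpy as np

def stft_spectrogram(signal, frame_len=400, hop=160):
    """Magnitude spectrogram via a short-time Fourier transform.

    frame_len/hop are illustrative (25 ms / 10 ms at 16 kHz).
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-negative frequency bins
    return np.abs(np.fft.rfft(frames, axis=1))

# 1 second of a synthetic 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
spec = stft_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (n_frames, frame_len // 2 + 1)
```

MFCCs are derived from such a spectrogram by applying a mel filterbank, a log, and a discrete cosine transform; in practice a library such as librosa handles that pipeline.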
This story covers several research papers on multimodal emotion recognition, with the following experiments:
- Multimodal Speech Emotion Recognition using Audio and Text
- Benchmarking Multimodal Sentiment Analysis
- Multi-modal Emotion Recognition on IEMOCAP with Neural Networks
Multimodal Speech Emotion Recognition using Audio and Text
Yoon et al. propose a dual recurrent encoder model that leverages both text and audio features to obtain a better understanding of speech data.
Audio Recurrent Encoder (ARE)
Mel Frequency Cepstral Coefficient (MFCC) features are provided to the ARE. At every time step t, an MFCC feature vector is fed to the ARE; the resulting encoding is combined with prosodic features to generate a representation vector e, and a softmax function is applied to classify the audio into an emotion category.
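The ARE can be sketched as follows. This is a toy numpy version assuming a plain tanh RNN, random weights, and made-up dimensions (39 MFCC coefficients, 4 prosodic statistics, 4 emotion classes); the paper's actual encoder is a trained recurrent network.

```python
import numpy as np

rng = np.random.default_rng(0)

def simple_rnn(frames, W_x, W_h, b):
    """Plain tanh RNN; returns the last hidden state as the utterance encoding."""
    h = np.zeros(W_h.shape[0])
    for x_t in frames:                      # one MFCC frame per time step t
        h = np.tanh(W_x @ x_t + W_h @ h + b)
    return h

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

n_mfcc, hidden, n_prosodic, n_classes = 39, 16, 4, 4
W_x = rng.normal(size=(hidden, n_mfcc))
W_h = rng.normal(size=(hidden, hidden)) * 0.1
b = np.zeros(hidden)
W_out = rng.normal(size=(n_classes, hidden + n_prosodic))

mfcc = rng.normal(size=(120, n_mfcc))       # 120 frames of toy MFCC features
prosodic = rng.normal(size=n_prosodic)      # e.g. pitch/energy statistics
e_audio = np.concatenate([simple_rnn(mfcc, W_x, W_h, b), prosodic])
probs = softmax(W_out @ e_audio)            # class probabilities over emotions
```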
Text Recurrent Encoder (TRE)
On the other hand, the text transcript is used to generate text features. The text is tokenized and converted to 300-dimensional vectors. At every time step
t, the text representation is fed to the TRE, and a softmax function is applied to classify the text into an emotion category T.
Multimodal Dual Recurrent Encoder (MDRE)
The third model combines the ARE and TRE outputs and applies a final softmax function to get the emotion category.
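The fusion step can be sketched in a few lines. The encoder outputs and dimensions below are random stand-ins (in the paper they come from the trained ARE and TRE); only the late-fusion pattern (concatenate, then classify) is the point.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Stand-ins for the audio and text encoder outputs (illustrative 16-d each)
e_audio = rng.normal(size=16)
e_text = rng.normal(size=16)

W = rng.normal(size=(4, 32))                # 4 emotion classes, 32-d fused vector
b = np.zeros(4)

fused = np.concatenate([e_audio, e_text])   # late fusion of the two encoders
probs = softmax(W @ fused + b)
pred = int(np.argmax(probs))                # predicted emotion category
```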
Notice that MDRE (combining ARE and TRE) achieves the best result. It shows that combining text and audio features into a multimodal model beats a monomodal one.
ARE performs poorly at classifying the happy category, while TRE performs poorly at classifying the sad category. MDRE overcomes the limitations of both.
Benchmarking Multimodal Sentiment Analysis
Cambria et al. propose a method that combines text, audio and visual features to build a multimodal model for emotion recognition. Given a video, there are three pipelines to extract features, via a convolutional neural network (CNN) and openSMILE.
Rather than using a bag-of-words (BoW) approach, Cambria et al. use word2vec to get a meaningful text representation. In short, it is pre-trained on Google News. In the out-of-vocabulary (OOV) scenario, unknown words are initialized randomly.
Word vectors are concatenated per sentence, with a window of 50 words. These features are fed into a CNN to generate the text feature for the multimodal model.
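A minimal numpy sketch of a CNN over such a word matrix: a 50×300 sentence (padded to the fixed window) convolved with trigram filters and max-pooled over time. Filter count and width are illustrative, not the paper's hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(2)

def conv1d_maxpool(sent, filt):
    """Slide one filter over word positions and max-pool over time."""
    k = filt.shape[0]
    scores = [np.sum(sent[i:i + k] * filt) for i in range(sent.shape[0] - k + 1)]
    return max(scores)

max_len, dim = 50, 300                      # 50-word window, 300-d word2vec
words = rng.normal(size=(12, dim))          # a toy 12-word sentence
sent = np.zeros((max_len, dim))
sent[:len(words)] = words                   # pad to the fixed 50-word window

filters = [rng.normal(size=(3, dim)) for _ in range(8)]   # 8 trigram filters
features = np.array([conv1d_maxpool(sent, f) for f in filters])
print(features.shape)                       # one pooled value per filter
```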
Audio features are extracted by the well-known openSMILE library, at a 10 Hz frame rate with a 100 ms sliding window.
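The framing behind that setup (10 Hz rate, 100 ms window) can be sketched directly. The per-frame energy below is just one stand-in descriptor; openSMILE extracts many more low-level descriptors per frame.

```python
import numpy as np

def frame_signal(signal, sr, win_ms=100, rate_hz=10):
    """Split audio into 100 ms windows at a 10 Hz frame rate."""
    win = int(sr * win_ms / 1000)
    hop = int(sr / rate_hz)
    n = 1 + (len(signal) - win) // hop
    return np.stack([signal[i * hop : i * hop + win] for i in range(n)])

sr = 16000
frames = frame_signal(np.zeros(sr * 2), sr)   # 2 s of silence as toy input
# one low-level descriptor per frame, e.g. mean energy
energy = (frames ** 2).mean(axis=1)
print(frames.shape, energy.shape)
```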
Unlike text and audio features, visual features are very large. Cambria et al. use only every tenth frame and further reduce the resolution to save computing resources. From each retained frame, a Constrained Local Model (CLM) extracts the face outline, and a CNN produces the visual feature.
After obtaining features from text, audio and video, the vectors are concatenated and an SVM classifies the emotion category.
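The final fusion and classification step reduces to the pattern below. All dimensions and weights are made up; a trained linear one-vs-rest SVM amounts to per-class weight vectors, which here are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-ins for the three unimodal feature vectors (dimensions are made up)
text_f = rng.normal(size=100)
audio_f = rng.normal(size=40)
visual_f = rng.normal(size=60)
x = np.concatenate([text_f, audio_f, visual_f])   # early fusion: one 200-d vector

# A trained linear SVM reduces to per-class weights; random stand-ins here.
n_classes = 4
W = rng.normal(size=(n_classes, x.size))
b = rng.normal(size=n_classes)
scores = W @ x + b               # one-vs-rest decision scores
pred = int(np.argmax(scores))    # predicted emotion category
```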
As in the previous result, more features lead to a better result. Leveraging all of text, audio and video, the multimodal model achieves the best results on IEMOCAP, MOUD and MOSI.
Cambria et al. observe other patterns as well. First of all, they compared speaker-dependent and speaker-independent learning. Experiments show that speaker-independent learning performs worse than speaker-dependent learning, possibly due to a lack of training data to generalize across speakers' utterances.
Multi-modal Emotion Recognition on IEMOCAP with Neural Networks
Tripathi and Beigi propose speech-based, text-based and motion-capture-based models, then combine them into a multi-modal model.
Speech Based Emotion Detection
Like other classic audio models, it leverages MFCC, chromagram-based and time-spectral features. The authors also evaluate mel spectrograms and different window setups to see how those features affect model performance.
Text based Emotion Recognition
The text model leverages GloVe to convert text to vectors, which are passed through stacked CNN/LSTM layers to train a feature.
MoCap based Emotion Detection
Motion Capture (MoCap) records the facial expressions, head and hand movements of the actor. As with text, the data is passed to a CNN/LSTM model to train a feature.
Combined Multi-modal Emotion Detection
As in the aforementioned multimodal models, the authors concatenate the feature vectors and use a softmax function to classify the emotion.
The following figure shows that combining text, audio and visual features leads to the best results. The architecture of the model 6 variant is shown in the model architecture figure above.
- In general, text features contribute more than audio and visual features across datasets and models. Audio/visual features help to further boost the model.
- Although audio and visual features provide an improvement, they increase training complexity and errors.
Like to learn?
I am a Data Scientist in the Bay Area, focusing on the state of the art in Data Science and Artificial Intelligence, especially NLP and platform-related work. Feel free to connect with me on LinkedIn or follow me on Medium or Github.
- S. Yoon, S. Byun, and K. Jung. Multimodal Speech Emotion Recognition using Audio and Text. 2018.
- E. Cambria, D. Hazarika, S. Poria, A. Hussain and R.B.V. Subramanyam. Benchmarking Multimodal Sentiment Analysis. 2017.
- S. Tripathi and H. Beigi. Multi-modal Emotion Recognition on IEMOCAP with Neural Networks. 2018.