Sunday, July 19, 2015

Emotion Recognition and Tone Characterization by DNN + ELM

Let's face it, emotions are tricky!

Method

After struggling with emotion recognition for a long time, I decided to implement the method proposed by Microsoft Research at last year's Interspeech conference: an approach using a deep neural network (DNN) and an extreme learning machine (ELM). In particular, a DNN is trained to extract segment-level (256 ms) features, and an ELM is trained to make decisions at the utterance level based on the statistics of these features.
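To make the idea concrete, here is a rough Python sketch of the utterance-level step, assuming the DNN has already produced per-segment emotion probabilities; the statistics chosen here (max, min, mean) and the array shapes are illustrative, not necessarily the exact ones used in the paper or in my code.

# Illustrative sketch of the DNN + ELM pipeline: the DNN outputs per-segment
# emotion probabilities, and an utterance-level feature vector is built from
# simple statistics over those probabilities before being passed to the ELM.
import numpy as np

def utterance_features(segment_probs):
    """segment_probs: (num_segments, num_emotions) array of DNN outputs.
    Returns one fixed-length utterance-level feature vector for the ELM."""
    stats = [
        segment_probs.max(axis=0),    # strongest evidence for each emotion
        segment_probs.min(axis=0),    # weakest evidence for each emotion
        segment_probs.mean(axis=0),   # average evidence for each emotion
    ]
    return np.concatenate(stats)

# Example: 40 segments from one utterance, 4 emotion classes.
probs = np.random.rand(40, 4)
probs /= probs.sum(axis=1, keepdims=True)    # make each row a distribution
print(utterance_features(probs).shape)       # (12,) -> one vector per utterance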

The training was done using the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database from here.

Experimental results in the paper demonstrate that this approach outperforms state-of-the-art systems, including OpenEAR.

Here you can also find a video presentation by the author Kun Han himself.

My implementation

All the code can be found here.

Dependencies:

python speech features library: used to extract MFCC features from the audio files. It can be replaced by the common feature extraction component once the Red Hen Lab Audio Analysis pipeline is established, but to run the current code this library must be included in the working folder.
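For reference, the MFCC call looks roughly like the sketch below; the file name, window settings, and number of cepstral coefficients are placeholders rather than the values hard-coded in my scripts, and older copies of the library ship as a package named features instead of python_speech_features.

# Minimal MFCC extraction sketch using the python speech features library.
# "utterance.wav" and the parameter values below are placeholders.
import scipy.io.wavfile as wav
from python_speech_features import mfcc   # older copies: from features import mfcc

rate, signal = wav.read("utterance.wav")
# 25 ms windows with a 10 ms step; one 13-dimensional MFCC vector per frame.
mfcc_feat = mfcc(signal, samplerate=rate, winlen=0.025, winstep=0.01, numcep=13)
print(mfcc_feat.shape)   # (num_frames, 13)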

PDNN: A Python Toolkit for Deep Learning, used here to extract the segment-level features with the deep neural network.

Python files:

energy.py
Takes a speech signal and returns the indices of the frames in the top 10% by energy.
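Conceptually, energy.py does something like the following sketch; the frame length, frame step, and the exact energy measure here are illustrative placeholders, not necessarily what the file uses.

# Sketch of the idea behind energy.py: frame the signal, compute per-frame
# energy, and keep the indices of the top 10% most energetic frames.
import numpy as np

def top_energy_frames(signal, frame_len=400, frame_step=160, top_fraction=0.1):
    """Return the indices of the frames whose energy is in the top `top_fraction`."""
    signal = np.asarray(signal, dtype=float)
    num_frames = 1 + max(0, (len(signal) - frame_len) // frame_step)
    energies = np.array([
        np.sum(signal[i * frame_step: i * frame_step + frame_len] ** 2)
        for i in range(num_frames)
    ])
    keep = max(1, int(np.ceil(top_fraction * num_frames)))
    return np.argsort(energies)[::-1][:keep]   # loudest frames first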

Given two audio folders (training and validation; see "Folder Structure" below), extracts segment-level features from the audio files in these folders for DNN training.

Given one (testing) audio folder, extracts segment-level features from the audio files in that folder for DNN feature extraction.

Trains the ELM on the probability features extracted by the DNN.
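For context, a basic single-hidden-layer ELM can be trained in closed form, roughly as in the sketch below; the hidden-layer size, activation, and random seed are illustrative choices, not the settings used in my code.

# Hedged sketch of ELM training on utterance-level features: random input
# weights and biases, then output weights solved by least squares (pseudo-inverse).
import numpy as np

def train_elm(X, T, hidden_units=120):
    """X: (n_samples, n_features) features; T: (n_samples, n_classes) one-hot labels."""
    rng = np.random.RandomState(0)
    W = rng.uniform(-1, 1, size=(X.shape[1], hidden_units))   # random input weights
    b = rng.uniform(-1, 1, size=hidden_units)                 # random biases
    H = 1.0 / (1.0 + np.exp(-(X.dot(W) + b)))                 # sigmoid hidden layer
    beta = np.linalg.pinv(H).dot(T)                           # output weights in closed form
    return W, b, beta

def predict_elm(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X.dot(W) + b)))
    return np.argmax(H.dot(beta), axis=1)                     # predicted class indices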

Writes the recognition results for the test files to Results.txt.

Folder Structure:

wav_train (training folder):
contains one subfolder per emotion (or tone) to be recognized; each subfolder corresponds to one emotion (or tone) and holds its utterances, one utterance per .wav audio file.

wav_valid (validation folder):
has the same structure as wav_train: one subfolder per emotion (or tone), with one utterance per .wav audio file inside.

wav_test (testing folder):
contains one utterance per .wav audio file (no subfolders). These files will be annotated and the results written to Results.txt.
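To make the expected layout concrete, collecting (file, label) pairs from a folder organized this way could look like the sketch below; the folder name wav_train comes from this post, while the emotion names in the comment are just examples.

# Sketch: collect (wav_path, emotion_label) pairs from a wav_train-style folder,
# where each subfolder name is the emotion label and each .wav file is one utterance.
import os

def collect_utterances(root="wav_train"):
    pairs = []
    for emotion in sorted(os.listdir(root)):        # e.g. "angry", "happy", "neutral", "sad"
        subfolder = os.path.join(root, emotion)
        if not os.path.isdir(subfolder):
            continue
        for name in sorted(os.listdir(subfolder)):
            if name.lower().endswith(".wav"):
                pairs.append((os.path.join(subfolder, name), emotion))
    return pairs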

Shell Scripts:

train.sh: extracts the features and trains the DNN and the ELM. You can specify the path to PDNN, the training folder, and the validation folder in this file.

test.sh: extracts the features of the test files and recognizes the emotions of the utterance files in the test folder. You can specify the path to PDNN and the testing folder here.

Intermediate Files:

dnn.param
contains the parameters of the trained deep neural network

dnn.cfg
contains the configurations of the trained deep neural network

ELMWeights.pickle.gz
contains weight parameters of the extreme learning machine

LabelNumMap.pickle.gz
contains the mapping between the emotion string labels (the subfolder names in the training/validation folders) and the integer labels used during DNN training.
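Conceptually, this file is just a small pickled dictionary; a sketch of how such a map could be built and saved is below, with example emotion names rather than the actual labels used in my experiments.

# Sketch: build and save a string-label -> integer-label map as a gzipped pickle.
import gzip, pickle

emotions = ["angry", "happy", "neutral", "sad"]          # example subfolder names
label_map = {emotion: i for i, emotion in enumerate(sorted(emotions))}

with gzip.open("LabelNumMap.pickle.gz", "wb") as f:
    pickle.dump(label_map, f)

with gzip.open("LabelNumMap.pickle.gz", "rb") as f:
    print(pickle.load(f))   # {'angry': 0, 'happy': 1, 'neutral': 2, 'sad': 3}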

Output File:

Results.txt
the recognition results

Usage Example:

To train a new emotion recognizer (after specifying the training and validation data):
./train.sh

To predict emotions on new audio files with the trained model:
./test.sh