Monday, June 22, 2015

Week 2 and Week 4: Gender Identification and Speaker Recognition

In this post, I explain my work on gender identification and speaker recognition.

Toolkits used:

Librosa: A Python package for Music and Audio Analysis

SciKit-Learn: Machine Learning in Python

I use librosa to load audio files and extract features from audio signals. I chose it for now because it is a lightweight open-source library with a nice Python interface and IPython support. It can also be integrated with SciKit-Learn to form a feature extraction pipeline for machine learning. This is enough for moderately complex tasks such as speaker recognition.

SciKit-Learn is used to train a UBM/GMM (a universal background model in the form of a Gaussian mixture model) on MFCC features.

Data preprocessing:

I trained a UBM with 32 Gaussian components on a dataset of standardised MFCC vectors extracted from speech signals of multiple female and male speakers.

For every standardised MFCC vector, its probability in each Gaussian component is evaluated, and these probabilities are put together as a feature vector for Conceptor classification. The reason for this is to refine the subspace using the Gaussian components. Since the probabilities are already in the range [0, 1], there is no need for normalisation. These data are then fed into the generic Conceptor recognition framework.
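The feature construction described above can be sketched roughly as follows. This is only an illustration under several assumptions: it uses the current SciKit-Learn `GaussianMixture` class, it interprets "probability in each Gaussian component" as the posterior responsibility returned by `predict_proba`, it uses diagonal covariances, and it substitutes random numbers for real MFCC frames. The function name `ubm_posterior_features` is hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler


def ubm_posterior_features(train_mfcc, mfcc, n_components=32, seed=0):
    """Fit a UBM on pooled MFCC frames and return per-frame component posteriors.

    Both inputs have shape (n_frames, n_coefficients).
    """
    # Standardise with statistics taken from the UBM training data only.
    scaler = StandardScaler().fit(train_mfcc)
    ubm = GaussianMixture(
        n_components=n_components,
        covariance_type="diag",  # assumption: the post does not state the covariance type
        random_state=seed,
    )
    ubm.fit(scaler.transform(train_mfcc))
    # Each row holds one posterior per component: values in [0, 1], summing to 1.
    return ubm.predict_proba(scaler.transform(mfcc))


# Stand-in data: random vectors instead of real MFCC frames.
rng = np.random.default_rng(0)
fake_mfcc = rng.normal(size=(2000, 13))
features = ubm_posterior_features(fake_mfcc, fake_mfcc[:100])
print(features.shape)
```

Each row of `features` would then be one input vector for the Conceptor classifier.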

This method makes a decision for every MFCC vector (one every 512 ms). One example result (gender detection on a short conversation between a female and a male speaker, where 0 indicates female and 1 indicates male) is noisy and will not be very useful in practice, as we usually want recognition decisions over longer periods (multiple seconds) and with less noise. A simple frequency count does not solve this problem, since the noisy decisions very often overwhelm the correct ones. One way to cope with this is to compute mid-term statistics (mean, std, median, min, max) over short-term features, as I did in the demo code for Gender Identification I submitted before. Although this works, it is not the best way, since much information is lost when the statistics are computed.
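As a rough illustration of the mid-term statistics idea, the sketch below summarises frame-level features over sliding windows. The function name `midterm_stats` and the window and hop sizes are hypothetical choices for the example, not values taken from the demo code.

```python
import numpy as np


def midterm_stats(short_term, win, hop):
    """Summarise frame-level features of shape (n_frames, n_dims) over sliding windows.

    Returns an array of shape (n_windows, 5 * n_dims) holding the
    mean, std, median, min and max of every feature dimension per window.
    """
    rows = []
    for start in range(0, len(short_term) - win + 1, hop):
        seg = short_term[start:start + win]
        rows.append(np.concatenate([
            seg.mean(axis=0),
            seg.std(axis=0),
            np.median(seg, axis=0),
            seg.min(axis=0),
            seg.max(axis=0),
        ]))
    return np.array(rows)


# Stand-in data: 100 frames of 13-dimensional features.
frames = np.random.default_rng(0).normal(size=(100, 13))
stats = midterm_stats(frames, win=20, hop=10)
print(stats.shape)
```

Each output row replaces 20 frame-level vectors with a single summary vector, which is where the information loss mentioned above comes from.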

In the next step, I will try recognition on spectrogram segments, using convolutional neural networks (CNNs) to extract features from these segments and feed them to Conceptors.

To catch up with my planned schedule and provide a working solution for now, I implemented a GMM Speaker Recognition system with state-of-the-art performance. This system consists of the following parts:
An energy-based voice activity detection and silence removal function.
A set of GMMs given by SciKit-Learn.
A speaker recognition interface, which includes the following functions:
enroll(): enroll new training data
train(): train a GMM for each class
recognize(): read an audio signal and output the recognition results
dump(): save a trained model
load(): load an existing model
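To make the shape of such a system concrete, here is a minimal sketch, not the actual implementation. It assumes the current SciKit-Learn `GaussianMixture` class, works on precomputed feature matrices instead of reading audio directly, and uses a simple relative energy threshold for silence removal. The names `remove_silence` and `GMMRecognizer`, and all parameter values, are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.mixture import GaussianMixture


def remove_silence(signal, frame_len=512, threshold_ratio=0.1):
    """Energy-based VAD: keep frames whose energy exceeds a fraction of the mean energy."""
    n = len(signal) // frame_len
    frames = signal[:n * frame_len].reshape(n, frame_len)
    energy = (frames ** 2).mean(axis=1)
    return frames[energy > threshold_ratio * energy.mean()].ravel()


class GMMRecognizer:
    def __init__(self, n_components=8, seed=0):
        self.n_components = n_components
        self.seed = seed
        self.data = {}    # label -> list of (n_frames, n_dims) feature arrays
        self.models = {}  # label -> fitted GaussianMixture

    def enroll(self, label, features):
        """Store training features for one class."""
        self.data.setdefault(label, []).append(np.asarray(features))

    def train(self):
        """Fit one GMM per enrolled class."""
        for label, chunks in self.data.items():
            gmm = GaussianMixture(n_components=self.n_components,
                                  covariance_type="diag",
                                  random_state=self.seed)
            self.models[label] = gmm.fit(np.vstack(chunks))

    def recognize(self, features):
        """Return the label whose GMM gives the highest average log-likelihood."""
        scores = {label: gmm.score(features) for label, gmm in self.models.items()}
        return max(scores, key=scores.get)

    def dump(self, path):
        """Save the trained models (not the raw training data)."""
        state = {"n_components": self.n_components, "seed": self.seed,
                 "models": self.models}
        with open(path, "wb") as f:
            pickle.dump(state, f)

    @classmethod
    def load(cls, path):
        """Restore a recogniser saved with dump()."""
        with open(path, "rb") as f:
            state = pickle.load(f)
        rec = cls(state["n_components"], state["seed"])
        rec.models = state["models"]
        return rec


# Usage with stand-in data: two well-separated synthetic "speakers".
rng = np.random.default_rng(0)
signal = np.concatenate([np.zeros(5120), rng.normal(size=5120)])
voiced = remove_silence(signal)  # drops the silent first half

rec = GMMRecognizer(n_components=2)
rec.enroll("female", rng.normal(loc=2.0, size=(500, 13)))
rec.enroll("male", rng.normal(loc=-2.0, size=(500, 13)))
rec.train()
test_feats = rng.normal(loc=2.0, size=(50, 13))
prediction = rec.recognize(test_feats)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    rec.dump(path)
    restored = GMMRecognizer.load(path)
```

In a real system, the features passed to `enroll()` and `recognize()` would be MFCC frames extracted from the silence-removed signal.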

The usage and performance of this interface are demonstrated in the following two IPython notebooks:
Gender: a gender identifier trained on about 5 minutes of signal for each gender
Obama: a speaker recogniser trained on 7 minutes of Obama and 40 seconds of David Simon