In this post, I explain my work on gender identification and speaker recognition, which is built on two libraries:
Librosa: A Python package for Music and Audio Analysis
SciKit-Learn: Machine Learning in Python
I use librosa to load audio files and extract features from audio signals. I chose it for now because it is a lightweight open-source library with a nice Python interface and IPython support, and it can be integrated with SciKit-Learn to form a feature extraction pipeline for machine learning. This is enough for moderately complex tasks such as speaker recognition.
SciKit-Learn is used for training a UBM/GMM (universal background model / Gaussian mixture model) on MFCC features.
I trained a UBM with 32 Gaussian components on a dataset of standardised MFCC vectors extracted from the speech of multiple female and male speakers.
For every standardised MFCC vector, its probability under each Gaussian component is evaluated, and the 32 values are put together as a feature vector for Conceptor classification. The reason for this is to refine the subspace using the Gaussian components. Because the probabilities already lie in the range [0, 1], there is no need for normalisation. These data are then fed into the generic Conceptor recognition framework.
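The UBM feature step can be sketched roughly as follows. This is a minimal example on random stand-in data, not my actual training code: it reads "probability in each Gaussian component" as the posterior responsibility returned by `predict_proba`, and the diagonal covariance type and the 13-dimensional input are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Stand-in for real MFCC frames: 2000 frames x 13 coefficients of random data.
rng = np.random.default_rng(0)
mfcc_frames = rng.normal(size=(2000, 13))

# Standardise each coefficient to zero mean and unit variance.
mfcc_std = StandardScaler().fit_transform(mfcc_frames)

# Fit a 32-component UBM; diagonal covariances are an assumption.
ubm = GaussianMixture(n_components=32, covariance_type="diag", random_state=0)
ubm.fit(mfcc_std)

# Posterior probability of each of the 32 components for every frame.
# Each row lies in [0, 1] and sums to 1, so no further normalisation is needed.
posteriors = ubm.predict_proba(mfcc_std)
print(posteriors.shape)
```

Each row of `posteriors` would then serve as the 32-dimensional input to the Conceptor classifier.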
This method makes a decision for every MFCC vector (one every 512 ms). An example result of gender detection on a short conversation between a female and a male speaker (0 indicates female, 1 indicates male) looks like this:
In the next step, I will try recognition on spectrogram segments, using convolutional neural networks (CNNs) to extract features from these segments and feed them to Conceptors.
To catch up with my planned schedule and provide a working solution for now, I implemented a GMM speaker recognition system with state-of-the-art performance. This system consists of the following parts:
An energy-based voice activity detection and silence removal function.
A set of GMMs provided by SciKit-Learn.
A speaker recognition interface, which includes the following functions:
enroll(): enroll new training data
train(): train a GMM for each class
recognize(): read an audio signal and output the recognition results
dump(): save a trained model
load(): load an existing model
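The parts listed above could be sketched along the following lines. This is a hypothetical, simplified version rather than the actual implementation: the class name `GMMRecognizer`, the function `remove_silence`, the energy threshold, and the use of pickle for saving models are all assumptions, and `recognize()` here takes a feature matrix instead of reading an audio signal.

```python
import pickle
import numpy as np
from sklearn.mixture import GaussianMixture


def remove_silence(signal, frame_len=512, threshold_ratio=0.1):
    """Drop frames whose energy is below a fraction of the mean frame energy."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)
    keep = energy > threshold_ratio * energy.mean()
    return frames[keep].ravel()


class GMMRecognizer:
    """Hypothetical interface: one GMM per class, best average log-likelihood wins."""

    def __init__(self, n_components=8):
        self.n_components = n_components
        self.features = {}  # label -> list of (n_frames, n_dims) arrays
        self.models = {}    # label -> fitted GaussianMixture

    def enroll(self, label, feats):
        # Collect training features for a class.
        self.features.setdefault(label, []).append(np.asarray(feats))

    def train(self):
        # Fit one GMM per enrolled class on all of its features.
        for label, chunks in self.features.items():
            gmm = GaussianMixture(self.n_components, covariance_type="diag",
                                  random_state=0)
            self.models[label] = gmm.fit(np.vstack(chunks))

    def recognize(self, feats):
        # score() returns the average per-frame log-likelihood under a model.
        scores = {label: m.score(feats) for label, m in self.models.items()}
        return max(scores, key=scores.get)

    def dump(self, path):
        # Save only the trained models.
        with open(path, "wb") as f:
            pickle.dump(self.models, f)

    @classmethod
    def load(cls, path):
        rec = cls()
        with open(path, "rb") as f:
            rec.models = pickle.load(f)
        return rec


# Toy usage with two well-separated synthetic classes instead of real MFCCs.
rng = np.random.default_rng(0)
rec = GMMRecognizer(n_components=2)
rec.enroll("female", rng.normal(loc=-3.0, size=(500, 13)))
rec.enroll("male", rng.normal(loc=3.0, size=(500, 13)))
rec.train()
result = rec.recognize(rng.normal(loc=3.0, size=(50, 13)))
print(result)
```

Scoring by average log-likelihood keeps the decision independent of how long the test signal is, which is one reason to prefer it over a summed score in a sketch like this.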
The usage and performance of this interface are demonstrated in the following two IPython notebooks:
Gender: a gender identifier trained on about 5 minutes of speech for each gender
Obama: a speaker recogniser trained on 7 minutes of speech for Obama and 40 seconds for David Simon