Automatic Speech Recognition Inference
The inference engine on this website allows you to test real-time speech recognition inference on an Ubuntu server instance with an Nvidia GPU. The model consists of two convolutional layers of 256 units, three bidirectional recurrent layers of 512 units, and a time-distributed dense output layer. The model was trained using Keras/TensorFlow on an Nvidia GTX 1070 GPU and deployed on a Gunicorn/NGINX web server using Flask in Python.
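The architecture described above can be sketched in Keras. This is a minimal, hedged reconstruction, not the exact training code: the input feature width (161 spectrogram bins), kernel size, choice of GRU cells, and the 29-character output alphabet (26 letters, space, apostrophe, and a CTC blank) are all assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_asr_model(n_features=161, n_classes=29):
    # n_features: assumed spectrogram frequency bins per time step
    # n_classes: assumed character alphabet size (incl. CTC blank)
    inputs = layers.Input(shape=(None, n_features))
    x = inputs
    # Two 1D convolutional layers with 256 filters each
    for _ in range(2):
        x = layers.Conv1D(256, kernel_size=11, padding='same', activation='relu')(x)
        x = layers.BatchNormalization()(x)
    # Three bidirectional recurrent layers with 512 units each
    for _ in range(3):
        x = layers.Bidirectional(layers.GRU(512, return_sequences=True))(x)
        x = layers.BatchNormalization()(x)
    # Time-distributed dense layer: per-timestep character probabilities
    outputs = layers.TimeDistributed(layers.Dense(n_classes, activation='softmax'))(x)
    return Model(inputs, outputs)
```

The variable-length time axis (`shape=(None, n_features)`) lets the same graph run inference on utterances of any duration.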
I have applied to multiple jobs in data science in the Pacific Northwest but have so far been unsuccessful at finding somebody willing to take a chance on someone with a non-traditional background. I've heard a lot of lip service about diversity in the tech field, but my job search has shown such efforts to be few and far between. Microsoft's commitment to building accessible technologies, and to diversity in the development of those technologies, is very much in line with my interest in building data-driven initiatives for social good. I can help Microsoft build technologies that help people achieve their dreams and empower the people of the world with the data needed to make informed decisions about their lives.
The initial idea for this project was to explore the possibility of building a speech recognition platform that could identify keywords in conversational speech, link them to therapeutic interventions, and then conduct sentiment or emotion analysis to give real-time feedback on client responses to cognitive behavioral talk therapy interventions. The project was initially built for the Nvidia Jetson embedded computing platform for AI at the edge and deployed at heyjetson.com, but unfortunately, my puppy ate my Jetson. This serves as a less restricted extension of my Hey, Jetson! project, found at github.com/bricewalker/Hey-Jetson.
My goal was to build a character-level ASR system using a recurrent neural network in TensorFlow that can run inference on an Nvidia Pascal/Volta-based GPU with a word error rate below 25%.
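Word error rate, the target metric above, is the word-level edit distance between the reference transcript and the model's hypothesis, divided by the number of reference words. A minimal sketch of the standard dynamic-programming computation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i reference words and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match or substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion over six reference words, giving roughly 0.33.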
The primary dataset used is the LibriSpeech ASR corpus, which includes 1,000 hours of recorded speech; the final model was trained on a 960-hour subset. The dataset consists of 16kHz audio files, between 10 and 15 seconds long, of spoken English derived from read audiobooks from the LibriVox project. An overview of some of the difficulties of working with data such as this can be found here.
The training and production deployment server contains an Intel Core i7-7700K overclocked to 4.8GHz, 32GB of RAM clocked at 2400MHz, and an Nvidia GTX 1070 clocked at 1746MHz (1,920 Pascal cores).
Feature Extraction and Engineering
There are three primary methods for extracting features for speech recognition: raw audio waveforms, spectrograms, and MFCCs. For this project, I have created a character-level sequence-to-sequence model using spectrograms. This allows me to train a model on a dataset with a limited vocabulary that can generalize better to unique or rare words. This comes at the cost of a model that is more computationally expensive, more difficult to interpret and understand, and more susceptible to vanishing or exploding gradients, as the sequences can be quite long.
Raw Audio Waves
This method uses the raw waveforms of the audio files, treating each recording as a 1D vector X = [x1, x2, x3, ...].
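Concretely, a raw waveform is just a vector of amplitude samples, one per time step. A minimal sketch using a synthesized tone (the 440 Hz frequency and one-second duration are arbitrary; 16 kHz matches the corpus sample rate):

```python
import numpy as np

# One second of a 440 Hz sine tone sampled at 16 kHz, the corpus sample rate.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
x = np.sin(2 * np.pi * 440 * t).astype(np.float32)

# x is the 1D vector X = [x1, x2, x3, ...]: one amplitude sample per time step.
print(x.shape)  # (16000,)
```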
Spectrograms
This method transforms the raw audio waveforms into a 2D tensor where the first dimension corresponds to time (the horizontal axis), and the second dimension corresponds to frequency (the vertical axis) rather than amplitude. We lose a little bit of information in this conversion process as we take the log of the power of the FFT. This can be written as log |FFT(X)|^2. The full transformation process is documented here.
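The log |FFT(X)|^2 transformation above can be sketched with NumPy alone. This is an illustrative implementation, not the project's actual pipeline: the 20 ms frame length, 10 ms hop, and Hann window are common defaults assumed here.

```python
import numpy as np

def log_power_spectrogram(signal, frame_len=320, hop=160, eps=1e-10):
    """Compute log |FFT(frame)|^2 over sliding windows: (time, frequency)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    # Slice the signal into overlapping, windowed frames
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)        # per-frame FFT (real input)
    return np.log(np.abs(spectrum) ** 2 + eps)    # log power; eps avoids log(0)

# One second of a 440 Hz tone at 16 kHz yields a (99, 161) time-frequency tensor
sr = 16000
sig = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
S = log_power_spectrogram(sig)
```

Taking the log compresses the dynamic range of the power values, which is where the small loss of information mentioned above comes from: the phase of the FFT is discarded entirely.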