Working with noisy speech

Speech recognition does not work well with noise. This article tests a noisy speech corpus against Google's recognizer.

By Vijay John April 14, 2014

After many decades of development, speech recognition can now be considered largely a solved problem. But there are common situations where it does not work well, and noisy environments are chief among them. Unfortunately mobile phones, where speech recognition is a convenient way to enter information, are often used in exactly such conditions.

An MIT thesis from 2009 (by Tara Sainath) states that speech recognition has significant problems with noise. Noise level is generally described as a signal-to-noise ratio (SNR) in decibels. Sainath states that speech recognizers (at that time) typically had a 40% error rate on speech with 16dB of noise, while humans can recognize even 0dB speech with only a 1% error rate. Thus, in terms of handling noise, speech recognition has a long way to go to catch up with human recognition.
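As a reminder of what these decibel figures mean: SNR is ten times the base-10 logarithm of the ratio of signal power to noise power, so at 0dB the noise is as strong as the speech itself. A minimal sketch:

```python
import math

def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in decibels, from average powers."""
    return 10 * math.log10(signal_power / noise_power)

# 0dB: signal and noise have equal power
print(snr_db(1.0, 1.0))             # 0.0
# 16dB corresponds to signal power about 40x the noise power
print(round(snr_db(40.0, 1.0), 1))  # 16.0
```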

I tried to classify noise in a 2010 paper, Semisupervised classification of audio noise. The idea there was to train a separate speech recognizer on each type of noise; if we could determine the type of noise, we could then use the matching recognizer. That approach did not get far, since I was not able to classify all the noise types accurately.

This article is about a simpler attempt. Even if a speech recognizer does not recognize everything, it may still be possible to reconstruct the intended input. I wanted to find out whether the intended input could be recovered from whatever the Google recognizer manages to output. In an earlier article, Improving the Google speech recognizer, we described a procedure for recovering from some errors. I apply the same technique here to noisy speech samples.

Test data

As in the earlier paper, I used the Noizeus corpus from the University of Texas at Dallas. The data is described in the paper: Hu, Y. and Loizou, P. (2007). “Subjective evaluation and comparison of speech enhancement algorithms,” Speech Communication, 49, 588-601. Please cite this reference if you publish anything using this corpus.

The data can be downloaded from the above link. It consists of thirty recordings in various forms. There is a clean version with no noise. The noisy samples are provided with eight types of noise at signal-to-noise ratios of 0dB, 5dB, 10dB and 15dB (0dB being the noisiest). The 15dB samples are the least noisy and give the best speech recognition results. Altogether there are 990 wav files in the collection: 30 clean recordings plus 30 × 8 × 4 = 960 noisy ones.
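The file count can be verified with a quick sketch. The directory and file names here (clean/, airport_15/, sp01.wav and so on) are illustrative, following the data/audio layout used below, and may not match the corpus download exactly.

```python
# 30 clean recordings, plus 30 recordings for each of
# 8 noise types at 4 SNR levels.
noise_types = ["airport", "babble", "car", "exhibition",
               "restaurant", "station", "street", "train"]
snr_levels = [0, 5, 10, 15]  # in dB

files = [f"clean/sp{i:02d}.wav" for i in range(1, 31)]
for noise in noise_types:
    for snr in snr_levels:
        files += [f"{noise}_{snr}/sp{i:02d}.wav" for i in range(1, 31)]

print(len(files))  # 990 = 30 + 30 * 8 * 4
```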


To use the sample program, download the sample code. This includes a directory called data. If you download the audio files as above, then place them in the appropriate directory in data/audio. To show where files should go, I have put one audio file, sp01.wav, in data/audio/clean.

This requires version 0.7 of the Jaivox library. You can download it here.

I tested whether the Google recognizer can recognize noisy speech by passing the sound files through the recognizer (as we had done earlier with a larger set of sentences).

The code consists of two parts; the second creates recognition results for each audio file. Results for all audio files with the same noise type and noise level are placed into the same text file in data/asr. For example, data/asr/restaurant_15.txt contains the results of processing each of the thirty recordings containing restaurant noise at 15dB.
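The batch step can be sketched as follows. The recognizer call is left as a parameter, since the actual recognition is done through the Google recognizer via the Jaivox library; the file naming simply mirrors the data/asr layout described above.

```python
import os

def process_condition(recognize, noise, snr,
                      audio_root="data/audio", asr_root="data/asr"):
    """Run the 30 recordings for one noise type and level through a
    recognizer function, collecting the results in one text file,
    e.g. data/asr/restaurant_15.txt."""
    os.makedirs(asr_root, exist_ok=True)
    out_path = os.path.join(asr_root, f"{noise}_{snr}.txt")
    with open(out_path, "w") as out:
        for i in range(1, 31):
            wav = os.path.join(audio_root, f"{noise}_{snr}",
                               f"sp{i:02d}.wav")
            out.write(recognize(wav) + "\n")
    return out_path
```

Calling process_condition once per noise type and level produces the 24 result files summarized in the table below.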

Since it takes a long time to run each of the recordings through the Google recognizer, I have collected all of the results I obtained in the appropriate files. I have not processed the 0dB examples, since even the 5dB speech is mostly not recognized at all.


The table below shows the recognition errors for each noise type and noise level. The Errors column gives the number of samples (out of 30) containing at least one error. Since many errors can be corrected with our phonetic matching method, the last column gives the number of samples still containing errors after this correction.

Without correction the results are pretty dismal. For example, with 15dB of airport noise, 27 of the 30 sentences (9/10) had at least one error. This is far worse than the 40% error rate quoted above.

We can correct some of the errors using phonetic matching, since we know that each recording is one of a known set of 30 sentences. With this correction, the airport 15dB case drops to 6/30 errors, i.e. 1/5 or 20%. This is better than the expected 40%, but note that we are using some extra information about the recordings.
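Jaivox's actual phonetic matching is described in the earlier article. As a rough illustration only, the sketch below reduces words to crude phonetic keys and picks the closest of the known sentences; the key function and the two example sentences are simplifications, not the method or data actually used.

```python
from difflib import SequenceMatcher

def phonetic_key(word):
    """Very crude phonetic normalization: lowercase, drop vowels
    after the first letter, collapse repeated letters."""
    w = word.lower()
    key = w[:1]
    for c in w[1:]:
        if c in "aeiou" or (key and c == key[-1]):
            continue
        key += c
    return key

def correct(hypothesis, known_sentences):
    """Pick the known sentence whose phonetic keys best match the
    recognizer output."""
    hyp = " ".join(phonetic_key(w) for w in hypothesis.split())
    def score(sentence):
        ref = " ".join(phonetic_key(w) for w in sentence.split())
        return SequenceMatcher(None, hyp, ref).ratio()
    return max(known_sentences, key=score)

sentences = ["the birch canoe slid on the smooth planks",
             "glue the sheet to the dark blue background"]
# A garbled recognition result is mapped back to the nearest sentence.
print(correct("the birds can do slid on the smooth planks", sentences))
```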

The full table is shown below.

Type        SNR     Errors  After correction
airport     15dB    27      6
babble      15dB    27      7
car         15dB    28      9
exhibition  15dB    28      9
restaurant  15dB    27      6
station     15dB    28      7
street      15dB    28      11
train       15dB    29      11
airport     10dB    29      18
babble      10dB    29      21
car         10dB    29      22
exhibition  10dB    29      19
restaurant  10dB    30      20
station     10dB    30      20
street      10dB    29      19
train       10dB    30      23
airport     5dB     30      30
babble      5dB     30      30
car         5dB     30      29
exhibition  5dB     30      29
restaurant  5dB     30      28
station     5dB     30      29
street      5dB     30      30
train       5dB     30      30

The table shows that noisy speech is hard to recognize. With 5dB of noise almost everything is incorrect, even after the above correction attempt. With 10dB of noise we can recover about a third of the errors.
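The "about a third" figure follows directly from the 10dB rows of the table:

```python
# Per-condition counts from the 10dB rows of the table above
errors = [29, 29, 29, 29, 30, 30, 29, 30]  # samples with errors
after  = [18, 21, 22, 19, 20, 20, 19, 23]  # still wrong after correction

recovered = sum(errors) - sum(after)
print(sum(errors), sum(after), recovered)   # 235 162 73
print(round(recovered / sum(errors), 2))    # 0.31, about 1/3
```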