Speech recognition does not work well with noise. This article tests a noisy speech corpus against Google's recognizer.
After many decades of development, speech recognition can now be considered largely a solved problem. But there are still common situations where it performs poorly, and noisy environments are chief among them. Unfortunately mobile phones, where speech recognition is a convenient way to enter information, are often used in exactly such conditions.
An MIT thesis from 2009 (by Tara Sainath) states that speech recognition has significant problems with noise. Noise levels are generally described by the signal to noise ratio (SNR) in decibels. Sainath states that speech recognizers (at that time) typically had a 40% error rate on speech with 16 dB of noise, while humans can recognize even 0 dB speech with only a 1% error rate. Thus in terms of handling noise, speech recognition has a long way to go to catch up with human listeners.
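As a reminder of what those decibel figures mean (this snippet is my own illustration, not from the thesis): the SNR in dB is 10 times the base-10 logarithm of signal power over noise power, so 0 dB means the noise is exactly as loud as the speech.

```java
// Illustration only: SNR in decibels from signal and noise power.
public class Snr {
    static double snrDb(double signalPower, double noisePower) {
        return 10.0 * Math.log10(signalPower / noisePower);
    }

    public static void main(String[] args) {
        // Equal signal and noise power gives 0 dB.
        System.out.println(snrDb(1.0, 1.0));   // 0.0
        // Signal 100x stronger than noise gives 20 dB.
        System.out.println(snrDb(100.0, 1.0)); // 20.0
    }
}
```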
I tried to classify noise in a 2010 paper on the semisupervised classification of audio noise. The idea there was to train a separate speech recognizer for each type of noise; if we can determine the type of noise, we can then use the matching recognizer. That approach did not get very far, since I was not able to classify all the noise types accurately.
This article describes a simpler attempt. Even if a speech recognizer does not recognize everything, it may still be possible to reconstruct the intended input. I wanted to find out what happens if we try to recover the intended input from whatever the Google recognizer manages to output. In an earlier article, Improving the Google speech recognizer, we described a procedure for recovering from some errors. I apply the same technique here to some noisy speech samples.
As in the earlier paper I used the Noizeus corpus from the University of Texas at Dallas. The data is described in the paper: Hu, Y. and Loizou, P. (2007). “Subjective evaluation and comparison of speech enhancement algorithms,” Speech Communication, 49, 588-601. Please cite this reference if you publish anything using this corpus.
The data can be downloaded from the above link. It consists of thirty recordings in various forms. There is a clean version with no noise. The noisy samples are provided with eight types of noise at signal to noise ratios of 0 dB, 5 dB, 10 dB and 15 dB (0 dB is the noisiest; the 15 dB samples are the least noisy and give the best recognition results). Altogether there are 990 wav files in the collection: 30 clean recordings plus 30 × 8 × 4 = 960 noisy ones.
To use the sample program, download the sample code. This includes a directory called data. If you download the audio files as above, place them in the appropriate directories under data/audio. To show where files should go, I have put one audio file, sp01.wav, in data/audio/clean.
This requires version 0.7 of the Jaivox library. You can download it here.
I tested whether the Google recognizer can recognize noisy speech by passing the sound files through the recognizer (as we had done earlier with a larger set of sentences).
The code consists of two classes, process.java and noisedata.java. The latter creates recognition results for each audio file. Results for all audio files with the same noise type and noise level are placed into the same text file in data/asr. For example, data/asr/restaurant_15.txt contains the results of processing each of the thirty recordings containing restaurant noise at 15 dB.
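The bookkeeping step can be sketched as follows. This is not the actual noisedata.java code; it is a minimal illustration, and the file naming convention (e.g. sp05_restaurant_sn15.wav) is my assumption about the corpus layout, so adjust it to match the files you download.

```java
// Sketch: map a noisy wav file name to the results file that collects
// all outputs for its noise type and level, e.g.
//   "sp05_restaurant_sn15.wav" -> "data/asr/restaurant_15.txt"
// The name format "spNN_<noise>_snNN.wav" is an assumption.
public class NoiseFiles {
    static String resultsFile(String wavName) {
        String base = wavName.replaceAll("\\.wav$", "");
        String[] parts = base.split("_");     // e.g. [sp05, restaurant, sn15]
        String noise = parts[1];
        String level = parts[2].substring(2); // drop the "sn" prefix
        return "data/asr/" + noise + "_" + level + ".txt";
    }

    public static void main(String[] args) {
        System.out.println(resultsFile("sp05_restaurant_sn15.wav"));
    }
}
```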
Since it takes a long time to run each recording through the Google recognizer, I have collected all of the results I obtained in the appropriate files. I did not process the 0 dB examples, since even the 5 dB samples are mostly not recognized at all.
The table below shows the recognition errors for each type of data and each noise level. We first note the number of samples containing any error at all. Since many errors can be corrected with our phonetic matching method, the last column shows the number of errors remaining after this correction.
Without correction the results are pretty dismal. There are 30 sentences; with 15 dB of airport noise, for example, nine out of ten of them had at least one error. This is far worse than the expected 40% error rate.
We can correct some of the errors using phonetic matching, knowing that each recording is one of a known set of 30 sentences. With this correction we get 6/30 errors, i.e. 1/5 or 20%. This is better than the expected 40%, but note that we are using some extra information about the recordings.
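The correction step can be sketched like this. The Jaivox code matches phonetically; in this simplified illustration, plain Levenshtein (edit) distance on the text stands in for the phonetic comparison, and the sentences are just examples.

```java
// Sketch of the correction step: since every recording is one of a known
// set of sentences, map the recognizer's (possibly garbled) output to the
// closest known sentence. Edit distance here is a stand-in for the
// phonetic matching used in the actual Jaivox code.
public class Corrector {
    // Standard dynamic-programming Levenshtein distance.
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        return d[a.length()][b.length()];
    }

    // Pick the known sentence closest to the recognizer's output.
    static String correct(String asrOutput, String[] knownSentences) {
        String best = knownSentences[0];
        int bestDist = Integer.MAX_VALUE;
        for (String s : knownSentences) {
            int dist = editDistance(asrOutput, s);
            if (dist < bestDist) { bestDist = dist; best = s; }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] known = {
            "the birch canoe slid on the smooth planks",
            "glue the sheet to the dark blue background"
        };
        // A garbled recognition result snaps back to the nearest sentence.
        System.out.println(correct("the birch can you slid on the smooth planks", known));
    }
}
```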
The full table is shown below.
The table shows that noisy speech is hard to recognize. With 5 dB noise almost everything is incorrect, even after the correction attempt above. We can recover about a third of the errors in the case of 10 dB noise.