Comparing Google recognition with Sphinx4

This article uses a Java program that requests Google recognition for audio files.

It also includes a similar program to recognize the same audio using Sphinx4. With these two programs, we can compare the recognition results for Google and Sphinx4.

The two Java programs and required data files can be downloaded here. They are also included as part of the Jaivox library, see downloads. The audio files are provided in both WAV and FLAC formats. (The WAV files were converted to FLAC using SoX.)

The Google program knows nothing about the topic of the statements here. The recordings come from our application examples, and the Sphinx program uses that information: the language model road.arpabo.DMP, included with the downloads, tells Sphinx about the limited number of strings that the application can understand.

The test shows that with a limited vocabulary, as here, Sphinx performs slightly better than the more general Google recognizer. Details of this test are given below.

Setting up the experiment

Google recognition is not available through an official API, but rather through a request format that some programmers have found. Google does not accept the WAV format generally used with Sphinx4, so recognizing WAV files with Google involves first converting them to the FLAC format. If you do not want to bother with Java, there is a simple shell script that you can use to recognize audio from WAV files.

The downloaded files contain an audio directory with 33 audio recordings. Each recording is given in both the WAV and the FLAC formats. The Java programs are compiled with javac *.java, assuming that you have the Sphinx4 class files in your CLASSPATH.

Running the programs

To run the Google recognizer program:

java testGoogle

This reads each FLAC file from the audio directory, sends it to a Google URL using an HTTP POST request, and gets back a JSON-format response. The response is then searched to find the recognized string.

The results are then printed, one line per input file, as follows:
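The "search the response" step might look roughly like the sketch below. It is not the code from testGoogle. The class name ResponseSearch, the method firstUtterance, and in particular the JSON field name "utterance" are assumptions made for illustration; check them against the response you actually receive. The sketch uses a plain regular expression rather than a JSON parser, so it does not handle escaped quotes inside the recognized string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the download.
final class ResponseSearch {
    // ASSUMPTION: the recognized text is stored in a string field named "utterance".
    private static final Pattern UTTERANCE =
            Pattern.compile("\"utterance\"\\s*:\\s*\"([^\"]*)\"");

    // Returns the first recognized string found in the response,
    // or an empty string if the response contains no such field.
    static String firstUtterance(String json) {
        Matcher m = UTTERANCE.matcher(json);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        // Made-up response used only to exercise the helper.
        String sample = "{\"hypotheses\":[{\"utterance\":\"are roads busy\"}]}";
        System.out.println(firstUtterance(sample));
    }
}
```

Returning an empty string when nothing is found keeps the output at one line per input file even when recognition fails.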

001	does the road gets slow at this time
002	the roads get congested at this time
003	are roads busy
004	do you think the roads are slow
005	do you think Elmwood Avenue slow
...

(There should be 33 lines in the output.)

Similarly, you can run the test using Sphinx4 with:

java testSphinx

This also prints out the results, one line for each file:

001	do the roads get slow at this time
002	do the routes get congested at this time
003	are roads busy
004	do you think the roads are slow
005	do you think elmwood avenue is slow
...

Examining the results

There are three text files included with the download.

  1. recorded.txt contains the actual text that was spoken to make the recordings.
  2. google.txt contains the results from running testGoogle.
  3. sphinx.txt contains the results from running testSphinx.

We can compare the original recording and the two recognized results using an edit distance measurement. For example, you can use the Levenshtein distance between two sentences, computed over their words rather than their characters. The program compare.java included in the download does this comparison.
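A word-level Levenshtein distance can be sketched as follows. This is an illustration of the technique, not the contents of compare.java; the class name WordEditDistance and its methods are hypothetical. Each insertion, deletion, or substitution of a whole word costs 1, and sentences are lower-cased before being split on whitespace.

```java
import java.util.Locale;

// Hypothetical illustration, not the compare.java from the download.
final class WordEditDistance {
    // Splits a sentence into lower-cased words on runs of whitespace.
    static String[] words(String sentence) {
        String trimmed = sentence.trim().toLowerCase(Locale.ROOT);
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Levenshtein distance where the unit of editing is a whole word.
    static int distance(String reference, String hypothesis) {
        String[] a = words(reference);
        String[] b = words(hypothesis);
        int[] prev = new int[b.length + 1]; // distances for the previous row
        int[] curr = new int[b.length + 1]; // distances for the current row
        for (int j = 0; j <= b.length; j++) {
            prev[j] = j; // turning an empty prefix into j words takes j insertions
        }
        for (int i = 1; i <= a.length; i++) {
            curr[0] = i; // turning i words into an empty prefix takes i deletions
            for (int j = 1; j <= b.length; j++) {
                int substitution = prev[j - 1] + (a[i - 1].equals(b[j - 1]) ? 0 : 1);
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                curr[j] = Math.min(substitution, Math.min(insertion, deletion));
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length];
    }

    public static void main(String[] args) {
        // ASSUMPTION: the spoken sentence equals the Sphinx result for file 001.
        String reference = "do the roads get slow at this time";
        String google = "does the road gets slow at this time";
        System.out.println(distance(reference, google)); // prints 3
    }
}
```

The three differing words here are "does" for "do", "road" for "roads", and "gets" for "get", each counted as one substitution.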

The comparison gives the edit distance of the Google result and the Sphinx result for each of the recorded sentences in recorded.txt. The results look like the following.

001	3	0
002	2	1
003	0	0
004	0	0
005	1	0
...

For example, this table says that on the first file, Google had three variations from the original sentence while Sphinx had none. There are later examples where Sphinx has more errors than Google. But if we add up all the errors for both Google and Sphinx, we see that on the 33 examples, Google had a total of 29 errors while Sphinx had a total of 22 errors. All strings are converted to lower case before comparison, so the errors do not reflect differences in case.
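Adding up the two columns can be sketched as below. The class ErrorTotals is hypothetical and is not part of the download; it assumes each result line has exactly three whitespace-separated fields (file id, Google errors, Sphinx errors) and skips anything else. The sample data are only the five rows shown above, so the totals printed are for those rows, not for the full set of 33.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for summing the comparison table.
final class ErrorTotals {
    // Returns {googleTotal, sphinxTotal} for lines of the form "id google sphinx".
    static int[] totals(List<String> lines) {
        int google = 0;
        int sphinx = 0;
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length != 3) {
                continue; // skip blank lines and anything that is not a table row
            }
            google += Integer.parseInt(fields[1]);
            sphinx += Integer.parseInt(fields[2]);
        }
        return new int[] {google, sphinx};
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "001\t3\t0", "002\t2\t1", "003\t0\t0", "004\t0\t0", "005\t1\t0");
        int[] t = totals(sample);
        System.out.println("Google " + t[0] + ", Sphinx " + t[1]); // Google 6, Sphinx 1
    }
}
```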

Some errors are not very serious. For example, Google may say "oh" instead of "O" (as in Avenue O). Google may say "free way" while Sphinx, being limited in the vocabulary it can use, says "freeway." The error counts above, however, ignore these nuances.