Improving the Google Speech Recognizer

The Google Speech Recognizer works remarkably well. But it still makes enough errors to impede its use. This article describes a simple way to post-process the recognizer's output to get much better results.

By Vijay John and Thomas John March 10, 2014

A recent article, Speech Recognition, the unreachable frontier, by James Kendrick notes that speech recognition is still not good enough for common use. While that observation presumably focused on English, last year Vijay analyzed Google's speech recognition for ten languages and found that even though many individual words are recognized correctly, Google's system makes enough errors to hinder the recognition of whole sentences. Earlier last year, in Benchmarking Google's Speech Recognition Web Service, Julius Adorf examined recognition accuracy for 1444 audio recordings from a Harvard database of 720 sentences (Wikipedia description). He found that while about 80% of the words were recognized correctly, only about 20% of the sentences were transcribed without error: too many sentences had at least one mistake. (We should note that the Google recognizer is improving all the time, so the same tests may get better results now.) This may explain why many people seem to stop using automatic speech recognition for routine tasks after the novelty wears off.

Even though these observations suggest that speech recognizers are not reliable, we feel that some simple tactics can improve their performance dramatically. This is especially true for applications with speech interfaces, where the recognizer generally has to identify commands, tied to the application's functions, from a limited set of sentences. In contrast to the observations above, we have found that the Google recognizer's output can be post-processed with a simple method to get nearly perfect results. For example, on the list of sentences from the Harvard database, we can correctly identify the intended sentence from the somewhat garbled recognizer output in over 99% of the cases. (Thanks to Julius for sharing his data with us.)

While this test focuses on the Google recognizer, the method applies to all speech recognizers. If a recognizer identifies most of the words correctly (Google's identifies about 80% of words in the test described above), then the method described here will probably be useful. Even though the test here is based on recognizing the right sentence from a list of 720 sentences, there are ways to overcome this limitation, as we note later in this article.

Recipe for improving recognition results

The data from Julius Adorf's test involve a limited set of several hundred strings. It is possible to extend this method to unlimited texts, as described later in this article.

The following method works in situations where a speech recognizer needs to pick the right string from a known list.

  1. Download and install the Jaivox library version 0.7 (i.e., make the classes in the library available in your classpath).
  2. Collect the list of all the questions to be recognized into a string array.
  3. Create a com.jaivox.interpreter.PhoneMatcher using this array of strings.
  4. For any recognized string, call PhoneMatcher.findBestMatchingSentences, which returns a list of results.
  5. Usually the first result from this will be the corrected version of the recognized string.

An implementation of this using the Harvard sentences can be found in the Jaivox 0.7 download in the apps/fixerrors directory.
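In code, the recipe looks roughly like the minimal sketch below. We are assuming, from the description above, that the PhoneMatcher constructor takes the array of valid sentences and that findBestMatchingSentences returns a list of candidate sentences ordered by match quality; the fixerrors example in the download shows the exact signatures.

    import com.jaivox.interpreter.PhoneMatcher;
    import java.util.List;

    public class FixRecognition {
        public static void main (String args []) {
            // Step 2: collect the sentences the application should accept.
            String sentences [] = {
                "tuck the sheet under the edge of the mat",
                "a cold dip restores health and zest"
                // ... the remaining Harvard sentences
            };
            // Step 3: build the matcher from the valid sentences.
            PhoneMatcher matcher = new PhoneMatcher (sentences);
            // Step 4: find the sentences phonetically closest to the
            // (possibly garbled) recognizer output.
            String recognized = "tuck the sheet under age of the night";
            List<String> results = matcher.findBestMatchingSentences (recognized);
            // Step 5: the first result is usually the corrected sentence.
            if (!results.isEmpty ()) System.out.println (results.get (0));
        }
    }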

How it works

The method is just an application of the well-known Edit Distance. The main variation is that the edit distance is not applied to the text of the two strings. Instead, the strings are converted to sequences of phonemes (sound units), and the edit distance is computed on these sequences.
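Concretely, this is the familiar dynamic-programming computation, with phonemes instead of characters as the units being inserted, deleted, or substituted. A minimal version (our illustration, not the library's implementation) looks like this:

    // Edit distance over phoneme sequences: each insertion, deletion,
    // or substitution of one phoneme costs 1, exactly as in the usual
    // character-based Levenshtein distance.
    static int phonemeEditDistance (String [] a, String [] b) {
        int [][] d = new int [a.length + 1][b.length + 1];
        for (int i = 0; i <= a.length; i++) d [i][0] = i;
        for (int j = 0; j <= b.length; j++) d [0][j] = j;
        for (int i = 1; i <= a.length; i++) {
            for (int j = 1; j <= b.length; j++) {
                int cost = a [i - 1].equals (b [j - 1]) ? 0 : 1;
                d [i][j] = Math.min (d [i - 1][j - 1] + cost,
                           Math.min (d [i - 1][j] + 1, d [i][j - 1] + 1));
            }
        }
        return d [a.length][b.length];
    }

For example, phonemeEditDistance applied to the token arrays of "sh iy" and "s iy" returns 1: a single substitution.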

We fixed some parts of the data before running our tests. The data table contained four sentences (sentences 64, 94, 109 and 132; we count from 0) where the recognition matched some other sentence nearly perfectly. We assumed that these were transposed entries and corrected them. Another correction concerned numbers: the recognized results generally contained numerical characters, while the original text wrote the numbers out as words, so we converted the digits to words. The corrected data was then converted to phonemes.

Text can be converted to phonemes using a Phonetic Decomposer. All speech recognizers contain some phonetic decomposition method, though we do not have access to the one used by Google. Usually speech recognition involves finding sequences of phonemes (along with some likelihood estimates); this information is then used to find the words that could have produced those phonemes. We do not have access to the phonemes that were considered most likely for the Harvard sentences. But we can find the phonemes corresponding to the words that were recognized and those corresponding to the original sentences.

In many languages, you can find the phonemes from text using a set of rules. We have demonstrated this in the past, using a decomposition method along with some rules, to find phonemes for twenty-two languages. English is harder: we need to consider many different pronunciation variations.

For English, one way to get phonemes is to use a phonetic dictionary like The CMU Pronouncing Dictionary. For each word in both the recognized sentence and the original sentences, we can replace the word with the phonemes given in this dictionary. However, not all words will be in the dictionary. This problem can be handled with a Text to Phoneme Converter that uses patterns from the CMU dictionary to guess the phonemes of words not in the dictionary. One inconvenience of this Text to Phoneme converter is that it is implemented in Perl. We have ported it to Java within the Jaivox library, so you can get the phonemes corresponding to English strings using the TextToPhoneme class in this library.
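The dictionary lookup itself is simple. The sketch below reads the plain-text cmudict format (each entry line is a word followed by its phonemes) and falls back to a guesser for out-of-vocabulary words; the guess method here is only a placeholder for a rule-based converter such as Jaivox's TextToPhoneme, whose exact interface we leave to the library's documentation.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.HashMap;
    import java.util.Map;

    public class PhonemeLookup {
        Map<String, String> dict = new HashMap<String, String> ();

        // Load the CMU dictionary: each entry line is like "HELLO  HH AH0 L OW1".
        void load (String path) throws Exception {
            BufferedReader in = new BufferedReader (new FileReader (path));
            String line;
            while ((line = in.readLine ()) != null) {
                if (line.startsWith (";;;")) continue; // skip comment lines
                int split = line.indexOf (' ');
                if (split > 0) {
                    dict.put (line.substring (0, split).toLowerCase (),
                              line.substring (split).trim ());
                }
            }
            in.close ();
        }

        // Replace each word in a sentence with its phonemes.
        String sentenceToPhonemes (String sentence) {
            StringBuilder sb = new StringBuilder ();
            for (String word : sentence.toLowerCase ().split ("\\s+")) {
                String phones = dict.get (word);
                if (phones == null) phones = guess (word); // not in the dictionary
                sb.append (phones).append (' ');
            }
            return sb.toString ().trim ();
        }

        // Placeholder: a real guesser, like TextToPhoneme, returns phonemes.
        String guess (String word) {
            return word;
        }
    }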

Using the text to phoneme decomposition method, we can try to find the right matches for recognized strings.

  1. Create two arrays: correct contains the correct version of each sentence in the list, and recognized contains the recognized string corresponding to each correct sentence.
  2. Use TextToPhoneme to obtain two arrays: correct_phones contains the phonemes corresponding to each of the correct sentences, and recognized_phones contains the phonemes for the recognized strings.
  3. For each string of phonemes recognized_phones [i], find the string correct_phones [j] that is closest to it in edit distance (sketched in code below).
  4. Check whether the entry correct [j] corresponding to the closest correct_phones [j] is equal to correct [i]. Count any case where these are not equal as an error.
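Steps 3 and 4 reduce to a nearest-neighbor search. The snippet below sketches it using the phonemeEditDistance function from the earlier sketch; correct, correct_phones and recognized_phones are the String arrays built in steps 1 and 2, with each phoneme string split into an array of phoneme tokens.

    int errors = 0;
    for (int i = 0; i < recognized_phones.length; i++) {
        String [] rec = recognized_phones [i].split (" ");
        int best = 0;
        int bestDistance = Integer.MAX_VALUE;
        for (int j = 0; j < correct_phones.length; j++) {
            int distance = phonemeEditDistance (rec, correct_phones [j].split (" "));
            if (distance < bestDistance) { // ties keep the first candidate
                bestDistance = distance;
                best = j;
            }
        }
        // Step 4: count an error when the nearest sentence is not the original.
        if (!correct [best].equals (correct [i])) errors++;
    }
    System.out.println (errors + " errors in " + recognized_phones.length + " sentences");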

We found that this method makes five errors (including one instance where the correct sentence and a mistaken selection are at the same edit distance). Thus in 1439 of the 1444 cases in the list, the method found the right match from the recognized string. Instead of getting the right result for 20% of the sentences, we now get the right result 99.65% of the time.

Practical aspects of using this method

As noted, we get the right result in all but five of the cases. In two of these cases (including the one mentioned above), there is an ambiguity: multiple results share the same edit distance.

In a real application, we do not always know whether we got the right result. We can examine the results to see if the answer could be the right one. One simple way is to check whether the result and the recognized string have words in common. However, we usually find it more useful to compare the recognized and result strings in terms of their phonemes: we can look at how the phonemes in the recognized string correspond to phonemes in the matched string. For example, one of the cases where the edit distance does not work well is sentence 910, "a cold dip restores health and zest." Google recognizes this as "ecole de personas how to invest," which is not even entirely English. The closest match by edit distance, at distance 14, is the sentence "she called his name many times."

The table below compares the phonemes in the recognized string "ecole de personas how to invest" and the closest sentence "she called his name many times."

    recognized: ix k ow l d p er s ax n ax s t ow ix n v eh s td
    closest:    sh k ao l d ax     s     n ax m m ax     n iy t ay m z

We can see how the phonemes line up between the recognized string (top row) and the string closest in edit distance (bottom row). Matched phonemes appear in the same position in both rows; the remaining phonemes in each row have no counterpart in the other. Notice the blank stretches in the lower row: at these points the closest string has to insert gaps to line up with the recognized string. Generally this is an indication that the closest string may not be the right one.

We can make a similar comparison between the recognized string and another candidate, which happens to be the correct sentence, "a cold dip restores health and zest." The edit distance here is 15, while the lowest edit distance was 14 as noted above. But in this case the phonemes in the recognized and candidate strings line up better, even though the edit distance is larger.

    recognized: ix k ow l     d     p er     s ax n ax s t ow ix     n v eh     s td
    candidate:  ax k ow l d d ih p r eh s t ao r s eh l th ae n d z ix s td

In this example, we see generally better alignment between the two strings. You can measure this alignment with evaluation methods that are not exactly the same as edit distance; we can also use dynamic programming to get good fits between the recognized and matched strings, though these methods are more complex. Notice also that while there are some blank spaces in the recognized string, this is more plausible than blank spaces in the result, since speech recognizers often miss some of the sounds that are said (and sometimes the fault is really with the speaker, who does not pronounce all the sounds correctly).
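One simple proxy for alignment quality is the number of gaps forced into the candidate row, which can be read off the same dynamic-programming table used for the edit distance. This sketch (our illustration, not a Jaivox function) counts recognized phonemes that end up with no counterpart in the candidate:

    // Count the gaps the candidate row needs in order to line up with
    // the recognized row: while backtracking through the edit distance
    // table, every step that consumes a recognized phoneme without
    // consuming a candidate phoneme is such a gap.
    static int candidateGaps (String [] rec, String [] cand) {
        int n = rec.length, m = cand.length;
        int [][] d = new int [n + 1][m + 1];
        for (int i = 0; i <= n; i++) d [i][0] = i;
        for (int j = 0; j <= m; j++) d [0][j] = j;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = rec [i - 1].equals (cand [j - 1]) ? 0 : 1;
                d [i][j] = Math.min (d [i - 1][j - 1] + cost,
                           Math.min (d [i - 1][j] + 1, d [i][j - 1] + 1));
            }
        }
        int i = n, j = m, gaps = 0;
        while (i > 0 || j > 0) {
            if (i > 0 && j > 0 && d [i][j] ==
                d [i - 1][j - 1] + (rec [i - 1].equals (cand [j - 1]) ? 0 : 1)) {
                i--; j--;    // matched or substituted phoneme
            }
            else if (i > 0 && d [i][j] == d [i - 1][j] + 1) {
                i--; gaps++; // recognized phoneme with no counterpart
            }
            else {
                j--;         // candidate phoneme skipped instead
            }
        }
        return gaps;
    }

A higher gap count for the closest match, as in the first table above, suggests that the match is suspect even when its edit distance is the smallest.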

Alignment measurements may be complicated, but you do not have to use them. Instead, you can simply reject matches where the edit distance is too large. For example, if we take 10 as the maximum tolerable edit distance, we get 11 cases that exceed it; the numbers of cases exceeding edit distances of 11, 12, 13, 14, 15 and 16 are respectively 8, 8, 7, 6, 6 and 5. Thus even if we treat every case with edit distance over 10 as wrong, we do not get too many errors. In practice, instead of an arbitrary cutoff, it is best to discard results where the edit distance is larger than some proportion of the number of recognized phonemes. If we discard results where the edit distance is more than three fourths of the number of recognized phonemes, then in addition to the five errors we get three suspect cases where the edit distance is too large. Thus in most cases, in the data considered here, the edit distance is not too large. This suggests that in some applications, if the edit distance is small, errors in recognition can be corrected without prompting the user. This will make it easier for people to work with recognizers, since users generally do not like being prompted to confirm their commands.
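In code, such a proportional cutoff is a one-line test; the 0.75 factor here is the three-fourths threshold mentioned above.

    // Reject a match whose edit distance exceeds three fourths of the
    // number of recognized phonemes.
    static boolean acceptable (int editDistance, int recognizedPhonemes) {
        return editDistance <= 0.75 * recognizedPhonemes;
    }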

Even though this example deals with English recognition, we expect similar results in other languages. We already have a method for creating phonetic decompositions from text in various languages.

One convenience of this method for application developers is that the results are improved through post-processing; it does not require modifying closed-source recognizers like Google's. However, it would be much better if we had access to the sequence of phonemes produced by the recognizer. For various languages we have used Sphinx 3, an open-source recognizer that can produce phonemes as output. This was used in our tests involving various languages, where we were able to get reasonable results from recognizers trained on just a few hundred sentences.

Going beyond limited texts

In a voice-enabled application, a user is expected to ask the application to do something. This usually restricts the user's commands to a limited set of strings (here, limited may mean a few thousand options). For example, the OK Google support page lists just a few commands (with some possible variations). In voice-supported interactive voice response systems, the number of things the user can say is similarly limited (for example, a travel application may let the user say "I would like a flight to Boston on the 22nd", with many options for destinations and dates). Many form letters are made up from combinations of a few thousand sentences, so the correction method above may be adequate for handling voice input of these letters.

In some applications it may look like a speech recognizer needs to handle unlimited text, but there are ways to restrict the options. One such application is generating subtitles for video where the script is available as text. This alignment problem needs to consider only a limited part of the script at each point in the corresponding audio. Another situation involves voice applications where the number of options is limited by context. For example, in a medical application where a patient talks to a virtual doctor, the recognizer can limit the number of things it needs to understand based on the current state of the diagnosis.

Even though many situations involve limited texts, some applications need to go beyond this. We can extend the correction method here by augmenting it with some search functions. If you look at a lot of examples of bad recognition results (as we have), you will see that a large part of the problem is with semantics. Good recognizers such as Google's do recognize most of the words correctly. But often, they put the words together into sentences that do not make sense.

For example, consider the first sentence in the test data, "Tuck the sheet under the edge of the mat." Google recognized this as "Tuck the sheet under age of the night." The recognized sentence does not make sense because it does not describe a situation that would happen in the real world.

There is a way to check whether a sentence has a plausible real-world meaning: search for it online. When we searched for the phrase "sheet under the age of the night" online, we could not get any matches, while we could get matches for "sheet under the edge of the mat."

Note that online search is not a foolproof technique for determining whether a particular sentence describes a plausible real-world situation. Search engines ignore some results that may be on the web, so if a search for a particular phrase yields no results, that doesn't guarantee that the phrase is meaningless. Also, a search could match an article, such as this one, containing a phrase that describes an implausible situation.

However, we can create specialized searches to find meaningful sentences that match some of the recognized words or phonemes, instead of simply stringing words (or trigrams of words) together. We have developed some methods to do this. Since each recognizer produces sequences of phonemes (along with some probabilities), some of these methods can be applied at the last stage of recognition, before words and sentences are produced. But even with incorrectly produced words, it is often possible to find what the user intended to say. Once we limit the number of sentences that could make sense in a given context of words, we can proceed to correct some of the errors using the methods described here.