Google's recognizer looks very good at first glance. However, the errors it makes are often so serious that it fails to convey the intended meaning.
Google has developed speech recognizers for many languages. I wanted to see how well these recognizers work, especially when many different languages are spoken by the same person. Since I am interested in learning languages, I decided to test Google's recognition on some of the world's most widely spoken languages.
There is a way to test the Google Recognizer online through a program, even though Google has not announced such a service. To recognize something, you have to submit an audio file in ".flac" format, along with a code indicating the language used in the recording. (For example, English spoken in the US is indicated by the code "en-US").
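Since Google has never documented this service, any client has to be based on observed behavior. As an illustration, a minimal Java client might look like the sketch below; the URL, query parameters, and response handling shown here are assumptions about the unofficial v1 endpoint as it behaved around 2013, and could change or disappear at any time.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Sketch of a client for Google's unofficial speech recognition endpoint.
// The URL, query parameters, and response format are assumptions based on
// how the undocumented service appeared to behave; none of this is official.
public class RecognizeSketch {

    // Build the request URL for a language code such as "en-US" or "fr-FR".
    static String buildRecognizeUrl(String langCode) {
        return "https://www.google.com/speech-api/v1/recognize"
                + "?client=chromium&maxresults=1&lang=" + langCode;
    }

    // POST a 16 kHz ".flac" file and return the raw response body.
    static String recognize(String flacPath, String langCode) throws IOException {
        byte[] audio = Files.readAllBytes(Paths.get(flacPath));
        HttpURLConnection conn =
                (HttpURLConnection) new URL(buildRecognizeUrl(langCode)).openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        // The declared sample rate must match the recording (16000 Hz here).
        conn.setRequestProperty("Content-Type", "audio/x-flac; rate=16000");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(audio);
        }
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java RecognizeSketch some_recording.flac en-US
        System.out.println(recognize(args[0], args[1]));
    }
}
```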
I wanted to see how Google's recognition compared across various languages. To do that, I recorded 20 sentences for each language into individual audio files. I spoke all of these sentences. The languages I chose are: English, French, German, Indonesian, Italian, Japanese, Mandarin Chinese, Portuguese, Russian and Spanish. English is basically my native language, and speakers of almost all of the other languages, out of their great kindness, claim that I speak their language well. (For example, almost every speaker of Mandarin Chinese I have ever met claims that I speak it with the prestigious Beijing accent. We'll have to see what happens in Beijing!)
The results of my test are quite impressive. (There are more details below on how you can test it yourself.) They also seem quite reasonable under a formal comparison of each result with the original text using a version of "edit distance".
But even though the results look good at first, I found that the few errors are bad enough to impede understanding.
Sometimes the misrecognized words are mistaken for words that are not part of the language at all, or that are relatively uncommon. Two especially extreme examples come from Mandarin Chinese and German, and there are other, less extreme examples in many of the languages:
1. In Mandarin, the word for 'mister' or 'sir' (xian1 sheng1) was twice mistakenly recognized as the name of a particular subway station in Shanghai (xin1 zhuang1). Although these two words bear a vague resemblance, the word for 'sir' is of course much more common, and it makes more sense in context ("Hello, sir!") than the name of a subway station.
2. In English, the word 'bag' in the sentence 'please open your bag' is mistakenly recognized as 'back'. Again, these two words look and sound very similar, but only "bag" really makes sense in this context.
3. In Spanish, an entire sentence meaning 'she is going to Bogota' (va a Bogotá) is mistakenly recognized as the word meaning 'let's go' (vamos) followed by the word for 'for' (pa), even though this word only exists in some dialects of Spanish and very rarely occurs at the end of a sentence.
4. In Russian, the word for 'thousand' (tysjacha) was consistently recognized as the word for 'thirty' (trid'sat'). This error seems a little harder to understand than the ones mentioned above, and it is clearly a crucial one. For example, "98,000" comes out as a nonsensical string of numbers "90 8 30."
5. In Portuguese, a way of saying 'please' or 'excuse me' (faz favor) was mistakenly recognized as the word for 'paste' followed by 'good afternoon' (yielding "pasta boa tarde"). This error seems to be one of the hardest to understand. 'Please' and 'excuse me' are quite common expressions (as is "faz favor" in Portuguese), but does 'paste good afternoon' make any sense?
6. In Indonesian, the word for 'stamp' (perangko) was mistakenly recognized as the word for 'ever' followed by a word for 'me' in some dialects of Indonesian (yielding "pernah ku"). While they do sound similar, Google's output simply does not make sense in this context, and 'stamp' is quite a common word that should be recognized correctly.
7. In Japanese, Google heard a word meaning 'operation of a machine' (kadō) when it should have recognized the word for 'corner' followed by a particle (kado o). These two sound very similar, but 'operation of a machine' makes no sense in the given context (kado o magatte ikimashita 'went around the corner').
8. In French, the sentence "ma sœur y va" meaning 'my sister goes there' was mistakenly recognized as "ça marche Riva." "Ça marche" means 'it works' (and "Riva" is a proper noun), so while these phrases may sound somewhat similar, the resulting output from Google makes no sense.
9. In German, the recognition for sentence #4 (this is the fifth sentence; I am counting starting at 0 as in programming) was particularly horrible. The original sentence was:
Da ist auch ein Restaurant aber du gehst lieber in die Imbiss-Stube. (There is a restaurant there, too, but you'd rather go to the snack bar).
What Google recognized was the following garbled nonsense:
weißt auch eine Frau Autogas liebe dich ins Studio (literally: 'know also a woman autogas love you into the studio')
As hard as it may be to believe, these do actually sound somewhat similar. However, clearly, Google completely failed to convey the intended message.
10. In Italian, at the beginning of the sentence "a Sergio piace mangiare bene" (Sergio likes to eat well), Google failed to recognize the initial "a." The resulting sentence makes no sense in Italian.
The recognition for Portuguese is horrible. I would be interested to see what happens when native speakers' Portuguese is fed to the recognizer, with speakers from both Portugal and Brazil.
In French, only sentences #10-11 were perfectly recognized (and sentence #15 is close enough). All the other sentences have serious errors (even if they have only a few).
In German, only sentences #13 and 15-17 were perfectly recognized. In sentence #16, the word "sie" (meaning 'they' in this case) is entirely in lowercase (both times it occurs), although the intended meaning was with uppercase "Sie" (meaning 'you'). There is no phonetic difference between "sie" ('they') and "Sie" ('you') in German, despite the difference in meaning. However, that is just four sentences. The other 16 sentences were rather poorly recognized.
The Italian recognizer frequently fails to distinguish between one-syllable words that differ only in stress. For example, it may confuse è ('he/she/it is') with e ('and').
You can download the comparison test (it is 19 megabytes - includes all the audio files) to try these yourself. To run the tests, you need Java Development Kit (JDK). I used version 1.6, which you can download for free, for example from OpenJDK or from the Oracle Java SE download site.
Other than Java Development Kit, all you need is an internet connection.
To test, unzip webasrtest.zip. The unzipped archive contains three directories.

The first directory contains a program, TestLangAsr.java, which needs no packages beyond the Java Development Kit. Compile this Java program with

javac TestLangAsr.java
You can use this program to recognize any specific sentence, or all sentences for a particular language. For example, to recognize the second sentence in the collection of French recordings, enter
java TestLangAsr French 1
(Sentence numbers start at 0 - as in C and Java - so that 1 actually stands for the second recording.) Assuming that there is no trouble with the Google site and the network, you should see something like
fr-FR:01 distance:7
original: Moi je n'aime pas ça mais ma sœur y va une fois par mois
recognized: moi j'aime pas ça marche rival une fois par mois
(Actual results may differ from these since Google may be improving the recognizer all the time.)
The "distance:7" is an approximate edit distance that I compute to compare the original text with the recognized text. It is based on the Levenshtein distance, modified to compare characters in the case of Mandarin Chinese and Japanese, and words otherwise. This is not a perfect comparison; for example, Google generally prints digits for numbers, while my original text generally spells numbers out as words.
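The exact comparison code is in the downloadable test; as an illustration only (not the exact code used), a plain Levenshtein distance over lower-cased words, with a character-level variant for Mandarin and Japanese, can be sketched as follows. Because the real distance is a modified approximation, this sketch may not reproduce every reported value.

```java
// Word-level Levenshtein distance, used to compare an original sentence
// with its recognized version. Comparison is case-insensitive (the German
// "sie"/"Sie" sentence counts as perfect). For Mandarin and Japanese, the
// same algorithm is run over individual characters instead of words.
public class EditDistance {

    static int distance(String[] a, String[] b) {
        // prev[j] holds the distance between the first i-1 tokens of a
        // and the first j tokens of b.
        int[] prev = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) prev[j] = j;
        for (int i = 1; i <= a.length; i++) {
            int[] cur = new int[b.length + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length; j++) {
                int subst = prev[j - 1] + (a[i - 1].equals(b[j - 1]) ? 0 : 1);
                cur[j] = Math.min(subst, Math.min(prev[j] + 1, cur[j - 1] + 1));
            }
            prev = cur;
        }
        return prev[b.length];
    }

    // Compare whitespace-separated words, ignoring case.
    static int wordDistance(String original, String recognized) {
        return distance(original.toLowerCase().trim().split("\\s+"),
                        recognized.toLowerCase().trim().split("\\s+"));
    }

    // Compare individual characters (for Mandarin Chinese and Japanese).
    static int charDistance(String original, String recognized) {
        return distance(original.split(""), recognized.split(""));
    }

    public static void main(String[] args) {
        // The French sentence discussed above.
        System.out.println(wordDistance(
            "Moi je n'aime pas ça mais ma sœur y va une fois par mois",
            "moi j'aime pas ça marche rival une fois par mois")); // prints 7
    }
}
```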
You can also see all the results for a language. To see all results from, say, the Japanese recordings, you can enter
java TestLangAsr Japanese -1
Here -1 indicates that you want to try recognizing all of the Japanese recordings.
This will produce a display similar to the above, with three lines per sentence
ja-JP:00 distance:8
original: クラスは八時に始まります
recognized: 倉沢82始まります
ja-JP:01 distance:3
original: 町田さんはどこにいますか
recognized: あきらさんはどこにいますか
ja-JP:02 distance:1
original: 毎日 日本文化 日本語のクラスにします
recognized: 毎日日本文化 日本語のクラスにします
etc.
Here again the distance is the edit distance between the original and recognized strings.
Incidentally, the results above are not identical to the results I got earlier. My descriptions use the earlier results, obtained during the last week of May 2013.
You can also test using your own recordings. I recorded the audio using Audacity, saved it as ".wav" files, and then converted those to ".flac" using SoX; you can also save portions of a recording, or the whole recording, in ".flac" format directly from Audacity. When converting with SoX, for example for the file spanish_10.wav, I used the command

sox spanish_10.wav spanish_10.flac rate 16k
For each language, I have a detailed list showing the results for each sentence, along with an audio recording (in ".wav" format, which most web browsers can play directly) of each sentence. To check these out, please use the links below to the details for each language.
To get an idea of how the different languages are recognized, I created a table showing the distance between the original and the recognized versions for each sentence. The average here is an integer average, rounded down (for example, 1.9 is rounded down to 1).
This table shows that the average error is not too bad. But examining each result in detail, there are various problems.
(Note added July 22, 2013) I also ran these sentences through Google's text-to-speech system (TTS), and then submitted the TTS speech to the speech recognizer. Please see the line-by-line comparison. The TTS speech is often not recognized at all (for example, for Indonesian). In general, my recordings are recognized better than those produced by the TTS.
Google returns a rating of the accuracy of each recognition result (a "confidence value"). I did not use this rating, but it could have helped reject some of the poor results.
Some of the results did not seem to be grammatical. These could be improved if grammar were taken into account.
Even though we test each sentence independently, there are relationships between the sentences that could be exploited. This is also true of conversations, where the user is likely to talk about related things. This could be used to consider whether the recognized sentence is likely to occur in the context of all the other sentences in a conversation. That could further improve the results.
Overall, though, Google's recognizers are quite good. It is encouraging to see that they work reasonably well on a wide variety of languages. As someone who loves languages, this is a very welcome development as it opens new ways for people from different parts of the world to collaborate for both work and entertainment.