Testing Google's Text To Speech

Google's text-to-speech (TTS) system produces speech in various languages that is pronounced well for the most part. However, it has problems in every language, especially with pitch, so the speech does not sound natural.

By Vijay John July 16, 2013

Google has TTS systems for various languages. However, like other TTS systems, Google's TTS systems have various problems, some of which are language-specific. The conversion from text to speech is especially bad for Indonesian; while stress was correctly modeled in most cases, every syllable sounded like a nonsense syllable being pronounced by a native speaker of American English. It sounded as if these nonsense syllables were simply strung together.

However, there are important problems in other languages as well, often having to do with pitch. In Mandarin Chinese, where pitch is especially important, tones were mispronounced fairly frequently, and stress was incorrectly modeled. Similarly, the pitch patterns in Japanese were often incorrect, even though Japanese is another language where pitch is especially important (although in a different way from Mandarin Chinese).

Examples of problems with pitch in each language include the following:

1. In the Mandarin Chinese sentence “我已经订了房间” (wo3 yi3jing1 ding4 le fang2jian1 ‘I’ve already booked a room’), the first three syllables are all pronounced with the wrong tone. In isolation, 我 wo3 ‘I’ might be pronounced with a tone that falls and then rises, although in practice, what I have usually heard from Mandarin Chinese speakers is simply a low flat tone. However, in this context, it would probably be much more appropriate for the tone to be rising slightly. In any case, what Google produces is neither of these but rather a flat tone that has a slightly (but inappropriately) higher pitch. This is also what it produces for the next syllable. The last of the three syllables is pronounced with a rising tone (*jing2) instead of the high flat tone (jing1).

2. The English sentence “May I have two pounds of apples?” is pronounced with falling intonation at the end, as if it were not a question. Two other things are strange about this sentence. One is that the vowel in “have” is NOT the same as the vowel in “apples” (yes, there is only one vowel in this word). The other is true of all of the English sentences: the voice sounds creakier than most human voices do (and it does not sound like, for example, an old person).

3. The intonation for the Spanish sentence “los pasajeros están pasando por el control de seguridad” (the passengers are going through (airport) security) is very odd. The first word is pronounced with a pitch that is too high, the last word is stressed on the wrong syllable, and some of the words in between are pronounced in such a way that it is not clear where the stressed syllable in these words is.

4. In the Russian sentence “да, сегодня доллар опять падал” (yes, the dollar fell again today), the intonation is problematic (as with most of the other sentences in all of these languages), the wrong word is emphasized, and there should be a pause after the first word. In addition, there are several pronunciation mistakes.

5. In the Portuguese sentence “faz favor, pode me dizer onde é a saída?” (can you please tell me where the exit is?), for some strange reason, the word “a” (the) is emphasized considerably more than the previous word “é” (is), even though the reverse would make considerably more sense (especially since é is a stressed syllable).

6. In the Indonesian sentence “boleh saya bantu?” (can I help (you)?), there is no indication of the correct pitch pattern or pronunciation, apart from perhaps the pitch being particularly high on the second-to-last syllable of the sentence (which is stressed in some varieties of Indonesian). Instead, it sounds like it is made up of recordings of a male native speaker of American English producing nonsense syllables that were then strung together automatically (“bow-lay-sah-yah-bon-too”).

7. In the Japanese sentence “クラスは八時に始まります” (kurasu wa hachi-ji ni hajimarimasu ‘the class starts at eight o’clock’), the pitch sounds like it is bobbing up and down too much to sound like Japanese as spoken by a human. The phrase 八時に (hachi-ji ni ‘at eight o’clock’) is not emphasized even though it is the key phrase here. The pitch should probably be somewhat high on the second syllable of the first word, and highest on the second syllable of 八時 (hachi-ji ‘eight o’clock’).

8. The pitch patterns in the French sentence “il fait souvent beau et tu peux choisir des activités sportives ou culturelles” (the weather is often beautiful, and you can choose athletic or cultural activities) are very odd. The pitch rises towards the end of the sentence even though this is a statement and not a question. It should not begin to rise until the third or fourth word, and it should also rise on the words “sportives” (athletic) and probably “culturelles” (cultural). (There seems to be a pronunciation mistake here as well, since the final “s” in “sportives” is pronounced when it shouldn’t be.)

9. The pitch patterns are also odd in the German sentence “Besucher aus aller Welt kommen hierher um sich die wertvolle Kunstsammlung anzusehen” (visitors from all over the world come here to view the valuable collection of art). There should be more emphasis than there is on “aller Welt” (all over the world), and the pitch is too high on the first syllable of “wertvolle” (valuable) and should not be rising on that of “hierher” ((to) here).

10. In the Italian sentence “frequenta un corso di specializzazione” (she is taking a specialization course), the pitch pattern for the first word should be about the same as the pattern for the two words that follow it, but it isn’t. Furthermore, in the last word, the pitch should be highest on the second-to-last syllable, but it is instead highest on the syllable before that. The first word is pronounced too slowly, and the “zz” in the last word is mispronounced.

Comparison of results by language

For each language, I have tested 20 sentences. I have listed each sentence along with comments on the problems with the speech produced by the Google Text to Speech component.

English French German Indonesian Italian
Japanese Mandarin Portuguese Russian Spanish

We also passed the TTS results through Google's speech recognizer and compared the results to a human speaker. There are places where the TTS's speech is recognized better than the human speaker's, but there are many situations where the TTS's speech is not recognized at all.
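One way such a comparison could be scored is sketched below. This is only an illustration under stated assumptions: the article does not say which metric was used, so word error rate (WER) is a stand-in chosen here, the function name word_error_rate is hypothetical, and the two recognizer transcripts are invented for the example rather than taken from the actual test results. Only the reference sentence comes from the English example above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two strings, divided by the
    number of words in the reference (assumed metric, not the article's)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Reference text from the English example above (punctuation removed).
reference = "may i have two pounds of apples"

# Invented recognizer outputs, for illustration only.
human_transcript = "may i have two pounds of apples"
tts_transcript = "may i have to pounds of apples"

print("human WER:", word_error_rate(reference, human_transcript))
print("TTS WER:  ", word_error_rate(reference, tts_transcript))
```

With this scoring, a lower value means the recognizer understood the speech better, and speech that is not recognized at all (an empty transcript) scores 1.0.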

Even though I note problems, Google's TTS works much better than various other tools such as Festival and eSpeak.