Implementing a speech command interface

We have a new application for issuing Linux commands through speech. The example in this article handles only a few commands, but it is easy to add others. Although the application is written for Linux, it can be adapted to other operating systems such as Windows by changing the commands that are executed.

This application is called talkbash. You need Sphinx4 from CMU, as well as Sphinxbase for building a language model. You can download talkbash.zip, which includes the application source code.

The talkbash program issues bash shell commands in response to spoken instructions. A text file, commands.txt, lists the sentences that the program (actually Sphinx) recognizes, together with the command that is executed for each one.

For example, you could say

"where is fire fox?"

and the bash command "which firefox" will be run, printing a result such as /usr/bin/firefox.

The commands.txt file included here lists only 10 sentences. You can add more sentences and corresponding commands. On each line, the spoken sentence and the command are separated by the pattern "xxx" (three lower-case letters "x") with at least one blank space on each side. You can change this pattern if you ever need "xxx" in a command (we cannot think of a case where you would).
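
As an illustration of the file format, here is a minimal sketch of how such lines could be split into a sentence and a command. It is not taken from the talkbash source: the class name CommandTable and the method parse are hypothetical, and the sample line is built from the "where is fire fox" example above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class CommandTable {
    // Separator described in the article: "xxx" with at least one blank on each side.
    private static final Pattern SEPARATOR = Pattern.compile("\\s+xxx\\s+");

    /** Maps each spoken sentence to its bash command; lines without a separator are skipped. */
    public static Map<String, String> parse(List<String> lines) {
        Map<String, String> table = new LinkedHashMap<>();
        for (String line : lines) {
            // Split at the first separator only, so the command itself may contain anything.
            String[] parts = SEPARATOR.split(line.trim(), 2);
            if (parts.length == 2 && !parts[0].isEmpty()) {
                table.put(parts[0].trim(), parts[1].trim());
            }
        }
        return table;
    }

    public static void main(String[] args) {
        // Assumed sample line, based on the example in the text.
        Map<String, String> table = parse(List.of("where is fire fox xxx which firefox"));
        // prints: where is fire fox -> which firefox
        table.forEach((spoken, command) -> System.out.println(spoken + " -> " + command));
    }
}
```

If you change the separator, only the SEPARATOR pattern in this sketch would need to change.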

How to create a language model

Any speech recognition application using Sphinx needs an acoustic model (you can use one that comes with Sphinx4) and a language model. The language model (or "lm") tells Sphinx which words and combinations of words may occur in your speech.

Instructions for creating a language model

  1. You need a list of the sentences in your language model. In the talkbash example, the sentences come from the file commands.txt, using only the part of each line before the "xxx" that separates the spoken sentence from the bash command. The talkbash program creates a file commands.sent when you first run it; this file can be used for the next step.
  2. Once you have a file of sentences like commands.sent you can use either the downloaded CMU-Cambridge Language Modeling Toolkit or a newer online version. The rest of the instructions here assume you are using the newer online version.
  3. Upload commands.sent to the Online Language Modelling site, press the button called COMPILE KNOWLEDGE BASE, and wait for a new page that has links to several files.
  4. Among the various links in the new page, there will be a link to a file with extension ".lm". This is the language model and the only file you really need for the talkbash program. The name of this file is generally a number followed by ".lm". The number changes each time you compile the knowledge base. For now, assume that the file is 4999.lm.
  5. The language model you downloaded is an ASCII file, but Sphinx4 needs a binary file called a dump file with extension DMP. This is where you need Sphinxbase, which includes a program called sphinx_lm_convert. In this case you can create a dump file called commands.DMP with
    sphinx_lm_convert -i 4999.lm -o commands.DMP
  6. The commands.DMP file is what you need for the language model.

Using talkbash

Before using talkbash, you need to install Java, Sphinxbase and Sphinx4. If you have difficulty with any of these, the installation problems page may provide some answers.

Ideally you should use a noise-cancelling headset for speaking to talkbash; we get the best results this way. If your computer's sound processing is very good, you may not need one. One caveat: it is a good idea to test your microphone with a program like Audacity before using it with talkbash. This may also help initialize the microphone properly.

After taking care of the prerequisites, you can compile the program for your machine with
javac talkbash.java

The program looks for some Sphinx4 classes in your class path. If the class path is not set up correctly, talkbash may not compile. In that case, please see installation problems for setting up the class path.

The steps described here include creating a language model. The downloaded talkbash.zip already contains a language model (somenumber.lm) and a dump file commands.DMP. If these are present in your directory, talkbash will not prompt you to create the language model. To go through the language modeling steps yourself, temporarily move the .lm and .DMP files to another directory. Note that talkbash does not detect whether commands.txt has changed: if you change commands.txt, delete commands.sent and the language model files before running talkbash.

After the program compiles, you can run it with
java talkbash

This should produce the following output

10 questions, 34 words
Created commands.sent
See instructions on creating a language model
using http://www.speech.cs.cmu.edu/tools/lmtool-new.html
Then use sphinx_lm_convert from sphinxbase to create DMP file
Rerun the program after creating a language model and a dump file
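
The counts in the first output line can be reproduced from the ten sentences listed later in this article if "words" is read as distinct words. That reading is our inference from the numbers, not something taken from the talkbash source, and the class and method names below are hypothetical.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class SentenceStats {
    /** Counts distinct lower-cased words across all sentences. */
    public static int distinctWords(List<String> sentences) {
        Set<String> words = new TreeSet<>();
        for (String sentence : sentences) {
            for (String word : sentence.trim().toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        return words.size();
    }

    public static void main(String[] args) {
        // The ten sentences shown in the talkbash output in this article.
        List<String> sentences = List.of(
            "how many bytes of space is available",
            "is my disk full",
            "show me the files",
            "show me the hidden files also",
            "what all is here",
            "what are all the files here and below",
            "what is the latest file",
            "where is fire fox",
            "which directory is this",
            "which text files do i have");
        // prints: 10 questions, 34 words
        System.out.println(sentences.size() + " questions, "
                + distinctWords(sentences) + " words");
    }
}
```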

See the instructions above on creating language models. After you create the language model, rerun

java talkbash

This time the output will show that you already have a language model 4999.lm and a dump file commands.DMP. (Please note that 4999.lm is only an example. The language model file you get from the online language modelling toolkit is likely to have a different number as the first part of its name.)

The program now lists the questions you can ask and waits for you to ask one.

(Incidentally, you may notice that we ask "where is fire fox", breaking up "firefox" into two words. This is so that the CMU dictionary can find the two words "fire" and "fox", which are common words, while the made-up word "firefox" is not in that dictionary.)

10 questions, 34 words
Found sentences file commands.sent
Found language model 4999.lm
Found dump file commands.DMP
Ask some of the questions below (from commands.txt)
-----------------------------------------------
how many bytes of space is available
is my disk full
show me the files
show me the hidden files also
what all is here
what are all the files here and below
what is the latest file
where is fire fox
which directory is this
which text files do i have
-----------------------------------------------

Start speaking. End program using control-C.

You can now ask questions. If you are a native US English speaker using a noise-cancelling microphone, you will probably have good results.

We deliberately used an ordinary microphone to get imperfect recognition. The program is designed to recover from some errors.

Dealing with imperfect recognition

Speech recognizers often do not recognize everything that you say. There is a rather simple way to recover from some errors, especially when dealing with a small language model.

The talkbash program matches each recognized question against the questions it can actually answer, using a slightly modified implementation of Levenshtein distance. Our variant compares sequences of words (rather than sequences of characters) with the sequences of words in the questions in commands.txt.
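
The idea can be sketched as follows. This is a minimal word-level Levenshtein distance written for this article, not the code from talkbash itself; the class and method names are hypothetical, and we assume that words are compared without regard to case.

```java
import java.util.List;

public class WordDistance {
    /** Levenshtein distance computed over words instead of characters (case-insensitive). */
    public static int distance(String a, String b) {
        String[] x = words(a);
        String[] y = words(b);
        int[] prev = new int[y.length + 1];
        int[] curr = new int[y.length + 1];
        for (int j = 0; j <= y.length; j++) {
            prev[j] = j; // turning an empty sequence into j words takes j insertions
        }
        for (int i = 1; i <= x.length; i++) {
            curr[0] = i; // turning i words into an empty sequence takes i deletions
            for (int j = 1; j <= y.length; j++) {
                int substitute = prev[j - 1] + (x[i - 1].equals(y[j - 1]) ? 0 : 1);
                int delete = prev[j] + 1;
                int insert = curr[j - 1] + 1;
                curr[j] = Math.min(substitute, Math.min(delete, insert));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[y.length];
    }

    private static String[] words(String s) {
        String t = s.trim().toLowerCase();
        return t.isEmpty() ? new String[0] : t.split("\\s+");
    }

    /** Returns the known sentence with the smallest distance to the recognized one. */
    public static String closest(String recognized, List<String> known) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : known) {
            int d = distance(recognized, candidate);
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> known = List.of("is my disk full", "show me the files", "where is fire fox");
        System.out.println(distance("show of i is", "show me the files")); // prints 3
        System.out.println(closest("many file fox", known)); // prints where is fire fox
    }
}
```

A caller could also reject a match when the distance is large relative to the length of the sentence. The article does not specify a threshold, so that choice is left open here.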

What we show below is from one trial. You will probably have different results. The Distance is the Levenshtein distance from the recognized sentence to the matched sentence. A lower value is better (i.e. closer).

Spoken | Recognized | Matched | Distance
show me the files | show of i is | show me the files | 3
show me the hidden files also | show me the hidden files also | show me the hidden files also | 0
what is the latest file | what is many this i | what is the latest file | 3
what are all the files here and below | what all of of i is hidden of | what all is here | 5
which text files do I have | is text bytes do i have | which text files do I have | 2
is my disk full | is my disk full | is my disk full | 0
how many bytes of space is available | i many bytes of space is available | how many bytes of space is available | 1
what all is here | what are all the bytes the of | what are all the files here and below | 4
which directory is this | which directory is this | which directory is this | 0
where is fire fox | many file fox | where is fire fox | 3

Even though the recognition is not perfect, we generally get the right question. This would not work as well when there are a lot of possible questions. The distance can then be used as a guide to see if we should accept the question or ask the user to try again.

Once we have the question, we still need bash to answer it. talkbash does this through an interface for executing bash commands. Although this sounds simple, running a shell command directly with something like Runtime.getRuntime().exec is problematic, typically because exec starts the program without a shell, so shell syntax such as pipes, wildcards and redirection is not interpreted. We get around this problem in general by writing the command to a small script and then asking bash to execute that script.
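
The script approach might look like the sketch below. It illustrates the general technique rather than the actual talkbash code: the class name BashRunner, the temporary file prefix, and the choice of ProcessBuilder are our assumptions, and bash is assumed to be on the PATH.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class BashRunner {
    /** Writes the command to a temporary script, runs it with bash, and returns its output. */
    public static String run(String command) {
        Path script = null;
        try {
            script = Files.createTempFile("talkbash", ".sh");
            Files.write(script, ("#!/bin/bash\n" + command + "\n").getBytes(StandardCharsets.UTF_8));
            Process process = new ProcessBuilder("bash", script.toString())
                    .redirectErrorStream(true) // merge stderr into stdout
                    .start();
            // Read everything the script prints, then wait for it to exit.
            String output = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
            process.waitFor();
            return output;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for bash", e);
        } finally {
            if (script != null) {
                try {
                    Files.deleteIfExists(script);
                } catch (IOException ignored) {
                    // Leaving a stray temporary file behind is harmless for this sketch.
                }
            }
        }
    }

    public static void main(String[] args) {
        // The pipe is interpreted because bash, not Java, parses the command line.
        System.out.print(run("echo hello | tr a-z A-Z")); // prints HELLO
    }
}
```

Because the whole command line is handed to bash inside a script, anything that works at a bash prompt should work here without special quoting on the Java side.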

We ignored an important requirement

This application does not talk back to the user. It simply executes whatever it recognizes. This is quite dangerous, especially if one of your commands permanently deletes something.

All speech applications need a dialog in which the application asks for confirmation. We can do this on the screen, or have the application ask "Are you sure you want to delete all your files?" or something similar. The Jaivox library is mainly about creating such dialogs. You can see a high-level view of this in the Jaivox Interpreter. The Jaivox download also includes an application that provides a command interface to the find command for locating specific files; please see how to integrate with applications.
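
For a sense of what the simplest on-screen confirmation could look like, here is a small sketch. It does not use the Jaivox library and is not part of talkbash; the class name, the rule that flags commands starting with rm, and the sample command are all hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;

public class Confirm {
    /** Hypothetical rule: treat commands that start with rm as destructive. */
    public static boolean needsConfirmation(String command) {
        String c = command.trim();
        return c.equals("rm") || c.startsWith("rm ");
    }

    /** Asks a yes/no question on the console; only an explicit "yes" or "y" confirms. */
    public static boolean confirm(String question, BufferedReader in) {
        System.out.print(question + " (yes/no) ");
        try {
            String answer = in.readLine();
            if (answer == null) {
                return false; // end of input counts as "no"
            }
            answer = answer.trim().toLowerCase();
            return answer.equals("yes") || answer.equals("y");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String command = "rm -i notes.txt"; // hypothetical command
        if (!needsConfirmation(command)
                || confirm("Are you sure you want to run: " + command + "?", in)) {
            System.out.println("Would run: " + command);
        } else {
            System.out.println("Cancelled.");
        }
    }
}
```

Defaulting to "no" on anything other than an explicit yes is the safer choice when the recognized sentence itself may be wrong.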