We have a new application for issuing Linux commands through speech. The example in this article handles only a few commands, but it is easy to add more. Although the application targets Linux, it can be adapted to other operating systems such as Windows by changing the commands that are executed.
This application is called talkbash. You need Sphinx4 from CMU as well as Sphinxbase for building a language model. You can download talkbash.zip which includes the application source code.
The talkbash program issues bash shell commands in response to spoken instructions. A text file commands.txt specifies the things recognized by the program (actually by Sphinx) as well as the commands that are executed as a result.
For example, you could say
"where is fire fox?"
and the bash command which firefox will be run and the results such as /usr/bin/firefox will be printed out.
The commands.txt file included here lists only 10 sentences. You can add more sentences and corresponding commands. On each line, the spoken sentence and the command are separated by the pattern "xxx" (three lower case "x"s) with at least one blank space on each side of the pattern. You can change the pattern if you ever need "xxx" in a command, although we cannot think of a command that does.
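To make the file format concrete, here is a minimal sketch of how such lines could be split into sentence and command pairs. It is not taken from the talkbash source: the class name CommandTable and the method parse are hypothetical, and the second sample line (mapping "show me the files" to ls) is an assumed entry used only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, not part of talkbash itself.
class CommandTable {
    // The separator described above: "xxx" with at least one blank on each side.
    private static final Pattern SEPARATOR = Pattern.compile("\\s+xxx\\s+");

    /** Maps each spoken sentence to its bash command, keeping file order. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> table = new LinkedHashMap<>();
        for (String line : lines) {
            // Split into at most two parts so the command may itself contain spaces.
            String[] parts = SEPARATOR.split(line.trim(), 2);
            if (parts.length != 2 || parts[0].isEmpty()) {
                continue; // skip blank lines and lines without a separator
            }
            table.put(parts[0], parts[1].trim());
        }
        return table;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "where is fire fox xxx which firefox",
            "show me the files xxx ls"); // assumed entry, for illustration only
        parse(sample).forEach((sentence, command) ->
            System.out.println(sentence + " -> " + command));
    }
}
```

Changing the separator would then only require editing the regular expression in SEPARATOR.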
Any speech recognition application using Sphinx needs an Acoustic Model (you can use one that comes with Sphinx4) and a Language Model. The Language Model (or "lm") tells Sphinx about the words and combinations of words that may occur in your speech.
Before using talkbash, you need to install Java, Sphinxbase and Sphinx4. If you have difficulty with these, the installation problems page may provide some answers.
Ideally you should use a noise-cancelling headset for speaking to talkbash. We get the best results this way. If your computer's sound processing is very good you may not need this. One caveat: it is a good idea to test your microphone with a program like Audacity before using it with talkbash. Doing so may also help initialize the microphone properly.
After taking care of the prerequisites, you can compile the program for your machine with
The program looks for some sphinx4 classes in your class path. If the class path is not set up right, then you may not be able to compile talkbash. In that case please see installation problems for setting up the class path correctly.
We will describe the steps involved, including creating a language model. The downloaded talkbash.zip already contains a language model (somenumber.lm) and a dump file commands.DMP. If these files are present in your directory, talkbash will not prompt you to create the language model. To go through the language modeling steps yourself, temporarily move the .lm and .DMP files to another directory. Note that talkbash does not detect whether commands.txt has changed. If you change commands.txt, delete commands.sent and the language model files before running talkbash.
After the program compiles, you can run it with
This should produce the following output
10 questions, 34 words
Created commands.sent
See instructions on creating a language model using http://www.speech.cs.cmu.edu/tools/lmtool-new.html
Then use sphinx_lm_convert from sphinxbase to create DMP file
Rerun the program after creating a language model and a dump file
See the instructions above on creating language models. After you create the language model, rerun
This time the output will show that you already have a language model 4999.lm and a dump file commands.DMP. (Please note that 4999.lm is only an example. The language model file you get from the online language modelling toolkit is likely to have a different number as the first part of its name.)
The program now lists the questions you can ask and waits for you to ask them.
(Incidentally you may notice that we ask "where is fire fox" breaking up "firefox" into two words. This is so that the CMU dictionary can find the two words "fire" and "fox", which are common words, while the made up word "firefox" is not in that dictionary.)
10 questions, 34 words
Found sentences file commands.sent
Found language model 4999.lm
Found dump file commands.DMP
Ask some of the questions below (from commands.txt)
-----------------------------------------------
how many bytes of space is available
is my disk full
show me the files
show me the hidden files also
what all is here
what are all the files here and below
what is the latest file
where is fire fox
which directory is this
which text files do i have
-----------------------------------------------
Start speaking. End program using control-C.
You can now ask questions. If you are a native US English speaker using a noise-cancelling microphone, you will probably have good results.
We deliberately used an ordinary microphone to get imperfect recognition. The program is designed to recover from some errors.
Speech recognizers often do not recognize everything that you say. There is a rather simple way to recover from some errors, especially when dealing with a small language model.
The talkbash program matches the recognized questions with the actual questions it can answer using a slightly modified implementation of Levenshtein distance. Our variant implements a comparison involving sequences of words, (rather than sequences of characters), against the sequences of words in the questions in commands.txt.
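The idea of comparing word sequences can be sketched as follows. This is the standard Levenshtein algorithm applied to words instead of characters. The article does not describe what its "slight modification" is, so the sketch should not be read as the talkbash implementation. The class name WordDistance is hypothetical, and lower-casing both sentences before comparing is an assumption.

```java
// Hypothetical helper, not the talkbash implementation.
class WordDistance {
    /** Levenshtein distance where the unit of editing is a whole word. */
    static int distance(String a, String b) {
        String[] x = words(a);
        String[] y = words(b);
        // prev[j] holds the distance between the first i-1 words of x and the first j words of y.
        int[] prev = new int[y.length + 1];
        int[] curr = new int[y.length + 1];
        for (int j = 0; j <= y.length; j++) {
            prev[j] = j; // building j words from nothing costs j insertions
        }
        for (int i = 1; i <= x.length; i++) {
            curr[0] = i; // removing i words costs i deletions
            for (int j = 1; j <= y.length; j++) {
                int substitution = prev[j - 1] + (x[i - 1].equals(y[j - 1]) ? 0 : 1);
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                curr[j] = Math.min(substitution, Math.min(insertion, deletion));
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[y.length];
    }

    // Splits on whitespace after lower-casing (an assumption of this sketch).
    private static String[] words(String s) {
        String t = s.trim().toLowerCase();
        return t.isEmpty() ? new String[0] : t.split("\\s+");
    }

    public static void main(String[] args) {
        // Compares a misrecognized sentence with one of the known questions.
        System.out.println(distance("many file fox", "where is fire fox"));
    }
}
```

To match a recognized sentence, one would compute this distance against every question in commands.txt and keep the question with the smallest value.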
What we show below is from one trial. You will probably have different results. The Distance is the Levenshtein distance from the recognized sentence to the matched sentence. A lower value is better (i.e. closer).
| Spoken sentence | Recognized sentence | Matched question | Distance |
|---|---|---|---|
| show me the files | show of i is | show me the files | 3 |
| show me the hidden files also | show me the hidden files also | show me the hidden files also | 0 |
| what is the latest file | what is many this i | what is the latest file | 3 |
| what are all the files here and below | what all of of i is hidden of | what all is here | 5 |
| which text files do I have | is text bytes do i have | which text files do I have | 2 |
| is my disk full | is my disk full | is my disk full | 0 |
| how many bytes of space is available | i many bytes of space is available | how many bytes of space is available | 1 |
| what all is here | what are all the bytes the of | what are all the files here and below | 4 |
| which directory is this | which directory is this | which directory is this | 0 |
| where is fire fox | many file fox | where is fire fox | 3 |
Even though the recognition is not perfect, we generally get the right question. This would not work as well when there are a lot of possible questions. The distance can then be used as a guide to see if we should accept the question or ask the user to try again.
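One way to use the distance as a guide is to reject the best match when it is still too far away. The sketch below assumes the distances to each candidate question have already been computed. The class MatchPolicy, the method choose and the maxDistance cutoff are all hypothetical, and a suitable cutoff would have to be found by experiment.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical acceptance rule, not part of talkbash.
class MatchPolicy {
    /**
     * Returns the candidate question with the smallest distance, or an empty
     * Optional when even the best candidate is further away than maxDistance.
     */
    static Optional<String> choose(Map<String, Integer> distances, int maxDistance) {
        return distances.entrySet().stream()
                .min(Map.Entry.comparingByValue())        // closest candidate
                .filter(e -> e.getValue() <= maxDistance) // reject if still too far
                .map(Map.Entry::getKey);
    }
}
```

An empty result would be the cue to ask the user to repeat the question instead of running a command.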
Once we get the question, we still have to get bash to answer the question. This is done in talkbash with an interface to execute bash commands. Although this is simple, there is a problem if you try to run a shell command directly using something like Runtime.getRuntime().exec. We get around this problem in general by creating a small script and then by asking bash to execute that.
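The script approach can be sketched as follows. The article does not say which problem it ran into. One well-known limitation is that Runtime.exec does not start a shell, so pipes, redirection and wildcards in the command string are not interpreted. The class BashRunner and the method run are hypothetical names, and the sketch assumes bash is on the PATH. It is not the talkbash code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not the talkbash implementation.
class BashRunner {
    /** Writes the command to a temporary script, runs it with bash, and returns its output. */
    static String run(String command) throws IOException, InterruptedException {
        Path script = Files.createTempFile("talkbash", ".sh");
        try {
            Files.write(script, (command + "\n").getBytes(StandardCharsets.UTF_8));
            // Passing the script to bash lets bash handle pipes, wildcards and redirection.
            Process process = new ProcessBuilder("bash", script.toString())
                    .redirectErrorStream(true) // merge stderr into stdout
                    .start();
            StringBuilder output = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    output.append(line).append('\n');
                }
            }
            process.waitFor();
            return output.toString();
        } finally {
            Files.deleteIfExists(script); // clean up the temporary script
        }
    }
}
```

For example, BashRunner.run("which firefox") would return whatever bash prints for that command on your machine.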
This application does not talk back to the user. It simply acts on what it recognizes. This is quite dangerous, especially if one of your commands permanently deletes something.
All speech applications need a dialog where the application asks for confirmation. We can do this on the screen, or have the application ask "Are you sure you want to delete all your files?" or some such thing. The Jaivox library is mainly about creating such dialogs. You can see a high level view of this in the Jaivox Interpreter. The Jaivox download also includes an application that provides a command interface to the find command for locating specific files. Please see how to integrate with applications.