This article describes a program that uses the Jaivox library to align words with an audio transcript. The program may be downloaded from the downloads area. The latest Jaivox library (version 0.7) can be downloaded from the downloads page.
There are many applications of aligning audio with text. A frequently cited one is synchronizing closed captions with the audio of TV programs and movies, where it is useful to make the right words appear as they are spoken. A less precise variant aligns whole sentences with audio, which is useful in creating audio corpora for training speech recognition systems. The web has plenty of references to this problem, as well as several approaches, so our purpose here is not to explain the problem itself. This article describes a program to do word-level alignment of an audio recording of the Gettysburg address (not, of course, by President Lincoln) with the text of the speech. That is, we figure out the small time interval (within a second or two in most cases) in which each word occurs in the audio.
The audio for our example is obtained from Wikipedia (download the .ogg file from this site). The text of the speech is also from Wikipedia (scroll down to find the text). We have converted the audio file to the Wave format using sox. The audio is not included in the downloaded zip file, but the text, edited slightly, is included.
You will need to download Gettysburg_by_Britton.ogg and the sox program to convert this file into a mono Wave format file. You will also need the current (unreleased) version of the Jaivox library from GitHub.
The program for word alignment can be downloaded here (wordalign.zip).
The downloaded zip file contains a directory code with several sub-directories.
You can run the program in the timing directory using the work/result.txt that we created earlier. The timing program aligns words to audio using the (partial) recognition results in result.txt.
The audio lasts for over 100 seconds. It is cut up into many overlapping audio files, each lasting about five seconds, using some utilities in the Jaivox library. Each of these overlapping files is then sent through a speech recognizer (in this case, Google's). The recognition is not perfect, but many of the words are recognized correctly. The results are written to work/result.txt.
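The Jaivox utilities themselves are not shown here, but the segmentation arithmetic behind this step can be sketched as follows. The five-second window and the one-second step are assumptions chosen to match the overlap described above, not values taken from the Jaivox code.

```java
import java.util.ArrayList;
import java.util.List;

public class Segments {
    // Compute [start, end) offsets, in seconds, of overlapping windows
    // covering an audio clip of the given total length.
    static List<int[]> windows(int totalSeconds, int windowSeconds, int stepSeconds) {
        List<int[]> result = new ArrayList<>();
        for (int start = 0; start < totalSeconds; start += stepSeconds) {
            int end = Math.min(start + windowSeconds, totalSeconds);
            result.add(new int[] { start, end });
            if (end == totalSeconds) break;
        }
        return result;
    }

    public static void main(String[] args) {
        // A 100-second clip cut into 5-second windows that advance
        // 1 second at a time, like the overlapping files sent to the
        // recognizer.
        List<int[]> w = windows(100, 5, 1);
        System.out.println(w.size());                        // number of windows
        System.out.println(w.get(0)[0] + " " + w.get(0)[1]); // first window
    }
}
```

Each second of audio is then covered by several windows, which is what later lets the timing program narrow a word down to a small interval.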
To run cutasr use
The program in code/timing compares the recognition results with segments of the original text. Here we use various heuristics about the recognition to figure out places where the text and audio may agree. This gives us intervals of less than five seconds in which certain words are spoken.
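The heuristics themselves live in code/timing; a much simplified version of the basic idea, finding the position in the original word sequence whose words overlap most with a recognized line, might look like this. The method name and scoring are illustrative, not the ones used in the program.

```java
public class Matcher {
    // Return the index in the original word sequence where a window of
    // the same length as the recognized words shares the most words.
    static int bestMatch(String[] original, String[] recognized) {
        int bestIndex = 0, bestScore = -1;
        for (int i = 0; i + recognized.length <= original.length; i++) {
            int score = 0;
            for (int j = 0; j < recognized.length; j++) {
                if (original[i + j].equalsIgnoreCase(recognized[j])) score++;
            }
            if (score > bestScore) { bestScore = score; bestIndex = i; }
        }
        return bestIndex;
    }

    public static void main(String[] args) {
        String[] text = "four score and seven years ago our fathers brought forth".split(" ");
        // A recognized line with one error ("does" for "fathers"),
        // as in the Google results shown later.
        String[] heard = "our does brought forth".split(" ");
        System.out.println(bestMatch(text, heard)); // prints 6, the index of "our"
    }
}
```

Since each recognized line came from a window with a known start time, anchoring it in the original text this way ties those words to that window's time interval.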
To run timing use
Starting with this information, the program guesses the time intervals when various nearby words are spoken.
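One simple way to make such a guess, once two nearby words have been anchored to known seconds, is linear interpolation between them. This is a sketch of the idea under the assumption of a roughly even speaking rate; the actual program in code/timing may proceed differently.

```java
public class Interpolate {
    // Guess the start second of word number w, given that word a starts
    // at second ta and word b (a < w < b) starts at second tb, assuming
    // the words between them are spoken at a roughly even rate.
    static double guessStart(int a, double ta, int b, double tb, int w) {
        return ta + (tb - ta) * (w - a) / (b - a);
    }

    public static void main(String[] args) {
        // With words 0 ("four") and 5 ("ago") anchored at seconds 0 and 2,
        // word 3 ("seven") is guessed to start around second 1.2.
        System.out.println(guessStart(0, 0.0, 5, 2.0, 3));
    }
}
```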
The output consists of a description of each word in the Gettysburg address along with a small time interval where the word was spoken.
The output looks like the following:
0 0 1 four
1 0 1 score
2 1 2 and
3 1 2 seven
4 1 2 years
5 2 3 ago
6 3 4 our
7 4 5 fathers
8 4 5 brought
9 4 5 forth
10 4 5 on
11 4 5 this
12 5 6 continent
The output is organized in four columns. The first column is the word number (counting from 0), the second is the second of audio at which the word starts, the third is the second by which the word ends, and the fourth is the word in Gettysburg.txt that occurs between the times given in the previous two columns. Thus, for example, the word seven in the last column is word number 3 (counting from 0, as given in the first column) and occurs in the audio between the start of second 1 and the start of second 2.
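Because the format is plain whitespace-separated columns, a consumer of this output can parse each line directly. The record class below is our own illustration, not part of Jaivox.

```java
public class TimedWord {
    final int index;       // column 1: word number, counting from 0
    final int startSecond; // column 2: second in which the word starts
    final int endSecond;   // column 3: second by which the word has ended
    final String word;     // column 4: the word from Gettysburg.txt

    TimedWord(String line) {
        String[] parts = line.trim().split("\\s+");
        index = Integer.parseInt(parts[0]);
        startSecond = Integer.parseInt(parts[1]);
        endSecond = Integer.parseInt(parts[2]);
        word = parts[3];
    }

    public static void main(String[] args) {
        TimedWord t = new TimedWord("3 1 2 seven");
        System.out.println(t.word + " spoken between seconds "
                           + t.startSecond + " and " + t.endSecond);
    }
}
```

A caption or subtitle tool could read the whole file this way and display each word during its interval.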
You can verify whether these timings are correct by spotting the time that different words are spoken in the audio using a tool such as Audacity.
The code/sphinx directory contains cutasrSphinx.java. This uses the Sphinx recognizer instead of Google. You need to install Sphinx4 along with the WSJ (Wall Street Journal) acoustic model. You also need the WSJ language model wsj5kc.Z.DMP that comes with the Sphinx4 package. Please see the batch.xml file in the code/sphinx directory for details.
Our observation is that the recognition results from Sphinx are not very useful. It would be difficult to align the words using Sphinx's output.
For example, here are the first few lines of result.txt when we use the Google recognizer.
000 four score and seven years ago our fathers brought forth
001 and seven years ago our fathers brought forth on this continent a new
002 ago our fathers brought forth on this continent a new nation
003 our fathers brought forth on this continent a new nation conceived
004 does brought forth on this continent a new nation conceived in Liberty
By contrast, the same audio is recognized very poorly by Sphinx.
000 full auto and seventy years ago
001 off autos road fulton is confident a new
002 off autos road fulton is confident a new nation
003 off autos road fulton is confident a new nation
004 gone is confident a new nation
There may be other ways to improve recognition and timing using phonemes and Sphinx3, but for the simple alignment strategies used in this example it is better to use the Google recognizer.
The programs included here are meant for illustration, not for general use. The alignment method is rather simple and may not work well on larger data, mainly because of the way we look for commonalities between the recognized text and the original text. A more correct approach would advance the time as we align, but that would make the program harder to understand.
There are several parameters used in the program; these may need to be adjusted based on the recognition accuracy. It is possible to align better by breaking up the recognized and original texts into phonemes, but that is not done in the program described here.
The location of each word can be determined more precisely by analyzing the audio files to spot word breaks and other audio features. This requires a much more detailed analysis of the audio data.
(February 12, 2014)