Word-level audio alignment

This article describes a program that uses the Jaivox library to align the words of a transcript with an audio recording. The program may be downloaded from the downloads area. The latest Jaivox library (version 0.7) can be downloaded from the downloads page.

There are many applications of aligning audio with text. A frequently cited one is synchronizing closed captions with the audio of TV programs and movies, where it is useful to make the right words appear as they are spoken. A less precise version of the problem involves aligning whole sentences with audio, which is useful for creating audio corpora to train speech recognition systems. The web has plenty of references to this problem, as well as several approaches. Our purpose here is not to explain the problem in general. This article describes a program to do word-level alignment of an audio recording of the Gettysburg address (not by President Lincoln) with the text of the speech. This means that we figure out the small time interval (within a second or two in most cases) where each word occurs in the audio.

The audio for our example is obtained from Wikipedia (download the .ogg file from this site). The text of the speech is also from Wikipedia (scroll down to find the text). We have converted the audio file to the Wave format using sox. The audio is not included in the downloaded zip file, but the text, edited slightly, is included.

Contents of the downloaded zip file

You will need to download Gettysburg_by_Britton.ogg and use the sox program to convert it into a mono Wave format file. You will also need the current (unreleased) version of the Jaivox library from github.

The program for word alignment can be downloaded here (wordalign.zip).

The downloaded zip file contains a directory named code with several sub-directories.

  1. code/cutasr reads a large audio file in the Wave format and cuts it into many smaller files. It then submits each of these smaller files to the Google speech recognizer and collects the results into the file work/result.txt. Note that you need the Jaivox library for this program, but you can also just use the work/result.txt we produced earlier.
  2. code/data contains the text of the Gettysburg speech. It also contains a readme.txt file describing how you can download an audio file from Wikipedia and convert it to the mono Wave file required by the cutasr program.
  3. code/work is a work directory; it contains the result.txt from a previous run of the cutasr program. If you want to run cutasr yourself, put the audio file you downloaded in this directory and use sox to convert it with the command sox Gettysburg_by_Britton.ogg -c 1 Gettysburg_by_Britton_mono.wav
  4. code/timing is a program to read work/result.txt and data/Gettysburg.txt to produce a listing of each word in the text file and the time interval where the word occurs in the audio file.
  5. code/sphinx is a version of the cutasr program that uses the Sphinx recognizer. To run this, you need to install Sphinx4. You also need to have a language model wsj5kc.Z.DMP that usually comes with some Sphinx4 examples. This recognizer, however, does not work well enough to get good alignment results.

You can run the program in the timing directory using the work/result.txt that we created earlier. The timing program aligns words to audio using the (partial) recognition results in result.txt.

How it works

The audio lasts for over 100 seconds. It is cut up into many overlapping audio files, each lasting about five seconds, using some utilities in the Jaivox library. Each of these overlapping files is then sent through a speech recognizer (in this case Google's). The recognition is not perfect, but many of the words are recognized correctly. The results are written to work/result.txt.

To run cutasr use

javac cutasr.java
java cutasr
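
The cutting itself is done with utilities in the Jaivox library. As a rough illustration of the idea (this is not the actual cutasr code), the sketch below uses the standard javax.sound.sampled classes to cut a mono Wave file into overlapping pieces. The five second piece length, the one second step and the file names here are assumptions made for the illustration.

import javax.sound.sampled.*;
import java.io.ByteArrayInputStream;
import java.io.File;

// Illustration only: cut a mono Wave file into overlapping pieces of about
// five seconds, roughly what cutasr does before sending each piece to the
// recognizer. The piece length, step and file names are assumptions.
public class CutSketch {
  public static void main (String args []) throws Exception {
    AudioInputStream in = AudioSystem.getAudioInputStream (
      new File ("Gettysburg_by_Britton_mono.wav"));
    AudioFormat fmt = in.getFormat ();
    int frameSize = fmt.getFrameSize ();
    int bytesPerSecond = (int) (fmt.getFrameRate () * frameSize);

    // read the whole recording into memory
    int total = (int) (in.getFrameLength () * frameSize);
    byte all [] = new byte [total];
    int off = 0, n;
    while (off < total && (n = in.read (all, off, total - off)) > 0) off += n;

    int pieceBytes = 5 * bytesPerSecond;  // each piece lasts about five seconds
    int stepBytes = 1 * bytesPerSecond;   // pieces start one second apart, so they overlap
    int count = 0;
    for (int start = 0; start + pieceBytes <= total; start += stepBytes) {
      byte piece [] = new byte [pieceBytes];
      System.arraycopy (all, start, piece, 0, pieceBytes);
      AudioInputStream out = new AudioInputStream (
        new ByteArrayInputStream (piece), fmt, pieceBytes / frameSize);
      AudioSystem.write (out, AudioFileFormat.Type.WAVE,
        new File (String.format ("piece%03d.wav", count++)));
    }
  }
}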

The program in code/timing compares the results with segments of the original text. Here we use various heuristics about the recognition to figure out places where the text and audio may agree. This gives us intervals of less than five seconds where certain words are spoken.
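
The details are in timing.java. As a sketch of the kind of heuristic involved (again an illustration, not the actual code), the fragment below looks for a place in the original text where the first few words of a recognized chunk appear in order; such a run ties those words to the few seconds of audio covered by that chunk. The minimum run length of three words is an assumption.

// Illustration of the kind of matching heuristic used, not the actual timing.java:
// find where the leading words of a recognized chunk line up with the original text.
public class MatchSketch {
  // returns the index in the original text where at least minRun consecutive
  // recognized words match, or -1 if there is no such run
  static int findRun (String original [], String recognized [], int minRun) {
    for (int i = 0; i < original.length; i++) {
      int j = 0;
      while (j < recognized.length && i + j < original.length
             && original [i + j].equalsIgnoreCase (recognized [j])) j++;
      if (j >= minRun) return i;
    }
    return -1;
  }

  public static void main (String args []) {
    String original [] = "four score and seven years ago our fathers".split (" ");
    // line 001 of result.txt begins with these words; in the timing output
    // the word "and" is placed starting at second 1
    String recognized [] = "and seven years ago our fathers brought forth".split (" ");
    System.out.println ("chunk words start at original word "
      + findRun (original, recognized, 3));
  }
}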

To run timing use

javac timing.java
java timing

Starting with this information, the program guesses the time intervals when various nearby words are spoken.
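
As a sketch of this step (an illustration of the general idea, not the actual timing.java), words that fall between two anchored words can simply be spread evenly over the time between the anchors. The actual program uses its own heuristics, so its numbers will not match this sketch exactly.

// Sketch only, not the actual timing.java: spread the words between two anchored
// words evenly over the time between the anchors.
public class GuessSketch {
  public static void main (String args []) {
    String words [] = { "fathers", "brought", "forth", "on", "this", "continent" };
    double anchorStart = 4.0;   // suppose "fathers" was anchored near second 4
    double anchorEnd = 6.0;     // and "continent" near second 6
    double step = (anchorEnd - anchorStart) / (words.length - 1);
    for (int i = 0; i < words.length; i++) {
      double t = anchorStart + i * step;
      // print word number, rough start and end seconds and the word itself,
      // in the same four column layout as the timing program
      System.out.println (i + " " + (int) t + " " + ((int) t + 1) + " " + words [i]);
    }
  }
}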

The output consists of a description of each word in the Gettysburg address along with a small time interval where the word was spoken.

The output looks like the following


java timing
0 0 1 four
1 0 1 score
2 1 2 and
3 1 2 seven
4 1 2 years
5 2 3 ago
6 3 4 our
7 4 5 fathers
8 4 5 brought
9 4 5 forth
10 4 5 on
11 4 5 this
12 5 6 continent

etc.

The output is organized as four columns. Column #1 is the word number (counting from 0), column #2 is the starting second of the word, column #3 is the ending second of the word, and the last column is the word in Gettysburg.txt that occurs between the times given in the previous two columns. Thus, for example, the word seven in column #4 is word number 3 (counting from 0, as given in column #1) and occurs in the audio between the start of second 1 (column #2) and the start of second 2 (column #3).
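
If you want to use these timings in another program, you can redirect the output of java timing to a file and read it back with a few lines of Java. The sketch below assumes the output was saved to a file named timings.txt; the file name is just an assumption.

import java.io.*;

// Minimal sketch of reading the four column output described above.
// The file name timings.txt is an assumption; use whatever file you saved the output to.
public class ReadTimings {
  public static void main (String args []) throws IOException {
    BufferedReader in = new BufferedReader (new FileReader ("timings.txt"));
    String line;
    while ((line = in.readLine ()) != null) {
      String part [] = line.trim ().split ("\\s+");
      if (part.length < 4) continue;            // skip blank or malformed lines
      int number = Integer.parseInt (part [0]); // word number, counting from 0
      int start = Integer.parseInt (part [1]);  // starting second
      int end = Integer.parseInt (part [2]);    // ending second
      String word = part [3];                   // the word from Gettysburg.txt
      System.out.println (word + " is spoken between seconds " + start + " and " + end);
    }
    in.close ();
  }
}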

You can verify whether these timings are correct by checking the times at which different words are spoken in the audio, using a tool such as Audacity.

Using the Sphinx recognizer

The code/sphinx directory contains cutasrSphinx.java. This uses the Sphinx recognizer instead of Google. You need to install Sphinx4 along with the WSJ (Wall Street Journal) audio model. You also need a WSJ language model wsj5kc.Z.DMP that comes with the Sphinx4 package. Please see the batch.xml file in the code/sphinx directory for details.

Our observation is that the recognition results from Sphinx are not very useful. It would be difficult to align the words using Sphinx's output.

For example, here are the first few lines of result.txt when we use the Google recognizer.

000 four score and seven years ago our fathers brought forth
001 and seven years ago our fathers brought forth on this continent a new
002 ago our fathers brought forth on this continent a new nation
003 our fathers brought forth on this continent a new nation conceived
004 does brought forth on this continent a new nation conceived in Liberty

By contrast, the same audio is recognized very poorly by Sphinx.

000 full auto and seventy years ago
001 off autos road fulton is confident a new
002 off autos road fulton is confident a new nation
003 off autos road fulton is confident a new nation
004 gone is confident a new nation

There may be other ways to improve recognition and timing, for example by using phonemes or Sphinx3, but for the simple alignment strategies used in this example it is better to use the Google recognizer.

Limitations of this program

The programs included here are meant for illustration, not for general use. The alignment method is rather simple and may not work well on larger data, mainly because of the way we look for commonalities between the recognized text and the original text. A more correct approach would advance the time as we align, but that would make the program harder to understand.

Several parameters are used in the program; these may need to be adjusted based on the recognition accuracy. It is also possible to align better by breaking the recognized and original texts into phonemes, but this is not included in the program described here.

The location of each word can be determined more precisely by analyzing the audio files to spot word breaks and other audio features. This requires a much more detailed analysis of the audio data.
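
As a rough illustration of what such an analysis might involve (this is not part of the programs described here), the sketch below computes the energy of short frames of a mono 16-bit Wave file; stretches of low energy are candidate pauses between words. The 20 millisecond frame length and the silence threshold are assumptions.

import javax.sound.sampled.*;
import java.io.File;

// Rough illustration only, not part of the programs described here:
// compute the energy of short frames of a mono 16-bit Wave file and report
// stretches of low energy, which are candidate pauses between words.
public class EnergySketch {
  public static void main (String args []) throws Exception {
    AudioInputStream in = AudioSystem.getAudioInputStream (
      new File ("Gettysburg_by_Britton_mono.wav"));
    AudioFormat fmt = in.getFormat ();   // assumed 16-bit mono little-endian PCM,
                                         // which is what sox produces for Wave by default
    int total = (int) (in.getFrameLength () * fmt.getFrameSize ());
    byte all [] = new byte [total];
    int off = 0, n;
    while (off < total && (n = in.read (all, off, total - off)) > 0) off += n;

    int frameSamples = (int) (fmt.getSampleRate () / 50);  // 20 ms analysis frames
    double threshold = 500.0;                              // assumed silence threshold
    for (int frame = 0; (frame + 1) * frameSamples * 2 <= total; frame++) {
      double sum = 0;
      int base = frame * frameSamples * 2;
      for (int i = 0; i < frameSamples; i++) {
        // assemble a signed 16-bit little-endian sample from two bytes
        int s = (all [base + 2 * i + 1] << 8) | (all [base + 2 * i] & 0xff);
        sum += (double) s * s;
      }
      double rms = Math.sqrt (sum / frameSamples);
      if (rms < threshold)
        System.out.println ("possible pause near " + (frame * 0.02) + " seconds");
    }
  }
}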

(February 12, 2014)