Speech Technology Magazine

 

Making Public Records Public

An ambitious archiving project involving audio indexing and search technologies puts Washington state on the map.
By Leonard Klie - Posted Feb 6, 2009

Not long ago, more than 30,000 magnetic audiocassettes containing recordings of legislative sessions, committee hearings, and bits of politics and public policy around Washington state were languishing in a storage area of the Capitol building in Olympia. Given the age of some of these cassettes (some dated as far back as the early 1970s), many had started to degrade, placing their recorded content in danger of being lost forever. 

Today, thanks to an ambitious project begun in 2004 with Microsoft and global technology services provider EDS, most of those audio records have been rescued and converted to .WAV files, which are stored electronically in the state’s digital archives. The archives are a historical database of almost 75 million items that can be accessed 24 hours a day on the Web at www.digitalarchives.wa.gov. The audio files appear in the “Audio Recordings” section of the “Collections” drop-down menu. 

In addition to the digitized audio files, the state archives also contain a rather large collection of government-related photos, spreadsheets, newspaper clippings, and maps, the original state constitution, election results dating back to when Washington became a U.S. territory in 1854, and other government documents, such as birth, marriage, death, census, military, and naturalization records dating back to the Indian Wars. 

Washington’s assistant secretary of state, Steve Excell, calls the rescued audio “information that would have gone away otherwise.” But with the recordings now intact, “a hundred years from now, people will be able to go in, listen to, and track any issue that went through our legislature,” he says.

And while simply converting the more than 60,000 hours of recordings to digital format was enough to earn the state a reputation—and funding from the U.S. Library of Congress—as a trailblazer in the preservation of audio materials, Washington officials didn’t stop there. 

“There was no way to search [the files],” Excell says. “Without that, we would be preserving them without a way for people to get to them. We would be storing them without them being useful.”

The state again looked to Microsoft, and through a unique partnership with Microsoft Research became the first government program in the country to make it possible for the public to search through recorded archival materials to pull up specific instances when a particular word or phrase is mentioned. For example, typing in “volcano” as a keyword in the search tab brings up a list of all the hearings in which the state’s five active volcanoes were discussed. Users can speed through thousands of hours of audio to selectively access only the content that is important to them. 

Users of the archives—who include government officials, students, historians, genealogists, lawyers, and journalists—can search the stored files by name, title, record type, county, or keyword, and also add multiple search terms to refine their searches, according to Excell. “Once you have the digital audio file, the search function takes over, and you can find anything, anywhere,” he says.

That’s great news for historians and anyone who just might want to listen, over the Internet, to an elaborate discussion of the effects of the eruption of Mount St. Helens on the state’s asparagus crop during a 1980 House Agriculture Committee hearing.

“Now you can actually go exploring. You can listen to a whole recording or to the specific points where something is mentioned,” Excell says. “Before, you had to know the actual date of the meeting or hearing to get to the information.”

How It’s Done

The Microsoft audio search technology, developed at Microsoft Research’s labs in China, is referred to as the Microsoft Audio/Video Indexing System (MAVIS). It uses a large-vocabulary continuous speech recognition (LVCSR) engine to index the spoken content of recorded conversations. LVCSR turns the audio signals into text using a preconfigured vocabulary and language grammar, and then converts the text data into an index that contains information about all of the words in the recording, where and how often they occur in the files, similar words that may have been used in reference to the topic, and other metadata, explains Behrooz Chitsaz, director of intellectual property strategy at Microsoft Research.
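The kind of index Chitsaz describes can be pictured as an inverted index: a lookup table from each recognized word to the recordings and time offsets where it was spoken. The sketch below is only an illustration of that general idea, not MAVIS itself, whose internals the article does not detail. The recording names, timestamps, and the `build_index` and `search` helpers are all hypothetical, and the recognizer output is assumed to arrive as (word, start time) pairs.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    recording: str  # hypothetical recording identifier
    seconds: float  # offset of the word within that recording


def build_index(transcripts):
    """Map each lowercased word to every place it occurs.

    `transcripts` maps a recording id to a list of (word, start_seconds)
    pairs, the sort of time-stamped text a recognizer might emit (assumed).
    """
    index = defaultdict(list)
    for recording, words in transcripts.items():
        for word, seconds in words:
            index[word.lower()].append(Hit(recording, seconds))
    return index


def search(index, *terms):
    """Return hits for the first term, limited to recordings containing all terms."""
    if not terms:
        return []
    terms = [t.lower() for t in terms]
    # Recordings that mention every search term at least once.
    matching = set.intersection(
        *({hit.recording for hit in index.get(t, [])} for t in terms)
    )
    return [hit for hit in index.get(terms[0], []) if hit.recording in matching]


# Made-up sample data for illustration only.
transcripts = {
    "house_agriculture_1980": [
        ("the", 0.0), ("volcano", 12.5), ("asparagus", 14.0), ("volcano", 95.2),
    ],
    "senate_transport_1975": [("ferry", 3.0), ("volcano", 40.0)],
}
index = build_index(transcripts)
volcano_hits = search(index, "volcano")
refined_hits = search(index, "Volcano", "asparagus")
```

Here a search for "volcano" alone returns every mention in both recordings, while adding "asparagus" narrows the hits to the one recording that contains both words. Storing the offset with each hit is what would let a listener jump to the specific point where a word is mentioned rather than play the whole file.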

That’s a big difference from a year ago when Microsoft first got involved in the indexing project. “When we started, there was no metadata with the files, no insight into what was in the recordings. Even the titles and file names were no help,” Chitsaz says. 

LVCSR-based audio search systems can typically produce fairly accurate search results, but accuracy is never guaranteed. In Washington state’s case, it has been hampered by the age of many of the original recordings, the recording environment, acoustics, sound quality, and volume. “For some [files], the sound quality is really, really good, and for others it’s not so good,” Chitsaz says. “So accuracy can range from 60 percent to 98 percent, depending.”
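The article does not say how those accuracy figures were measured. One common convention in speech recognition, assumed here purely for illustration, is word accuracy: one minus the word error rate, where errors are the substitutions, insertions, and deletions needed to turn the recognizer's output into a human-made reference transcript. The `word_accuracy` function and the sample sentences below are hypothetical.

```python
def word_accuracy(reference, hypothesis):
    """Return 1 - WER, using word-level edit distance (an assumed metric)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Dynamic-programming edit distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr

    errors = prev[-1]
    # Many insertions can push WER above 1, so clamp accuracy at zero.
    return max(0.0, 1 - errors / len(ref))


# Made-up example: one wrong word out of six.
accuracy = word_accuracy(
    "the volcano damaged the asparagus crop",
    "the volcano damaged the asparagus crops",
)
```

Under this definition, a single misrecognized word in a six-word sentence gives an accuracy of five-sixths, or roughly 83 percent, which falls inside the range Chitsaz cites.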

But Chitsaz notes that like most speech technologies, the audio search solution is adaptive. “We’re constantly training the system with new data and vocabulary,” he says. “We’re constantly tuning and updating it.”

Jerry Handfield, the state archivist, is working with Microsoft to improve the engine’s spelling and proper name recognition. “It’s definitely a work in progress,” he says.

Improving accuracy will be important as Washington state officials move forward with efforts to expand the archives. They have yet to digitize about 10,000 hours of recordings dating from 2001 to the present. Also, a number of older tapes still have to be prescreened to determine whether the sound quality is sufficient for them to be added to the digital archives. 

But for Excell, the real coup will come when the search technology is applied to archived video material, such as broadcasts of local public-access television station TVW. “At some point, we’re sure to have video up [on the archives] that is searchable with the same Microsoft technology,” Excell says. “Thinking about the power of using audio to search video is incredible.”

Washington state’s use of the audio search technology actually happened quite “serendipitously,” according to Excell. “[Microsoft was] looking for a customer, and we were looking for a solution,” he says. “It all came together [in 2008], but it really came about because Microsoft was in on the ground floor. They already knew our architecture because they helped to lay it originally.”

That architecture is based on Microsoft’s Visual Studio integrated development environment, one of the primary software tools used to create the digital archives. All of the files in the archives are housed on a Microsoft SQL Server. A Microsoft BizTalk Server provides a direct input link between hundreds of state and local government officials and the digital archives. And unlike paper or cassette records that can be compromised or destroyed, these electronic records are protected with a digital lock, redundant copies, and off-site backups.

The initial audio filing system and the recently added audio search capabilities have both been a huge success, according to Excell. “We’ve been very pleased. It was exactly what we were looking for,” he says. “It represents two wins for us: preservation and accessibility.”

Microsoft has benefited as well. “The technology is still in research, but with the feedback we’ve gotten from [the Washington project], we’re looking at our marketing and product strategy right now [to see] how we can take this to market,” Chitsaz says. 
