July 15, 2008
By Leonard Klie, Editor, Speech Technology and CRM magazines
Features

Good Enough for the G-Men

In the mid-1950s, U.S. and British intelligence constructed an underground tunnel into East Berlin to tap into the communications of the Soviets. The nearly 1,500-foot tunnel, located about 20 feet below ground, ended at a telephone junction where three crucial phone lines came together. Inside the tunnel, CIA operatives listened in on everything from top-level talks between Moscow and the Soviet embassy to military barracks chatter. In the 11 months that the tunnel was active, the CIA recorded about 500,000 calls on 50,000 tapes. CIA translators and analysts worked around the clock on the data collected, most of which was useless or undecipherable.

Though the technologies involved have changed dramatically since then, it’s not hard to envision another team of agents holed up today in an underground bunker somewhere on the border between Afghanistan and Pakistan. Their likely mission: to intercept and listen to voice traffic to locate Osama bin Laden. Spy satellites would feed a constant flow of voice and video data into the bunker, where the agents would also monitor TV and radio broadcasts, telephone calls, and Internet traffic throughout the region. The agents would spend their days wearing headphones, huddled around high-tech eavesdropping and recording devices. Back at CIA headquarters in Langley, Va., hundreds of highly trained translators, processors, and analysts would comb through the thousands of hours of unstructured audio files collected to gather clues. They would rely heavily on audio search and mining technologies to quickly screen the recordings for words or phrases that might tip them off to a potential threat to U.S. interests and to identify the voices of known suspects who might perpetrate such acts.

It’s not a script for the latest espionage thriller. It’s a lot closer to reality than you might think. The world over, "a fair amount of governments are doing a tremendous amount of screening, particularly of wireless signals," says Judith Markowitz, a consultant with years of experience in the speech security field. In fact, industry sources have said that more than half of the world’s voice traffic is recorded for security purposes, though only a fraction of those recordings are ever stored or analyzed. Additionally, more than 600 radio and television broadcasts from around the world are regularly monitored, recorded, and analyzed.

Only a few years ago, mining speech-based information from sources like these was a manual process, but the volume of collected material has now far outstripped human capacity alone. Not surprisingly, some intelligence industry sources have said that data obtained each month by agencies like the CIA, Department of Homeland Security, FBI, National Security Agency, State Department, Defense Department, and countless others is leading to a government information overload. The data collected cannot be measured in bytes, gigabytes, or even terabytes, but rather in petabytes. One petabyte is the equivalent of 1 quadrillion (that’s a 1 followed by 15 zeros) bytes of information, or enough data to fill the Library of Congress 50 times. Industry sources also expect the amount of intelligence that has to be analyzed and indexed to double every six months, largely driven by a high volume of information collected from satellite surveillance and electronic eavesdropping.
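To put the doubling claim in perspective, here is a minimal sketch of the arithmetic in Python. The function name and the one-petabyte starting point are assumptions chosen for illustration; the sources quoted here give no baseline figure.

```python
def projected_volume(start_petabytes: float, months: int,
                     doubling_months: int = 6) -> float:
    """Project data volume, assuming it doubles every `doubling_months`.

    The six-month doubling period is the figure cited by industry
    sources; the starting volume is a hypothetical input.
    """
    return start_petabytes * 2 ** (months / doubling_months)


if __name__ == "__main__":
    # Starting from a hypothetical 1 petabyte backlog.
    for years in (1, 2, 5):
        volume = projected_volume(1.0, years * 12)
        print(f"after {years} year(s): {volume:,.0f} PB")
```

At that rate, a backlog grows fourfold in one year and roughly a thousandfold in five, which is why manual review alone cannot keep pace.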

This steady increase in the volume of data needing analysis, along with a greater sensitivity to the risk of missing vital intelligence that could prevent another 9/11-style attack, are the two main factors leading to the continued and expanding use of audio search and mining technologies by the government. "There’s no way in heaven that a group of people anywhere can go through all the data," says Donna Fluss, president of DMG Consulting. "They can use speech technologies to identify which items need more attention."

Prem Natarajan, vice president of speech solutions at BBN Technologies, agrees. "The perceived importance [of audio search and mining] is something everyone can understand: the need to go through large amounts of data quickly," he says. Most of the work that his firm does is in the military and government sectors.

But because of the sensitive nature of government contracts in the intelligence area, most technology providers are not at liberty to discuss specific details of their work within the public sector. However, analysts point out that for companies like BBN, Verint, Utopy, CallMiner, Nexidia, NICE Systems, Nemesysco, Aurix, VoiceSense, and others, government contracts are big business. "A lot of the companies that do call center technologies have gotten a lot of funding from the government," Markowitz says. Every year, millions of taxpayer dollars go to private firms like these to develop and improve the technology.

Truth be told, audio search and mining technologies really got their start in the military and government years before they became available for the private sector in 2004, according to Fluss. "The military and government was really the heritage of the technology, but at the time that it was introduced, it did not have all the bells and whistles that it does now," she says.

Good for Business
For private-sector companies, the significant government interest and investment in audio search and mining "legitimizes the technology," Markowitz notes. "If the quality is good enough for the government, it should be good enough for what you want to do with it in your call center. What the government needs [audio search and mining to do] is far more challenging than what a call center needs, so it should be no problem in a corporate environment."

Topping the government’s list of greater demands from the technology is support for a plethora of languages. Unlike corporate call centers, where the predominant language is English, counterterrorism and international intelligence gathering can involve hundreds of languages and dialects. "The number of languages you have to deal with in the government is certainly much higher," BBN’s Natarajan states. "The emphasis is on a much broader set of languages, which presents a much different set of challenges."

For some languages, there might not be enough training data available, he explains. Others might not have a standard written form or might have hundreds of dialects and regional variations.

Adding to the complexity, many of these intercepted communications are more conversational in nature, Natarajan continues. "The language structure changes. There is not the same turn-taking that you get in formal IVR dialogues," he says. "You also get a lot of colloquialisms and jumping between topics."

Speaker Recognition
Further complicating matters, all of the speakers in an intercepted conversation might not be known or recognizable. This differs greatly from a call center, where the dialogue typically involves a customer, who provides some means of identification, such as a name or an account number, and a live agent, who is employed by the company.

Another challenge faced by the intelligence community is that the parties to intercepted conversations seldom speak in specific terms, but rather, they often talk in code. As an example, Jeff Gallino, chief technology officer at CallMiner, cites the masterminds behind the terrorist attacks of 9/11.

"The 9/11 guys did not talk about planes or bombs or the World Trade Center, but about a wedding," he says. "So you can’t use traditional mining methods. You have to use underlying analytics to look at the context [of a conversation] to uncover the meanings—not just the words, but the external factors to find out what those words mean."

That kind of data is more difficult to compile. "For a system to tell me every time someone says a word, that’s easy," Natarajan adds. "For it to tell me every time he talks about a topic, that’s harder to do because of context."

Part of the reason for that is that typical speech analytics and mining solutions "look at preprogrammed information to find the statistically relevant information," notes Daniel Ziv, vice president of customer interaction analytics and business interaction intelligence at Verint. "Are we looking for the right words? Is it even the right environment?"

Audio search and mining technologies, therefore, "need to be as targeted and high-precision as you can make them," Natarajan says. "You need to build as large a vocabulary as you can. In government, having a half million words is not uncommon." The typical call center vocabulary, on the other hand, is often fewer than 50,000 utterances.

The intelligence community faces another challenge with regard to audio search and mining that typical call centers do not. Because of how the voice recordings are obtained, sound quality is often not as good. Signal strength, acoustics, background noise, side conversations, weather, and environmental factors all play a role in degrading the quality of recordings. "There is some recent work we’ve seen in improving microphones and recording technologies, but some things need to be pushed further along," Natarajan says.

As if all of these other factors weren’t enough to make even the most intrepid in the intelligence community cringe, perhaps no factors work more heavily against government eavesdropping than those that involve legal issues and the court of public opinion. While a basic call center can usually cover itself with a simple warning—This call may be monitored or recorded for quality assurance—many more gray areas exist in intelligence gathering. For the government, "the technologies can help more than they can hurt, but it’s a balancing act to make sure they are used in the right way," Ziv says.

Another balancing act has to occur between the technology’s capabilities and human intuition and skills, Fluss says. "Where speech analytics is really good is identifying those pieces that need further analysis," she explains.

And even then, "it’s not just about the technology, but what you do with the information and what you can glean from it," Fluss continues. "What good is it to know about a terror plot ahead of time if you can’t take the steps to prevent it?"

Business Benefits
Meeting all these challenges has admittedly been a struggle, but because of their roots with the government, vendors of audio search and mining technologies are in a better position to serve their corporate clients. "These are still relatively new applications on the commercial side, but they should capture real attention because of their ability to deliver quantifiable results," Fluss says. "The amount of money being invested in the technology is tremendous, and since it has been introduced into the commercial world, it has just gotten better and better."

Since the solutions have become available to the private sector:
•    More insight can be gleaned from the data. With the typical audio file, every word and phrase is indexed and filed along with a record of where and how often it was said during the conversation, the context in which it was used, and who said it. Time stamps are attached to each word as well, so they can easily be located within the index. Solutions are also able to assign speaker identifications to each party of a conversation. Adaptive learning allows vocabularies and search parameters to expand naturally and dynamically over time.
•    More foreign language capabilities are available. When dealing with foreign-language recordings, the system has to be able to translate the file into English so it can be analyzed and indexed. This has also given rise to many of the advances in machine translation and natural language processing, such that applications today can, for example, match a single Arabic name to its many English-language spellings and pronunciations. Nexidia alone claims that its audio mining technologies support 33 languages.
•    The accuracy of solutions is much better, in the range of 85 percent to 90 percent for most applications.
•    The speed of solutions is much better, at rates of up to 100 times real time, coupled with the ability to filter through multiple files simultaneously.
•    More open-source technologies have been introduced as a result of the many conversations of interest that now occur within more nontraditional modes, including fixed, cellular, and Voice over Internet Protocol (VoIP) telephony, the Web, email, chat, instant messaging, blogs, broadcasts, and video files.
•    Predictive analytics that can identify trends and patterns that might indicate a planned course of action or future behavior have also emerged. In the case of a corporate call center, for example, this might allow businesses to identify potential customer churn or target particular callers with the appropriate up-sell and cross-sell opportunities.
•    Visualization tools that display findings in multidimensional forms, such as charts, graphs, and maps, that indicate the relationship of data points in geographic settings or over time have also been incorporated.
•    Improvements in indexing, compression, and storage systems have cut the amount of memory that files require. In many cases, organizations can store just the index and trash the original source recording after it has been analyzed.
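The word-level indexing described in the first item above can be pictured as an inverted index that maps each word to the places it was spoken. What follows is a minimal sketch in Python, not any vendor's implementation: the `AudioIndex` and `Hit` names are hypothetical, and the input is assumed to be recognizer output already reduced to (word, speaker, time offset) tuples.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One occurrence of a word: which call, who said it, and when."""
    call_id: str
    speaker: str
    seconds: float


class AudioIndex:
    """Hypothetical inverted index over time-stamped transcripts."""

    def __init__(self):
        self._hits = defaultdict(list)

    def add_transcript(self, call_id, words):
        # `words` is an iterable of (word, speaker, seconds) tuples,
        # assumed to come from a speech recognizer.
        for word, speaker, seconds in words:
            self._hits[word.lower()].append(Hit(call_id, speaker, seconds))

    def find(self, word):
        """Return every recorded occurrence of `word`, in insertion order."""
        return list(self._hits.get(word.lower(), []))

    def count(self, word):
        """Return how often `word` was said across all indexed calls."""
        return len(self._hits.get(word.lower(), []))


if __name__ == "__main__":
    index = AudioIndex()
    index.add_transcript("call-1", [
        ("cancel", "customer", 12.4),
        ("my", "customer", 12.9),
        ("account", "customer", 13.1),
        ("account", "agent", 20.7),
    ])
    for hit in index.find("account"):
        print(hit.call_id, hit.speaker, hit.seconds)
```

Because each hit carries its own time offset and speaker label, a reviewer can jump straight to the moment a word was said. An index like this is also the kind of record an organization could keep after discarding the source audio.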

What’s Next?
So where does the technology go from here? For most vendors, the biggest area of development is or will be in linking text, video, and audio data. According to Verint’s Ziv, that area is evolving quickly. "With these types of applications, in the future you will see a mix of camera and speech data," he predicts. "[Analysis] is being done separately now, but I can see a convergence, correlating and combining the feeds."

"I’ve heard of a couple of systems where the audio input is tied to the video feed to determine whether something criminal is going on," Markowitz adds. "Before, they could just work on the audio."

These efforts have also led to rapid advancements in real-time foreign broadcast monitoring, an endeavor into which more and more vendors are entering.

And while foreign broadcasts can be transcribed and translated in real time, users of speech search and mining applications do not yet have the benefit of real-time analysis. "Being able to do things [with live voice files] is more in the movies right now," Natarajan says. But don’t be surprised to see that capability in the near future.

And as far as how the technology is being used, current applications "are just the tip of the iceberg," Fluss emphasizes. "We can all expect to see some really creative and enterprising uses. There are so many high-value uses in the pipeline. We’re really limited only by our own imaginations."

9-1-1: Where Government and Call Center Meet
In the government sector, one area that has seen a lot of traction of late for audio search and mining technology is the call centers that field 9-1-1 calls. The reasons are obvious: Better call handling and reduced holding times mean that police, fire, and emergency services personnel can be dispatched much more quickly.

When compared with traditional call centers, "an effective agent response is much more important," says Daniel Ziv, vice president of customer interaction analytics and business interaction intelligence at Verint, an analytics solution provider that counts many police departments and emergency services operations among its clients. "It’s a matter of life and death, and shaving seconds off a dispatch could save lives."

Conversely, "calls being placed on hold, etc., can have disastrous effects," adds Brendan Dillon, customer interaction analytics program manager at Verint. "With [audio search and mining], processes can be greatly improved."

In particular, many 9-1-1 call center managers are using the technologies for agent training to make sure agents are handling and routing calls properly, all while maintaining the highest levels of sensitivity and professionalism. That is important, according to Jeff Gallino, chief technology officer at CallMiner, because "what’s hard about 9-1-1 is that everyone calling in is in a high emotional state."

Calls to 9-1-1 are typically logged and recorded; using speech search and mining, departments can retrieve, replay, reconstruct, and distribute those recordings quickly. With just a few key strokes, they can search call logs by date, time, proper names, locations, conditions, or other terms related to the case or incident.

With 9-1-1 tapes being used as evidence in trials, search and mining technologies are also helping departments locate specific call information quickly. "In prepping for court cases, they are required to give over all relevant call logs," Ziv says. "There could be hundreds of calls, and sifting through them all could take days and days otherwise."

And of those calls, many are not emergency-related. Departments can use analytics to figure out why those calls are coming into the call center and take steps to prevent them, Gallino maintains.

Additionally, audio search and mining can help police identify high-crime areas and position their resources to address them. For police, the greatest value from the technology comes "when a bunch of callers are saying the same things and they can spot trends," Ziv explains. "If [conditions] are mentioned suddenly more often, they can see that. If there’s an increase in a particular crime in a specific area, that will surface automatically."
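The trend spotting Ziv describes can be approximated by comparing how often a term shows up in the latest batch of calls against its average in earlier periods. The Python sketch below is a simplified illustration under stated assumptions, not Verint's method: the `spiking_terms` function, the doubling threshold, and the minimum count are all hypothetical choices.

```python
from collections import Counter


def spiking_terms(history, current, ratio=2.0, min_count=3):
    """Flag terms mentioned far more often now than in earlier periods.

    history: list of Counter objects, one per earlier period.
    current: Counter of term mentions in the latest period.
    ratio and min_count are assumed tuning values, not industry figures.
    """
    flagged = []
    for term, count in current.items():
        if count < min_count:
            continue  # too few mentions to call a trend
        if history:
            baseline = sum(period[term] for period in history) / len(history)
        else:
            baseline = 0.0
        # Treat any baseline below one mention per period as one, so that
        # rare terms are not flagged on tiny absolute changes.
        if count >= ratio * max(baseline, 1.0):
            flagged.append(term)
    return sorted(flagged)


if __name__ == "__main__":
    earlier_weeks = [
        Counter({"burglary": 2, "noise": 5}),
        Counter({"burglary": 1, "noise": 4}),
    ]
    this_week = Counter({"burglary": 6, "noise": 5, "flood": 1})
    print(spiking_terms(earlier_weeks, this_week))
```

In this sample data, "burglary" is flagged because its mentions quadrupled against the earlier average, while "noise" stays near its usual level and "flood" appears too rarely to count.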
