Speech Technology Magazine

 

Application Development: Just The Facts

The information age is quickly becoming the age of information overload. We often just want to get the gist of a news story, document or meeting, rather than have to read or sit through the whole thing. However, while we intuitively know what the gist is, it is difficult to define what processes are needed to automatically extract the gist from a source.

Broadly speaking, gisting falls somewhere in the range from simply determining the topic to providing a full prose summary. While for a human these are fairly easy tasks, it is still beyond the state of the art for computers to be able to understand a text and pull out relevant content or even for speech recognition systems to be able to transcribe all of the words correctly, especially in noisy conditions or for conversational speech.

Between the two extremes of determining the topic and full understanding is what we call information extraction, that is, pulling out the who, what, where, and when. This capability is a central building block of gisting. At GTE, we are pioneering the use of statistical modeling techniques for information extraction that make use of annotated examples rather than relying on hand-crafted rules. In cases where rules are available for some or all of the information we want to extract, we integrate rules into the system and use the statistical models to find the most likely combination of rules given the data. This approach makes automatic information extraction tractable and portable to new domains and languages.

Gisting from a speech signal differs from speech recognition in that not every word of the speech needs to be recognized, and is similar to information extraction in that the information that is identified is processed and used to fill templates. For example, we have shown that it is possible to process conversations between pilots and air traffic controllers in order to determine what flights are in the area, classify whether they are landing or taking off, and fill templates based on particular air traffic commands, such as heading directions or clearances. Despite the fact that these off-the-air recordings are very hard for the average person to understand, the information extraction component filled templates with a 68-90 percent precision rate, depending on the command type.

In this article, we first describe our overall statistical approach to information extraction and highlight three basic techniques: named-entity extraction from text (finding names of people, places, organizations), named entity finding from speech, and topic identification. We then describe two applications, gisting from speech in the Air Traffic Control domain, and an application to get the gist from meetings, which combines several of these new techniques.

Information Extraction

An essential building block for a gisting system is the ability to extract key pieces of information from a text. The MUC (Message Understanding Conference) program has focused on template filling from text in domains ranging from terrorism to mergers and acquisitions. Most approaches to the problem have been based on designing rule sets that look for particular patterns in a text which, when found, fill particular slots in a template.

At BBN we are pioneering a new technique based on statistical modeling in which the information extraction system learns to recognize the information from examples. As shown below, we begin with a training corpus of sentences which we annotate with the answers, that is with the categories of meaning we wish to extract. We then formulate a statistical meaning model and train it on the annotated corpus. This statistical model is used at run time to extract information from new texts.

[Figure: gistfig1.gif]

Note that while the semantic annotations are generally done by hand, we have also used a rule-based semantic parser to annotate data automatically. The model is then used to find the most likely interpretation. In the next section, we describe these techniques in more detail as they apply to extracting names.

Importance of Names

Often the most important pieces of information in a text are the names: the people, places, organizations, and dates. Though this sounds straightforward, enough special cases arise to require lengthy guidelines. For example, when is The Wall Street Journal an object and when is it an organization? When is White House an organization and when a location? Are branch offices of a bank an organization? Should yesterday and last Tuesday be labeled dates? Is mid-morning a time? For human annotator consistency, guidelines with numerous special cases have been defined for the Seventh Message Understanding Conference, MUC-7.

We have recently developed a new technique for detecting and labeling named entities in text, and implemented it in a system we call Identifinder. The algorithm uses a Hidden Markov Model (HMM) to model the word sequences in and around each type of entity. We have used Identifinder to detect names of people, locations, and organizations, as well as dates, times, percentages, and money amounts. The algorithm achieves accuracy scores in the low to mid 90s on newswire, which is very close to that of the most accurate rule-based system and significantly better than that of any other learned system. In addition, on uppercase text, the system significantly outperforms all other systems. The algorithm is easily trained on any new data, so adding several different types of names is only a matter of annotating text with the categories of those names. Figure 2 illustrates the result of name finding on English text and on Spanish text.

Figure 2. Identifinder finds names of places, people, and organizations in multiple languages

The delegation, which included the commander of the U.N. troops in Bosnia, Lt. Gen. Sir Michael Rose, went to the Serb stronghold of Pale, near Sarajevo, for talks with Bosnian Serb leader Radovan Karadzic.

Este ha sido el primer comentario publico del presidente Clinton respecto a la crisis de Oriente Medio desde que el secretario de Estado, Warren Christopher, decidiera regresar precipitadamente a Washington para impedir la ruptura del proceso de paz tras la violencia desatada en el sur de Libano.

  • Locations
  • Persons
  • Organizations

Our previous experience with handwritten rules is that each new source of text requires significant tweaking of rules to maintain optimal performance. For example, tackling the newspaper sources of the New York Times newswire after developing a rule set for the Wall Street Journal requires significant hand tuning. If the technology is good enough to be embedded in various applications, the maintenance costs for hand-crafted rule systems could be quite steep. Furthermore, moving to other modalities such as speech input, or merely to upper case text, may require substantial modification of rule sets to obtain optimal performance for each modality.

In our earlier experience with hand-crafted rules, we found that rules for one language may help very little in developing rule sets for another language. While the English rule set was suggestive for developing a rule set for Spanish, virtually nothing carried over to a rule set for Chinese.

Our new method is language independent, since the algorithm itself does not depend on any specific linguistic information. Given training data that has been annotated with the identity of the various entities to be recognized, there is an automatic training procedure that estimates models for the different entities. We have recently demonstrated the language independence of the system by applying it successfully to Spanish data.
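To make the idea of learning a name finder from annotated examples concrete, here is a minimal sketch of an HMM tagger. It is not Identifinder's model: it assumes a plain tag-level HMM with add-one smoothing, and the tag names, the tiny training set, and the functions `train`, `log_prob`, and `tag` are all hypothetical. Nothing in it is specific to English, which is the sense in which such an approach can be language independent.

```python
import math
from collections import defaultdict

START = "<s>"  # pseudo-tag marking the start of a sentence

def train(tagged_sentences):
    """Count tag transitions and word emissions in annotated sentences."""
    trans = defaultdict(lambda: defaultdict(int))
    emit = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for sentence in tagged_sentences:
        prev = START
        for word, tag_name in sentence:
            trans[prev][tag_name] += 1
            emit[tag_name][word.lower()] += 1
            vocab.add(word.lower())
            prev = tag_name
    return trans, emit, vocab

def log_prob(counts, key, n_outcomes):
    """Add-one smoothed log probability of `key` under `counts`."""
    return math.log((counts.get(key, 0) + 1) / (sum(counts.values()) + n_outcomes))

def tag(words, model):
    """Viterbi search for the most likely tag sequence."""
    trans, emit, vocab = model
    tags = list(emit)
    n_words = len(vocab) + 1  # one extra outcome for unseen words
    best = {}
    for t in tags:
        score = (log_prob(trans[START], t, len(tags))
                 + log_prob(emit[t], words[0].lower(), n_words))
        best[t] = (score, [t])
    for word in words[1:]:
        step = {}
        for t in tags:
            # Best predecessor tag for reaching tag t at this position.
            prev_tag = max(tags, key=lambda p: best[p][0] + log_prob(trans[p], t, len(tags)))
            score = (best[prev_tag][0] + log_prob(trans[prev_tag], t, len(tags))
                     + log_prob(emit[t], word.lower(), n_words))
            step[t] = (score, best[prev_tag][1] + [t])
        best = step
    return max(best.values(), key=lambda entry: entry[0])[1]

# Hypothetical annotated training data: (word, tag) pairs per sentence.
TRAIN = [
    [("president", "O"), ("clinton", "PERSON"), ("visited", "O"), ("washington", "LOCATION")],
    [("warren", "PERSON"), ("christopher", "PERSON"), ("returned", "O"),
     ("to", "O"), ("washington", "LOCATION")],
    [("talks", "O"), ("in", "O"), ("sarajevo", "LOCATION"), ("with", "O"), ("karadzic", "PERSON")],
]

model = train(TRAIN)
print(tag("clinton returned to sarajevo".split(), model))
```

Adding a new entity type in this sketch only requires annotating examples with a new tag; no decoding code changes.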

Names from Speech

There are two issues in moving from named-entity finding in text to named-entity finding in speech. First, there is no capitalization or punctuation in speech, and these two graphic features are important clues to finding names in text. Table 1 shows the impact on performance of removing case and punctuation from New York Times text data.

Table 1. Performance variation on New York Times data

Condition                     Score
Text baseline                 94.8
Upper case                    92.6
Upper case, no punctuation    90.5

Second, speech recognition is still far from perfect, especially in noisy domains or on conversational speech. Table 2 shows results on true transcriptions and on the output of the speech recognizer on broadcast news data. As these results show, statistical models are quite robust to errors and can be trained to look for clues other than capitalization and punctuation, and thus have only slight degradations in performance in these conditions.

Table 2. Performance variation on broadcast news data

Condition                                      Score
True transcription of broadcast news           87
Speech recognition with 20% word error rate    73
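For readers unfamiliar with the word error rate quoted in Table 2, the sketch below shows the standard way it is computed: the word-level edit distance (substitutions, insertions, and deletions) between the recognizer output and the true transcription, divided by the number of reference words. The function name and the example strings are illustrative only.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcription must not be empty")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

truth = "turn left heading two four zero"
recognized = "turn left heading two for zero"
print(f"WER = {word_error_rate(truth, recognized):.2f}")
```

One substituted word out of six gives a rate of about 17%, close to the 20% condition in the table.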

Topic Classification

Another useful characterization of audio content can be made by automatically labeling the topic. This information is much more useful for gisting when it produces a set of topics rather than a single one. Furthermore, sometimes topics are indicated by words that are not actually found at all in the target segment, such as the topic Economic Assistance applied to a story on US/Mexican relations, as shown below.

[Figure: gistfig3.gif]

To date, there has not been much work in the area of topic classification. Furthermore, previous work considered only a small number of topics and modeled only one topic per segment. We have developed a new model for automatic topic classification using a probabilistic HMM that is estimated from training data that has been labeled with as many as 10,000 topics. The system produces a sorted list of the most likely topics for each segment.

Our topic HMM includes a model for every topic encountered in training and a model for general language that acts as an absorber of words that are not strongly associated with any topic. Using our model, we find that most topics are related to fewer than 10% of the words in the segment; more than 90% are general-language words. At the same time, the unique words modeling a topic number only a few hundred out of a vocabulary of more than one hundred thousand. This property of the model is intuitively appealing: most words in the language are associated with specific topics, but most words in any given passage are not.
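The role of the general-language model can be illustrated with a much simpler stand-in for the topic HMM. The sketch below assumes a two-way mixture in which each word is explained either by a topic's word distribution or by general language, and it ranks topics by log-likelihood ratio against general language alone. The distributions, the `topic_weight` of 0.1, the floor probability, and the function names are all made up for illustration.

```python
import math

FLOOR = 1e-4  # assumed probability for words the general model has never seen

def topic_score(words, topic_dist, general_dist, topic_weight=0.1):
    """Log-likelihood ratio of the topic/general mixture versus general language alone."""
    score = 0.0
    for word in words:
        p_general = general_dist.get(word, FLOOR)
        p_topic = topic_dist.get(word, 0.0)
        mixed = topic_weight * p_topic + (1 - topic_weight) * p_general
        score += math.log(mixed / p_general)
    return score

def rank_topics(words, topics, general_dist, topic_weight=0.1):
    """Return topic names sorted from most to least likely."""
    scores = {name: topic_score(words, dist, general_dist, topic_weight)
              for name, dist in topics.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical word distributions (only the few words that matter are listed).
GENERAL = {"the": 0.05, "of": 0.03, "to": 0.03, "in": 0.02, "said": 0.01}
TOPICS = {
    "Economic Assistance": {"loan": 0.05, "aid": 0.05, "peso": 0.04},
    "Air Traffic": {"flight": 0.06, "runway": 0.05, "heading": 0.04},
}

story = "the president said the loan to mexico would steady the peso".split()
print(rank_topics(story, TOPICS, GENERAL))
```

Words that a topic does not model cost every topic the same small penalty, so the ranking is driven entirely by the handful of topic-bearing words, here "loan" and "peso".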

We have conducted experiments on broadcast news data using one year's worth of annotated training data containing 4600 unique topics that occurred in at least two stories. We tested the system on 1000 new stories from a later and disjoint time period. Of the top-choice topics, 76% were among the annotated topics in the hand-labeled ground truth data. This level of performance is subjectively good enough to provide useful summarization cues for the data, since many of the topics that did not match the reference annotation were due to flaws in the reference itself. Because the number of possible topics is so large, the ground truth data contained numerous errors of omission.
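The top-choice measure reported above can be stated precisely in a few lines. This sketch assumes each story has a ranked list of predicted topics and a set of hand-annotated reference topics; the function name and the three example stories are hypothetical.

```python
def top_choice_precision(rankings, references):
    """Fraction of stories whose first-ranked topic appears in the reference set."""
    if not rankings:
        raise ValueError("no stories to score")
    hits = sum(1 for ranking, reference in zip(rankings, references)
               if ranking and ranking[0] in reference)
    return hits / len(rankings)

# Hypothetical system output and hand labels for three stories.
predicted = [
    ["Economic Assistance", "Mexico"],
    ["Bosnia", "United Nations"],
    ["Elections", "Congress"],
]
annotated = [
    {"Mexico", "Economic Assistance"},
    {"Bosnia"},
    {"Congress"},  # the top choice "Elections" is missing from the labels
]

print(top_choice_precision(predicted, annotated))
```

The third story shows how an omission in the hand labels counts against the system even when its top choice may be reasonable.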

Applications

BBN's primary goal on the Gister project has been to develop the technology for the automatic extraction of information about air traffic activity based on the dialogs between air traffic controllers and pilots. Another goal is the development of the tools that will make this technology easily deployable to new air traffic domains. This project requires the integration of many information extraction techniques in order to provide an understanding of a complex human activity. In air traffic control, the conversation between a particular controller and pilot is distributed over time and interspersed with that controller's conversations with other pilots.

Execution of the Gister task requires the following steps:

  • Identify which utterances are from controllers and which are from pilots;
  • Group the appropriate controller/pilot pairs, i.e., statement-responses, together;
  • From the pairs, reconstruct the controller/pilot dialog;
  • Extract the required information from the dialog.
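The first three steps above can be sketched as a small pipeline. This is an illustration only, not the Gister implementation: it assumes, as a toy heuristic, that controllers open an utterance with the flight's callsign and pilots close with it, and that a response follows its statement within a fixed time window. The `Utterance` type, the callsign list, the `max_gap` parameter, and all function names are hypothetical, and the final extraction step is left out.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Utterance:
    time: float  # seconds from the start of the recording
    text: str    # recognizer output, lowercased

CALLSIGNS = ["delta four five one", "united two two"]  # hypothetical flights

def find_callsign(text):
    """Return the first known callsign mentioned in the text, if any."""
    for callsign in CALLSIGNS:
        if callsign in text:
            return callsign
    return None

def speaker_role(utterance):
    """Step 1 (toy heuristic): controllers open with the callsign, pilots do not."""
    callsign = find_callsign(utterance.text)
    if callsign and utterance.text.startswith(callsign):
        return "controller"
    return "pilot"

def pair_statements(utterances, max_gap=10.0):
    """Step 2: match each controller statement with the next pilot response for that flight."""
    pairs = []
    pending = {}  # callsign -> controller utterance still awaiting a response
    for utterance in sorted(utterances, key=lambda u: u.time):
        callsign = find_callsign(utterance.text)
        if callsign is None:
            continue
        if speaker_role(utterance) == "controller":
            pending[callsign] = utterance
        elif callsign in pending and utterance.time - pending[callsign].time <= max_gap:
            pairs.append((callsign, pending.pop(callsign), utterance))
    return pairs

def build_dialogs(pairs):
    """Step 3: collect the statement-response pairs of each flight into one dialog."""
    dialogs = defaultdict(list)
    for callsign, statement, response in pairs:
        dialogs[callsign].append((statement.text, response.text))
    return dict(dialogs)

recording = [
    Utterance(0.0, "delta four five one turn left heading two four zero"),
    Utterance(2.0, "united two two cleared for takeoff"),
    Utterance(4.0, "left two four zero delta four five one"),
    Utterance(6.0, "cleared for takeoff united two two"),
]
dialogs = build_dialogs(pair_statements(recording))
for flight, turns in dialogs.items():
    print(flight, turns)
```

Even in this toy form, the interleaved exchanges with two flights are separated into one dialog per flight.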

Just as the speech recognition process itself will be full of errors, so will some of the groupings and dialog creations. However, because the Gister combines information from many different techniques and produces a high-level gist rather than full understanding, it can overcome errors in these lower-level support activities.

Anyone who has listened to air traffic communications will understand that it is a difficult medium from which untrained listeners can extract information. This is due in part to the jargon of the communication, the noise on the channel, and the dispersion of the dialogs across time. What enables gisting to work is that it exploits the structure that exists within the communication. This structure includes:

  • the protocol for the interaction between pilots and controllers (usually followed more strictly by controllers than pilots);
  • the restrictions on the ways things can be said;
  • temporal constraints on interactions;
  • the limited number of activities that are discussed.

The Gisting system exploits the structure in the air traffic controller dialogs in a variety of ways. For example, recognition of the flight identification, even when only partially correct, can be a significant aid in dialog formation. However, the most important technology development for success in this project has been semantic grammars for this application. We used a context-free parser in two ways:

1. The combination of grammar rules and statistical grammar. This innovation allows a user to specify structures by means of context-free grammar rules; a statistical grammar is then created from an expansion of these rules. The resulting grammar has the advantage of a rule-based grammar, in which structure can be easily identified. It also has the advantage of a statistical grammar, in which a complete parse is not required and the probability of any word sequence is trained from data. It is more powerful than n-gram language models (models that depend on the previous n-1 words) because it can capture long-range dependencies.

2. Use of rules to parse the recognition output. By using the rules from the recognition grammar and other rules that specify other structures, we can identify semantic entities from the recognition output. These semantic entities may be needed in information retrieval applications or form filling. They were also used as input to the activity classifier.
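As a rough illustration of the second use, the sketch below applies one hand-written rule to recognizer output and fills a template for a heading command. A regular expression stands in for the context-free grammar rules described above, so this is a simplification rather than the parser we used; the template fields and the function name are hypothetical.

```python
import re

DIGIT_WORDS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
               "five": "5", "six": "6", "seven": "7", "eight": "8",
               "nine": "9", "niner": "9"}
DIGIT = "(?:" + "|".join(DIGIT_WORDS) + ")"

# Rule: "turn" DIRECTION "heading" DIGIT DIGIT DIGIT
HEADING_RULE = re.compile(rf"\bturn (left|right) heading ({DIGIT}(?: {DIGIT}){{2}})\b")

def fill_heading_template(recognized_text):
    """Return a filled heading template, or None if the rule does not match."""
    match = HEADING_RULE.search(recognized_text.lower())
    if match is None:
        return None
    degrees = int("".join(DIGIT_WORDS[word] for word in match.group(2).split()))
    return {"command": "heading", "direction": match.group(1), "degrees": degrees}

# The callsign is misrecognized, but the command itself survives intact.
print(fill_heading_template("delta for five on turn left heading two four zero"))
print(fill_heading_template("united two two cleared for takeoff"))
```

Because the rule only has to match the words that carry the command, recognition errors elsewhere in the utterance do not prevent the template from being filled.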

We found that although the speech recognition word error rates were very high (above 40%), the information was correctly extracted most of the time. Direction orders (e.g. Turn left heading two four zero) had a 91% precision and 81% recall. These results highlight one of the main advantages of this approach, that even with errorful input, useful information can be found.
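Precision and recall, as used for these results, can be computed as in the following sketch. It assumes that extracted and reference templates are represented as hashable tuples and that a template counts as correct only on an exact match; the flight identifiers and values are invented for the example.

```python
def precision_recall(extracted, reference):
    """Precision: share of extracted templates that are correct.
    Recall: share of reference templates that were found."""
    extracted, reference = set(extracted), set(reference)
    correct = len(extracted & reference)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall

# Hypothetical (flight, direction, heading) templates.
system_output = {("DAL451", "left", 240), ("UAL22", "right", 90)}
hand_labels = {("DAL451", "left", 240), ("UAL22", "right", 100), ("AAL7", "left", 180)}

precision, recall = precision_recall(system_output, hand_labels)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

High precision with lower recall, as in the direction-order results, means that the templates the system does fill are usually right, while some commands are missed altogether.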

Rough'n'Ready

One of the elusive goals of collaborative technology is the ability to record automatically the proceedings of meetings, capture their gist, and browse their contents. (We use here the word meeting to refer to both meetings and presentations, whether collocated or distributed, as in telephone conference calls and video conferences.) Ideally, one would like to be able to scan through a meeting, determine which parts are interesting or relevant, and either get a transcription of what was said, or store the original in a highly indexed archive from which it may be retrieved easily.

In the Collaboration, Visualization and Information Management (CVIM) program, we are building a meeting recorder and browser, called "Rough'n'Ready" (see figure below) that will automatically produce a ROUGH transcription of what was said along with a content-based structural summarization of the audio recording that is READY for browsing. The summarization meta-data provides a framework for data visualization and an index for efficient navigation of large audio archives. The basis of the structural summarization is the automatic transcription produced by our state-of-the-art large vocabulary speech recognition system. On top of the transcription, we have added annotations from three relatively mature component technologies - speaker demarcation, named entity extraction, and topic classification.

[Figure: gistfig4.gif]
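One way to picture the summarization meta-data is as a list of transcript segments, each carrying its speaker, names, and topics, plus an inverted index for navigation. The sketch below is an assumed data layout for illustration, not the Rough'n'Ready design; the `Segment` fields, the function names, and the sample content are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Segment:
    start: float   # seconds into the recording
    end: float
    speaker: str   # label from speaker demarcation
    text: str      # rough transcription from the recognizer
    names: list = field(default_factory=list)   # named entities found in the text
    topics: list = field(default_factory=list)  # ranked topic labels

def build_index(segments):
    """Map each speaker, name, and topic (lowercased) to the segments that mention it."""
    index = defaultdict(list)
    for position, segment in enumerate(segments):
        keys = [segment.speaker, *segment.names, *segment.topics]
        for key in dict.fromkeys(k.lower() for k in keys):  # de-duplicate, keep order
            index[key].append(position)
    return index

def find(index, segments, key):
    """Return the segments associated with a speaker, name, or topic."""
    return [segments[i] for i in index.get(key.lower(), [])]

meeting = [
    Segment(0.0, 42.5, "Speaker 1", "president clinton commented on the crisis",
            names=["Clinton"], topics=["Middle East"]),
    Segment(42.5, 90.0, "Speaker 2", "talks were held near sarajevo",
            names=["Sarajevo"], topics=["Bosnia"]),
    Segment(90.0, 130.0, "Speaker 1", "clinton will meet christopher in washington",
            names=["Clinton", "Christopher", "Washington"], topics=["Middle East"]),
]

index = build_index(meeting)
for segment in find(index, meeting, "Clinton"):
    print(f"{segment.start:>6.1f}s  {segment.speaker}: {segment.text}")
```

A browser built on such an index can jump straight to the portions of a recording that mention a given person or topic without requiring a perfect transcription.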

The Next Step

The technologies described here take us a long way towards getting the gist of a news story, article, or recording: speech recognition takes sound and produces words, topic identification tells us what the source is about, and named entity extraction tells us what people and places were mentioned. Ongoing research on relationship finding is beginning to provide "who did what to whom." While all of these individual technologies are fairly immature, in that there are few products and the results are often imperfect, the applications we described show that useful information can be found by combining the technologies in novel ways.

For a true gisting system, the missing piece is a component that determines what content the author or speaker thought was the most important and what is most relevant to the user. BBN's Rough'n'Ready circumvents this problem by providing a browsing capability so that the person who wants the information can easily find it for themselves. In the longer term, we see this as another challenge for a statistical learning algorithm supplemented with a user profile so that the information we want to see appears on our desks without a flood of unimportant distractions.

Marie Meteer, Herbert Gish, Francis Kubala, Richard Schwartz, and Ralph Weischedel work for BBN Technologies/GTE Internetworking, 70 Fawcett Street, Cambridge, MA 02138, and can be reached at 617 873-3052.
