Deep Learning, Big Data, and Clear Standards


Machine learning has played a huge role in the recent dramatic improvements in speech recognition and natural language understanding—and in the applications, like virtual assistants, that use these technologies. Related technologies like face recognition, emotion recognition, and translation between languages have also benefited. But machine learning depends on data, and the more data the better.

In the case of deep learning, a relatively new and very promising approach to machine learning, data looms even larger. When the goal of a machine learning effort is something like speech recognition, data is (relatively) plentiful, because speech recognition is a generic task—at its essence, its goal is to classify audio streams into streams of words.

Other tasks are even more generic. For example, in emotion recognition, the goal is to classify an expression of emotion, like the look on someone’s face or someone’s tone of voice, into one of a fairly limited set of emotions. Opinions vary about how many human emotions there are, but the number recognized by different theories ranges from four to 24. Even 24 is not very many, and because there are so few emotions to tell apart, it’s easy to get examples of all of them.

In all machine learning tasks, the problem to be solved becomes more difficult as the number of things that have to be distinguished increases. More data is required because the machine learning or deep learning algorithms need many examples of each category. This is one reason why limiting the domain in a natural language task is so useful. In a constrained domain, like customer service for a single product, less data is needed to train the system. This explains why natural language understanding tools such as Microsoft LUIS, Wit.ai, or Dialogflow don’t require huge amounts of data. They don’t attempt to understand an entire human language, just the utterances that are needed for a specific domain. Developers can easily supply enough examples to train these narrow systems, typically only a few hundred utterances. As a result, these systems are very effective at representing a small number of types of meanings.
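
To see why a narrow domain needs so few examples, consider the following toy sketch in Python. It is only an illustration: the two intents, the six training utterances, and the word-counting approach are all assumptions made up for this example, and it does not reflect how Microsoft LUIS, Wit.ai, or Dialogflow work internally.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the utterance and split it on whitespace."""
    return text.lower().split()


def train(examples):
    """Count how often each word appears under each intent."""
    counts = defaultdict(Counter)
    for text, intent in examples:
        counts[intent].update(tokenize(text))
    return counts


def classify(counts, text):
    """Pick the intent whose training words overlap most with the input."""
    words = tokenize(text)
    scores = {
        intent: sum(word_counts[w] for w in words)
        for intent, word_counts in counts.items()
    }
    return max(scores, key=scores.get)


# Hypothetical training data for a two-intent customer service domain.
examples = [
    ("where is my order", "order_status"),
    ("track my order", "order_status"),
    ("has my package shipped", "order_status"),
    ("i want to return this item", "return_item"),
    ("how do i send this item back", "return_item"),
    ("return my purchase", "return_item"),
]

model = train(examples)
print(classify(model, "can you track my package"))
```

With only two categories to tell apart, a handful of examples per category already separates them. Every additional intent would need its own set of examples, which is why the amount of data required grows as the domain broadens.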

But what if we want to represent broader domains—in developing, for example, a shopping assistant for a company with thousands of products—or more sophisticated meanings, such as hypothetical situations, the scheduling of future events, or the planning of multiple actions? To build systems that can handle broader domains and more complex language, we need more data than the few hundred—or even the few thousand—utterances that individual developers can provide. We need very large amounts of data, and it has to include annotations, the information that tells the system what the utterances mean.
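
To make the idea of an annotation concrete, here is a minimal sketch of what one annotated utterance might look like. The field names (`intent`, `slots`, character offsets), the `make_slot` helper, and the example utterance are assumptions chosen for illustration; they are not the format of ATIS or of any particular tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Slot:
    """A labeled span of the utterance (hypothetical schema)."""
    name: str
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive


@dataclass
class AnnotatedUtterance:
    """An utterance plus the annotations that say what it means."""
    text: str
    intent: str
    slots: List[Slot] = field(default_factory=list)

    def slot_values(self) -> Dict[str, str]:
        """Map each slot name to the text it covers."""
        return {s.name: self.text[s.start:s.end] for s in self.slots}


def make_slot(text: str, name: str, value: str) -> Slot:
    """Build a Slot by locating the first occurrence of `value` in `text`."""
    start = text.index(value)
    return Slot(name, start, start + len(value))


text = "show me flights from Boston to Denver on Friday"
example = AnnotatedUtterance(
    text=text,
    intent="find_flight",
    slots=[
        make_slot(text, "origin", "Boston"),
        make_slot(text, "destination", "Denver"),
        make_slot(text, "day", "Friday"),
    ],
)
print(example.slot_values())
```

The raw text alone says nothing about meaning. The intent label and the slot spans are the annotations, and someone has to supply them for every utterance in the data set, which is where the cost comes from.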

Getting large amounts of annotated training data is costly. Large commercial organizations can do this kind of data collection and annotation internally, in proprietary formats, but then the data is not available to the larger research community. Large research programs sponsored by government organizations like the National Science Foundation also have the resources to pay for data annotation, and they do. These government-sponsored efforts have led to the development of important data sets like the Air Travel Information System (ATIS), the Penn Treebank, and FrameNet.

These large annotated data sets are extremely valuable and remain useful for a very long time. For example, the ATIS corpus, sponsored by the Defense Advanced Research Projects Agency (DARPA) and collected between 1990 and 1994, is still widely used and is frequently cited in research papers. Government sponsorship of large data collection efforts will be critical to the development of future natural language understanding systems.

Here’s where standards come in. To be generally useful to many organizations in the research community, the data annotations have to be consistent, generally agreed on, and well documented. Otherwise, a company that wants to train its natural language processing system using data from several sources will have to spend a lot of time converting the data between multiple, possibly arbitrary formats. Annotation standards not only help developers, but they also make it much easier for researchers and developers from universities, start-ups, and other small organizations to contribute their own data to the research community.
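
The conversion burden described above can be sketched in a few lines of Python. Both source formats here (“source A” and “source B”), their field names, and the common target format are hypothetical, invented only to show the shape of the problem.

```python
def from_source_a(record):
    """Source A (hypothetical): upper-case label plus a list of typed entities."""
    return {
        "text": record["utterance"],
        "intent": record["label"].lower(),
        "slots": {e["type"].lower(): e["value"] for e in record["entities"]},
    }


def from_source_b(record):
    """Source B (hypothetical): intent plus 'name=value' slot strings."""
    slots = dict(pair.split("=", 1) for pair in record["slot_pairs"])
    return {
        "text": record["sentence"],
        "intent": record["intent"],
        "slots": slots,
    }


# Without a shared standard, every new source needs its own converter.
CONVERTERS = {"a": from_source_a, "b": from_source_b}


def merge(sources):
    """Normalize records from every source into one training list."""
    merged = []
    for source_name, records in sources.items():
        convert = CONVERTERS[source_name]
        merged.extend(convert(r) for r in records)
    return merged


sources = {
    "a": [{
        "utterance": "book a flight to Denver",
        "label": "BOOK_FLIGHT",
        "entities": [{"type": "DESTINATION", "value": "Denver"}],
    }],
    "b": [{
        "sentence": "fly me to Boston",
        "intent": "book_flight",
        "slot_pairs": ["destination=Boston"],
    }],
}

training_data = merge(sources)
for row in training_data:
    print(row)
```

Each converter is simple on its own, but someone has to write, test, and maintain one for every source. If all contributors annotated to one agreed format, the `CONVERTERS` table would not be needed at all.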

As an example, steps are being taken to develop standard annotation formats for more sophisticated language understanding concepts like time, pronoun reference, “modal” concepts like “should” and “can,” and dialogue acts. A recent ISO workshop, “Interoperable Semantic Annotation,” took place in September and explored annotations for more complex semantic concepts. We’ll soon need to have data sets with these kinds of annotations for the next generation of natural language understanding systems, and workshops like this one are a promising start.

Deborah Dahl, Ph.D., is principal at speech and language consulting firm Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interaction Working Group. She can be reached at dahl@conversational-technologies.com.
