A Perfect Storm of Speech Technologies


When I started working in natural language understanding (NLU) in the early 1980s, we envisioned a not-too-distant future when natural, real-time, spoken conversations with computers would be an everyday occurrence. We were pretty optimistic about when that would happen, too. Surely by 2000, these applications would become commonplace. When that didn’t happen, my colleagues and I, as natural language researchers, were quick to pin the blame on the immaturity of speech recognition. Surely, speech recognition researchers were thinking the same thing about natural language understanding.

Over the years, both speech recognition and NLU have improved dramatically, but not at the same rate. Speech recognition would get better, and then NLU would get better, but the state of the art in the two technologies was often out of alignment, with one further along than the other. And since conversational AI systems depend on several technologies all working well, there has always been at least one weak link in the chain that kept this vision from becoming reality.

This is no longer true. Speech recognition and NLU are now both very good. The same goes for related technologies like natural language generation, text-to-speech, and even machine vision and image generation. What underlies this success? Much of the improvement can be attributed to general computing infrastructure changes. For example, faster computers and networks have enabled researchers to try out different ideas much more quickly than in the past. The internet has accelerated research by making it possible for researchers to quickly share their ideas with colleagues all over the world. The rise of open-source software has also meant that technologists can easily share many of their research tools. Regardless of the reasons for this improvement, it is certainly real.

All of these technical advances combine to make a perfect storm of technologies, and now is a good time to step back and consider what new applications might be possible. Certainly, these advances can improve traditional applications like interactive voice response. What other applications could we unlock with these new capabilities? And how can we create synergies with related technologies like computer vision, image generation, and robotics?

One way to leverage these technical improvements is to return to some earlier applications that were based on good ideas but were technically premature. An example would be customer support chatbots, which, with the more advanced technologies, will be able to cover more types of questions. We should return to chatbot designs that didn’t try to answer users’ harder questions because the developers assumed the existing speech recognition and NLU technologies wouldn’t be up to the task; it’s much more likely that they are now. As with any conversational AI application, limited initial distribution and thorough testing are key to ensuring that users will not be frustrated with the application’s performance.

Meeting summarization is another example. In the past, meeting summaries often contained errors, but newer speech recognition and NLU could yield better results. As a bonus, we could also try identifying various meeting dynamics, such as how much time each person spends talking, or even which participants tend to go off-topic.
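To make the talk-time idea concrete, here is a minimal sketch in Python. It assumes a speech recognition system with speaker diarization has already produced a list of (speaker, start, end) segments in seconds; the segment format, the function names, and the sample data are all hypothetical and are not taken from any particular product.

```python
from collections import defaultdict


def talk_time_by_speaker(segments):
    """Sum speaking time in seconds per speaker.

    `segments` is an assumed format: an iterable of
    (speaker, start_sec, end_sec) tuples from a diarization step.
    """
    totals = defaultdict(float)
    for speaker, start, end in segments:
        if end < start:
            raise ValueError(f"segment for {speaker} ends before it starts")
        totals[speaker] += end - start
    return dict(totals)


def talk_share(totals):
    """Convert per-speaker seconds into each speaker's share of total talk time."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {speaker: 0.0 for speaker in totals}
    return {speaker: seconds / grand_total for speaker, seconds in totals.items()}


# Hypothetical diarized segments for a short meeting.
segments = [
    ("Alice", 0.0, 30.0),
    ("Bob", 30.0, 45.0),
    ("Alice", 45.0, 60.0),
    ("Carol", 60.0, 75.0),
]

totals = talk_time_by_speaker(segments)
shares = talk_share(totals)
for speaker in sorted(shares, key=shares.get, reverse=True):
    print(f"{speaker}: {totals[speaker]:.0f} s ({shares[speaker]:.0%})")
```

Spotting who goes off-topic is a harder problem, since it would require comparing what each person says against an agenda, so this sketch covers only the simpler talk-time measurement.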

Going beyond more traditional applications, here are some other possibilities for new types of applications that combine computer vision, voice, and natural language understanding.

  • Sports coach. Make a video of a golf swing and get tips on improving it, such as keeping your head down or holding the club differently. An advanced version could analyze a video of multiple players in a team sport.
  • Music coach. Make a video of someone playing a musical instrument. AI analyzes the music and hand motions and generates natural language tips to improve the user’s playing. This could also apply to improving singing.
  • Acting coach. Record a video of someone reading lines from a script, give the system a goal like “I want to act sympathetic” or “I want my acting to be like [famous actor],” and generate suggestions for improving the acting.

There is also a whole class of applications involving language interaction with items in the world, like robots and drones or smart devices like light and temperature controls. There isn’t space to explore them all here, but you can see the range of opportunities for exciting applications in these areas.

Speech, language, and computer vision technologies are going to continue to get better. It will be exciting to see what creative entrepreneurs come up with as we start to fully understand these newer capabilities.

Deborah Dahl, Ph.D., is principal at speech and language consulting firm Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interaction Working Group. She can be reached at dahl@conversationaltechnologies.com.
