Q&A: Tobias Dengel, CEO of WillowTree, on Multimodal Apps Involving Voice

Multimodal applications that include speech are seemingly ubiquitous, but building them is no small feat. James Larson, co-chair of the SpeechTEK conference, talked to Tobias Dengel, CEO of WillowTree, about what it takes to build these apps, what is holding voice back, and how it can be fixed. Dengel will expand on this further during his SpeechTEK 2020 presentation.

What is the difference between multimodal apps and multimedia apps?

Multimedia refers to media content that is distributed to users across a variety of platforms. Multimodal is two-way, both in how we get information from machines and in how we communicate back to them. The second aspect of multimodal is that we can seamlessly switch between platforms. I can, for example, ask Alexa what movies are playing tonight and get a response via text or in my app.

What do platforms like Lenovo, JBL Link View, Google Home Hub, and Amazon Echo Show provide beyond voice-only platforms like Amazon Echo and Google Home? What are the advantages of using these new platforms?

These new devices are first-generation products truly trying to take advantage of the concept of multimodal, which is, at its simplest level, based on the fact that we can speak faster than we can type and can read faster than we can hear. The interface of the future is humans speaking to machines and machines responding with text/graphics. Most of this text/graphics response will be via the device that is always on and always with us, the smartphone, although there will be certain applications where the TV screen, a car screen, or an Echo Show makes sense.

Why are HBO, Fox, Regal Cinemas, and Synchrony Bank shifting their thinking from voice to multimodal?

For many of our clients, the original voice-only experiences that Alexa or Siri provided just didn't make sense or add value. Now, with multimodal, the core concept is increasing the efficiency of the "instruction process" that we humans go through to communicate with machines via typing, tapping, or swiping by five or 10 times. One of my favorite demos is ordering a complex pizza in 10 seconds via voice vs. 45 seconds via an app, but then getting the final order texted to the user to confirm.

How much additional effort is needed to construct a multimodal app compared with a traditional voice-only app?

It really depends. If a company has an existing app, the wiring for voice can be relatively straightforward, especially for narrow use cases (e.g. "what's my balance?"). If everything, including back-end systems, has to be created from scratch, it's a big lift.
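The narrow "what's my balance?" case he mentions can be sketched in a few lines. This is a hypothetical illustration, not WillowTree's implementation: the keyword matcher stands in for a real speech/NLU service, and the account data stands in for an existing back end. The response carries both speech and display text, reflecting the multimodal pattern described above.

```python
# Hypothetical sketch: wiring a narrow voice use case into an existing app.
# BALANCES stands in for a real back-end system; recognize_intent stands in
# for a real speech-recognition/NLU service.

BALANCES = {"checking": 1204.50, "savings": 8000.00}

def recognize_intent(utterance: str) -> str:
    """Naive keyword matcher standing in for a real NLU service."""
    if "balance" in utterance.lower():
        return "CheckBalance"
    return "Unknown"

def handle(utterance: str) -> dict:
    """Return a multimodal response: short spoken answer plus on-screen detail."""
    intent = recognize_intent(utterance)
    if intent == "CheckBalance":
        total = sum(BALANCES.values())
        return {
            "speech": f"Your total balance is {total:.2f} dollars.",
            "display": dict(BALANCES),  # richer detail goes to the screen
        }
    return {"speech": "Sorry, I didn't catch that.", "display": {}}
```

The division of labor is the point: the spoken channel gives the fast summary, while the screen carries the detail the user would rather read than hear.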

What skills are needed to create multimodal apps?

One of the things we're very excited about is the creation of new disciplines around voice/multimodal design and QA. These job descriptions are literally being created from scratch right now, so it's a huge opportunity for recent college grads. In addition, machine learning skills will be in larger demand to support ever-increasing accuracy in voice and intent recognition. Multimodal teams, including strategy, research, UX design (including voice and graphic), engineering (including web, native app, and voice), and QA, will be highly interdisciplinary across all platforms. Product owners will have to be adept across all these disciplines, which will be a unique and sought-after skill.

What should companies be thinking about doing today with regard to multimodal? How can they get started?

What's most exciting is that large incumbent companies have a major advantage over start-ups right now because they have all the data about customer interactions and what their customers want. They can use this data to train models that get their customers the responses they want. That data advantage will erode over time, so it's critical that companies invest in voice now. Most of the heavy lifting around voice recognition and machine learning platforms is being done by Microsoft, Google, Amazon, Facebook, and others and can be licensed. The place for companies to invest is building the intent/response models and the overall user experience design and deployment.

For most use cases, we don't believe in creating stand-alone voice assistants (like Erica); instead, companies should invest in voice-enabling their apps and websites, as Waze has done with a big mic button to transmit your destination. As we like to say, "Give your apps a voice."

To see presentations by Tobias Dengel and other speech technology experts, register to attend SpeechTEK 2020 today.
