Debunking the Most Common Myths in Voice Technology


Demand for voice-enabled products has surged amid the pandemic, as people spend more time inside with their smart home devices and less time wanting to touch anything (especially anything that's shared with other people). While voice user interfaces started with smart speakers, the market has grown as consumers crave a variety of voice experiences at home and on the go. In addition to consumer products, demand is rising for voice-enabled point-of-sale displays, industrial applications, and many more enterprise use cases. 

In the rush to capitalize on the demand for this technology, businesses can easily fall prey to common misconceptions about making voice-enabled products. Buying into these assumptions can lead to products that fail to use voice technology to its greatest potential, or even worse, result in a shoddy product that doesn't respond to users' voices.  

I’m here to debunk these myths one by one, and along the way explain what actually makes for better voice performance and what customers really want from voice-enabled tech.

Myth 1: Consumers only want to use voice with smart speakers.  

We're only scratching the surface of the uses for voice technology. The truth is that just about any device with a user interface can benefit from the addition of voice control. In fact, the more complicated the existing interface, the larger the benefit of adding voice.

Take fast food restaurants that recently implemented touch screens, which no customer wants to use during a pandemic. Patrons could use their voices to order inside the restaurant, like they do with drive-thru ordering, but through a one-to-one, natural language experience. Similarly, inside grocery stores, imagine if patrons could compare products, shop from recipes, and receive answers to their product questions, all through kiosks that respond to their voices.   

Or, what if inside every home was a voice-controlled microwave? Think of how much easier it would be to make a meal on demand while working from home or enjoying more time with family. 

The reality is that there is so much untapped potential for this technology to simplify people's lives, and consumers are hungry for well-made voice-enabled products that solve real problems. 

Myth 2: It’s necessary to choose one type of assistant and stick with it.

Not long ago, product makers had to pick a single voice assistant and stay locked into it. This has only recently begun to change with the introduction of Amazon’s Voice Interoperability Initiative, which gives customers the opportunity to interact with multiple voice services on one device.

This interoperability is expected to foster the idea of voice hubs, in which companies can work together to provide a consistent voice-user experience with all of their smart devices. Now, product makers can support different assistants geared for specific tasks. For example, one assistant could manage making a recipe from a cookbook and helping with other kitchen tasks, while perhaps another could be for listening to the news and selecting the perfect playlist.

This interoperability can also reduce friction and help voice expand more broadly. Smart TV manufacturers can focus on implementing voice activation of programming content while leaving actions such as music playback to another wake word engine. Similarly, a smart refrigerator can use an audio front end with Samsung's Bixby to check the food inventory while also integrating Amazon's Alexa to read the news and play Spotify. Interoperability can speed up innovation across the voice industry while letting companies work together to reduce the fragmented experience of running multiple cloud assistants in parallel.

This flexibility opens up so many possibilities of which businesses can and should take advantage. 

Myth 3: More microphones equals better performance.

Adding more microphones is not a silver bullet for improving performance. In fact, there are diminishing returns. In most consumer applications, such as devices in homes and cars, performance tops out at about four mics before the increase in quality becomes so marginal that it's not worth the time and money investment.

Microphone configurations and processing algorithms that elevate the user’s voice above environmental noise can improve recognition accuracy far more than simply increasing the number of mics. It's important to consider the signal-to-noise ratio and the estimated direction of arrival of the dominant source; both feed the beamforming algorithms typically used to focus the microphones on the voice and ignore sounds coming from other directions.

Canceling out background noise is a hugely challenging problem in front-end design, and it isn't solved by simply adding more mics. In a product like a doorbell or a security camera, for example, a robust noise cancellation algorithm such as an adaptive interference canceller (AIC), which suppresses unreferenced noises so the device hears only the user's command, can be more beneficial than adding microphones to the design.
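The adaptive-cancellation idea can be sketched with a classic LMS noise canceller. This is a textbook illustration, not a production AIC: it assumes a second "reference" mic that hears mostly the noise source, and the function name and parameters are mine.

```python
import numpy as np

def lms_canceller(primary, reference, taps=8, mu=0.01):
    """LMS adaptive noise canceller sketch.

    A reference microphone picks up the interfering noise; an adaptive
    FIR filter learns how that noise appears in the primary mic and
    subtracts its estimate. The residual (the LMS 'error' signal) is
    the cleaned speech.
    """
    w = np.zeros(taps)
    out = np.zeros(len(primary))
    for n in range(taps - 1, len(primary)):
        x = reference[n - taps + 1:n + 1][::-1]  # newest reference samples first
        noise_est = w @ x
        e = primary[n] - noise_est               # residual = speech estimate
        w += mu * e * x                          # LMS weight update
        out[n] = e
    return out
```

Because the speech is uncorrelated with the reference noise, the filter converges toward cancelling only the noise path, which is why this approach can beat simply adding more microphones in small devices.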

Myth 4: Machine learning is only for wake word and event detection.

Product makers who only think to use machine learning for wake words and event detection are letting a powerful tool go to waste. The reality is that machine learning can be used for so much more, including audio source detection and noise cleanup. Machine learning is driving many audio innovations, including helping devices detect when a person is talking and distinguishing speech from noise.

In the case of video conferencing, machine learning is very useful for eliminating background noise. The current trends of remote work and video calls provide data for businesses to improve audio processing at the edge or in the cloud. Many stages in the signal processing chain, such as voice activity detection (VAD), noise suppression, and echo cancellation, that were traditionally implemented with classical digital signal processing can now use machine learning and deep learning to improve performance.
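To see what a learned VAD is replacing, here is the classical energy-based baseline, a minimal sketch with illustrative names and thresholds. A DNN-based VAD swaps the hand-tuned dB rule for a trained classifier, but the framing front end looks much the same.

```python
import numpy as np

def energy_vad(signal, fs, frame_ms=20, threshold_db=-30.0):
    """Classical energy-based voice activity detector sketch.

    Splits the signal into short frames and flags as speech any frame
    whose RMS level, relative to the loudest frame, exceeds a fixed
    dB threshold. Simple, but fragile in real noise, which is where
    ML-based VADs improve on it.
    """
    frame = int(fs * frame_ms / 1000)
    n_frames = len(signal) // frame
    frames = signal[:n_frames * frame].reshape(n_frames, frame)
    rms = np.sqrt(np.mean(frames ** 2, axis=1) + 1e-12)
    level_db = 20 * np.log10(rms / (rms.max() + 1e-12))
    return level_db > threshold_db
```

A fixed threshold like this fails as soon as the background noise rises near speech level, which is precisely the regime where a model trained on real call audio keeps working.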

Myth 5: The voice performance of the product hinges on the audio front end. 

The key word in audio system is system: multiple components must all work in unison for optimal performance, and that's what makes developing a voice-enabled product so challenging. Microphones have to be properly ported, sealed, isolated from the speakers, noise-free, synchronously clocked, matched in level, and integrated with the main processor. And that's just the microphones! You also have to contend with loudspeakers, the audio front end, real-time interrupts, voice recognition engines, and application software. Any single failure becomes the weak link in the chain and will lead to a disappointing user experience. As you plan your next voice-enabled product, give careful consideration to system-level debugging.
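One small example of that system-level work is level matching across microphones. Real products calibrate with a known acoustic stimulus; this is just an illustrative sketch, assuming the channels observed comparable energy, with names of my own choosing.

```python
import numpy as np

def match_levels(mic_signals, ref_channel=0):
    """Mic level-matching sketch.

    Estimates one scalar gain per microphone so every channel sits at
    the reference channel's RMS level. Mismatched channel gains are
    one of the quiet failure modes that degrade beamforming and
    recognition downstream.
    """
    rms = np.sqrt(np.mean(mic_signals ** 2, axis=1)) + 1e-12
    gains = rms[ref_channel] / rms
    return mic_signals * gains[:, None], gains
```

A returned gain far from 1.0 is a quick flag during bring-up that a mic is badly ported, sealed, or wired, before the problem surfaces as mysterious recognition errors.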

The rise in demand and capabilities of voice-enabled technology opens up a world of potential for businesses and products. While that world is vast, it doesn't need to be out of reach. With the right approach and a little savvy about common pitfalls and assumptions, businesses can craft the types of voice-enabled products that work excellently and rise to the top of customers' wish lists.
