Speech Technology Magazine


Hold the Pickle

By Judith Markowitz - Posted Nov 9, 2006

Over the years, I'd heard about planned deployments of automatic speech recognition (ASR) and/or text-to-speech (TTS) in drive-through facilities of fast-food restaurants. To my knowledge, none of them ever came to fruition. When I recently learned about another unrealized project, I approached fast-food restaurant chains to hear what they had to say about the idea. Burger King, Wendy's, and KFC all told me they'd never considered using ASR or TTS in their drive-up operations. McDonald's PR department didn't believe the company had plans for that type of deployment either.

Exit 41 provides customer ordering solutions for the fast-food restaurant industry (it calls them "quick serve restaurants" or "QSRs"). Marcel Koster, Exit 41's product manager, crafts solutions for drive-through operations that include order centers that can serve multiple locations. Koster believes it would be fairly easy to integrate ASR and/or TTS into the Exit 41 drive-through ordering system but explains, "We have not seriously considered TTS and ASR." He then cited several challenges that ASR and TTS would need to overcome to be attractive drive-through solutions in QSRs.


"The biggest implementation challenge an ASR/TTS solution has is speed. Stores using our order center solution have their customers being addressed by an employee within three seconds and typically have order entry completed within 20 seconds," Koster says. A major goal of implementing any new ordering technology would be to reduce one or both of those times. Koster doubts that ASR or TTS could do so, because "the solution must at least meet a human's ability to parse difficult speech for an accurate order," and do it as quickly as a high school junior. He cited convoluted ordering patterns, such as "Number one with a Coke, no ketchup" (Coke does not contain ketchup) and "Number one with a Coke, no ice, no ketchup," spoken by native and non-native speakers of English who might also have questions about the menu.

While it might be possible to configure a good, server-based speech solution to handle the needed semantic and syntactic complexities of the dialogue, any extra questions, verifications, and clarifications would add time to the ordering process. That, by itself, would be perceived as unacceptable.


Another major challenge for both the order taker and the customer is the harsh speaking environment. Background clamor includes a plethora of car sounds (engine noise, horns, radio blare, and raucous passengers), all of which can be accompanied by gusting wind and the drumming of rain or snow on the microphone. The transaction itself is conducted over far-field microphones and muffled, scratchy speakers.

These communication challenges, compounded by rising noise levels and inclement weather, often lead the customer to shout into the microphone. That shouting is a well-documented automatic human response to noise called the Lombard effect, which is characterized by the following deviations from normal speech: increased vocal effort, increased formant amplitudes, increased vowel length, shifts in vowel formant locations, and deletion of some word-final consonants. It is not surprising, then, that the Lombard effect is known to degrade the accuracy of speech recognition systems.

To perform well, an ASR system would require input devices that are superior to those that are generally used in drive-through facilities. It would also need to withstand the Lombard effect.


"Even if it could match a human's parsing ability," Koster says, "I do not believe consumers would be as satisfied with the solution. Most folks are already annoyed with having to deal with non-humans on the phone. Having non-humans interfere with one's ability to get food seems even less satisfying."

ASR and TTS already contend with this objection in other types of deployments. Introducing them into a communication environment that is already less than user-friendly exposes the strength of the acceptance barriers at every level: the QSR, the integrator, and the end customer.

In my next column, I'll propose solutions for some of these challenges. Let me know of any suggestions you would like me to consider for that discussion.

Judith Markowitz is the technology editor of Speech Technology Magazine and is a leading independent analyst in the speech technology and voice biometric fields. She can be reached at (773) 769-9243 or jmarkowitz@pobox.com.
