Speech Technology Inches Closer to the Edge

Speech recognition is a complex technology, one that demands great processing power, high-speed connections, and very intelligent software. To date, most of the heavy lifting has been done in the cloud. However, a new generation of edge devices is emerging that is expected to push many processing functions from large data centers closer to users. The change has the potential to improve response time and lower costs, but the supporting infrastructure is immature and largely untested. As a result, companies should begin to dabble with edge technology but be aware that it is a work in progress at the moment.

The urgency stems from the fact that data volume is expanding at a mind-boggling rate. The collective sum of the world’s data is expected to grow from 33 zettabytes (1 zettabyte is equivalent to 1 trillion gigabytes) in 2018 to 175 zettabytes by 2025, according to International Data Corp. (IDC). That growth has been reported as a compound annual growth rate (CAGR) of 61 percent, though the two endpoints themselves imply roughly 27 percent a year.
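
The annual rate implied by two endpoints can be checked with the standard CAGR formula, (end / start) raised to the power 1 / years, minus 1. The sketch below applies it to the zettabyte figures quoted above; the function name is illustrative and is not drawn from any IDC material.

```python
def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate between two values `years` apart."""
    if start <= 0 or years <= 0:
        raise ValueError("start and years must be positive")
    return (end / start) ** (1 / years) - 1


# Endpoints quoted in the article: 33 ZB in 2018, 175 ZB in 2025.
implied = cagr(33, 175, 2025 - 2018)
print(f"Implied CAGR: {implied:.1%}")
```

Run with these endpoints, the result lands between 26 and 28 percent a year, which is still fast enough to multiply the world's data more than fivefold in seven years.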

Companies, naturally, are struggling with processing, storing, and securing their expanding pools of information, and the rising amount of data is driving the edge computing business case. At a high level, edge computing is an architecture that combines cloud services hosted in data centers (often called the core) with edge computing devices near the end user that can autonomously satisfy a portion of the application functionality. Changes are under way on both fronts.

Cloud computing offers companies a simpler way to deploy and manage computer infrastructure than legacy systems, but a mismatch has been growing between where data is processed and where it is created. The cloud centralizes computer processing in massive data centers. In 2018, about 90 percent of enterprise data was created and processed in centralized data centers or the cloud and only 10 percent at the edge, according to Gartner.

However, by 2025, Gartner expects the edge to account for 75 percent and the cloud for just 25 percent. As a result, a number of cloud infrastructure shortcomings are emerging. “Specialized speech applications, say connected cars, need to ship a lot of data to the cloud, but network availability and quality are not always guaranteed,” explains Deborah Dahl, principal at Conversational Technologies.

Privacy is another concern. “Increasingly, people have become suspicious of vendors’ motives and do not like their personal information going into the cloud because they have no control over it and no way of knowing what happens to it,” Dahl adds. Such uneasiness is especially prevalent in verticals like healthcare and financial services, but few industries are completely immune. In addition, consumers are worried that hackers could use speech systems as an entry point into their homes. Parents are especially anxious about what information is being collected from their children and how it is used.

Consequently, the speech industry has been looking to push computing power and storage capabilities out to the network edge. Speech platforms, such as Amazon Alexa, Google Assistant, Microsoft Cortana, and Nuance Communications’ Dragon, have used edge technology essentially since they first shipped. “There are a lot of smart end points, like speakers, auto infotainment systems, kiosks, and smartphones, that have voice user interfaces,” explains Dan Miller, founder and lead analyst of Opus Research.

However, the volume of computation and analysis occurring locally was small, consisting mostly of detecting the commands that wake up the system. One reason is that these platforms were designed several years ago, when edge devices lacked sufficient local processing and battery power.

As the industry has moved forward, those barriers have been cleared, and the limitations of the cloud have crystallized. Sending the bulk of a conversation to the cloud to be decoded, interpreted, and responded to can slow down response time, increase network infrastructure needs (and their costs), and introduce security concerns.

Edge Computing's Many Benefits for Speech Technology

Edge computing is evolving into a more appealing option because it analyzes data closer to where it is created and minimizes the movement of information from the end device to the voice recognition system. Smarter edge systems could support functions like audio capture, compression, transmission, language processing, and voice tracking. Also, putting larger subsets of words and natural language processing functions closer to the user creates many benefits, including the following:

It enhances application responsiveness because the system is not encumbered by network or cloud data center slowdowns.
It reduces internet bandwidth usage, sending simple text messages rather than complex voice recordings to the cloud.
It lowers costs, because companies that transmit less information can pare their networking expenses.
It reduces latency. Delays are problematic, and time is needed for the data to travel from the device to where the analysis is performed and return with the results. Moving processing closer to the end point shortens response times and enables select tasks, like adding an item to a shopping list or creating a reminder, to be handled in the background.
It better supports mission-critical applications. Processing is so quick that companies can deploy real-time applications that demand instantaneous data processing.
It provides offline availability. With the cloud, there is no guarantee that the network will always be available or reliable. With edge computing, the voice assistant can still process certain commands and perform select functions, such as automatically sounding alarms and sending reminders, even if the device is in airplane mode or outside the coverage area.
It keeps data private, as vendors can put checks in place so user data stays local and is not sent to the cloud.
It eases compliance with privacy mandates, like the European Union’s General Data Protection Regulation (GDPR), which restricts where personal information can be transferred and stored; less movement means fewer potential problems.
It improves security, as edge systems are getting better at differentiating and identifying user voices. Local processing could quickly thwart someone trying to break in by resetting the system profile.
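
Several of the benefits above (responsiveness, offline availability, and keeping audio local) come from the same design choice: try a small on-device command set first and involve the cloud only for what the device cannot handle. The sketch below illustrates that routing under stated assumptions. The command names, the `cloud_available` flag, and the `send_to_cloud` stub are all hypothetical and do not represent any vendor's API.

```python
from typing import Callable, Dict, Optional

# Hypothetical on-device command set; a real product's set would differ.
LOCAL_HANDLERS: Dict[str, Callable[[], str]] = {
    "set alarm": lambda: "alarm set",
    "add reminder": lambda: "reminder added",
}


def send_to_cloud(transcript: str) -> str:
    """Stand-in for a cloud speech service call (not a real API)."""
    return f"cloud handled: {transcript}"


def route(transcript: str, cloud_available: bool) -> Optional[str]:
    """Handle a recognized command locally when possible, else use the cloud."""
    command = transcript.strip().lower()
    handler = LOCAL_HANDLERS.get(command)
    if handler is not None:
        return handler()  # no network round trip, so this also works offline
    if cloud_available:
        return send_to_cloud(command)  # only text, not audio, leaves the device
    return None  # offline and outside the local command set
```

For example, `route("Set alarm", cloud_available=False)` is answered on the device, while `route("play jazz", cloud_available=False)` returns `None` because the request needs a service that is unreachable.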

But for edge computing to become fully operational, a number of infrastructure upgrades are still needed. A good starting point is hardware. “The biggest challenge surrounding such things as local natural language processing is confining the application and data models to small footprints on portable devices,” Miller explains.

Suppliers must upgrade their edge hardware to give it more processing capability. For instance, Amazon’s Echo devices use the company’s AZ1 Neural Edge processor, which requires 20 times less power and 85 percent less memory yet doubles the speech processing capabilities. In addition, semiconductor suppliers, like CEVA, Fluent.ai, NVIDIA, Intel, and Syntiant, are developing special-purpose central processing units, graphics processing units, digital signal processors, and system-on-chip speech processing solutions designed to deliver needed processing power in small, energy-efficient form factors.

Traditional wireless wide area network (WAN) technology was not a good fit for edge computing. Recognizing the limitations, the International Telecommunication Union (ITU), 3GPP, and the Internet Engineering Task Force (IETF) developed IMT-2020, better known as 5G. It offers numerous enhancements, including the following:

Support for more devices: The new standard was designed for the edge. 4G networks support a maximum of approximately 4,000 devices per square kilometer; 5G works with a million.
Reduced latency: 4G latency typically ranges from 20 milliseconds to 30 milliseconds; 5G is 1 millisecond to 10 milliseconds.
Faster speeds: 4G tops out at about 1 gigabit per second; 5G has top speeds of up to 20 gigabits per second.
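
For speech-sized payloads, the latency improvement matters more than the headline speed. The sketch below estimates the one-way time to move a short audio clip, reading the peak speeds above as gigabits per second and using the best-case latency figures from the list. The clip format (5 seconds of 16 kHz, 16-bit mono audio) is an assumption made for illustration, and real networks rarely reach peak rates.

```python
def transfer_ms(payload_bytes: int, link_gbps: float, latency_ms: float) -> float:
    """One-way time in milliseconds: link latency plus serialization time."""
    bits = payload_bytes * 8
    serialization_ms = bits / (link_gbps * 1e9) * 1000
    return latency_ms + serialization_ms


# Assumed clip: 5 s of 16 kHz, 16-bit mono PCM (not a figure from the article).
clip_bytes = 16_000 * 2 * 5

t_4g = transfer_ms(clip_bytes, link_gbps=1, latency_ms=20)  # best-case 4G figures
t_5g = transfer_ms(clip_bytes, link_gbps=20, latency_ms=1)  # best-case 5G figures
print(f"4G: {t_4g:.2f} ms, 5G: {t_5g:.2f} ms")
```

Under these assumptions the clip itself takes just over a millisecond to serialize even on 4G, so nearly all of the difference between the two results comes from the latency term.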

Artificial intelligence and machine learning advances make speech systems ambient. After a keyword is detected, the device starts actively listening. More intelligence can be placed locally, so edge systems can better process information in noisy environments, such as a busy office. Emerging techniques separate the user’s voice from surrounding sounds.

For instance, beamforming processes audio from multiple microphones in order to focus listening in the direction of the user. If an employee moves from place to place, voice tracking algorithms adjust the balance among microphone signals, so the system knows where the speaker is and picks up what is being said.
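
One textbook form of this idea is delay-and-sum beamforming: shift each microphone's signal by the time the target sound takes to reach it, then average, so sound from the chosen direction adds up while sound from elsewhere partly cancels. The sketch below is a minimal pure-Python version that assumes whole-sample delays that are already known. It is not a description of how any product named in this article works, and a voice tracking system would keep re-estimating the delays as the speaker moves.

```python
from typing import List, Sequence


def delay_and_sum(
    channels: Sequence[Sequence[float]], delays: Sequence[int]
) -> List[float]:
    """Align each microphone channel by its delay (in samples) and average.

    delays[i] is how many samples later the target sound reaches
    microphone i compared with the earliest microphone.
    """
    if len(channels) != len(delays):
        raise ValueError("one delay per channel is required")
    # Only keep samples for which every shifted channel still has data.
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[n + d] for ch, d in zip(channels, delays)) / len(channels)
        for n in range(length)
    ]


# Toy input: the same pulse arrives one sample later at the second microphone.
mic_a = [0.0, 1.0, 0.0, 0.0]
mic_b = [0.0, 0.0, 1.0, 0.0]
steered = delay_and_sum([mic_a, mic_b], delays=[0, 1])
unsteered = delay_and_sum([mic_a, mic_b], delays=[0, 0])
```

With the correct delays the pulse is preserved at full height in `steered`; with no steering it is smeared across two samples at half height in `unsteered`.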

Software also suppresses conversation interference. Similar to the way noise-canceling headphones work, the device accounts for barge-ins and suppresses music, even when it is played loudly.

Advanced edge computing power supports voice biometrics that prevent unauthorized users from entering information, making purchases, or changing key system settings. Such features are important in departments working with sensitive customer or employee information, such as human resources data or billing.
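
A common way to frame such a check is to compare a numeric "voiceprint" computed from the incoming audio with one stored at enrollment and accept the speaker only if the two are similar enough. The sketch below assumes the voiceprints already exist as fixed-length vectors and uses cosine similarity with an arbitrary threshold of 0.8. Both the vectors and the threshold are illustrative assumptions; producing real voiceprints requires a trained speaker model that is outside the scope of this sketch.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_enrolled_speaker(
    enrolled: Sequence[float], candidate: Sequence[float], threshold: float = 0.8
) -> bool:
    """Accept the candidate only if it is close enough to the enrolled voiceprint."""
    return cosine_similarity(enrolled, candidate) >= threshold


# Made-up three-dimensional voiceprints; real ones have far more dimensions.
owner = [1.0, 0.0, 0.0]
same_person = [0.9, 0.1, 0.0]
stranger = [0.0, 1.0, 0.0]
```

The threshold sets the trade-off: a higher value rejects more impostors but also turns away the legitimate user more often, which matters when the check guards purchases or system settings.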

On-device AI sound recognition can also perform advanced security functions. For example, a device can detect the sound of glass breaking and trigger an alarm. When the device is connected to cameras, the sound can also trigger a close-up recording of the event.

Edge computing features are being added to smart devices, computers, printers, home appliances, lamps, office equipment, and toys. Users can speak commands to perform tasks, like printing a document, or to get help reading important documents.

Edge computing offers potential cost savings. Application programming interface (API) calls for vendor speech recognition usually cost about $4 per 1,000 API calls. Placing the intelligence closer to the device eliminates them and lowers system expenses.
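
The savings scale with the share of requests that never leave the device. The sketch below uses the roughly $4 per 1,000 calls quoted above; the monthly request volume and the 75 percent local share are made-up inputs for illustration, and the calculation ignores the cost of the edge hardware itself.

```python
def monthly_api_cost(
    requests_per_month: int, local_share: float, price_per_1000: float = 4.0
) -> float:
    """Cloud speech API cost after a share of requests is handled on-device."""
    if not 0.0 <= local_share <= 1.0:
        raise ValueError("local_share must be between 0 and 1")
    cloud_requests = requests_per_month * (1.0 - local_share)
    return cloud_requests / 1000 * price_per_1000


# Hypothetical workload: 1 million requests a month.
baseline = monthly_api_cost(1_000_000, local_share=0.0)
with_edge = monthly_api_cost(1_000_000, local_share=0.75)
print(f"cloud only: ${baseline:,.0f}, with edge: ${with_edge:,.0f}")
```

For this hypothetical workload the bill falls from $4,000 to $1,000 a month, so the case for edge processing depends on how many requests are simple enough to resolve locally.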

Edge Computing Is a Work in Progress

However, edge application development work is complicated, is in a nascent stage of development, and requires a more robust ecosystem. As data moves from the cloud to the edge, software complexity increases.

The growing diversity of hardware platforms and the communication protocols that they support also presents challenges, according to Dave McCarthy, research director for edge strategies at IDC. Keeping data and application logic in one place, the cloud, is simpler than trying to coordinate them among multiple locations.

Scaling is also an issue. “Edge works well for applications with a small number of devices, but as vendors scale to hundreds or thousands, the model often breaks,” McCarthy says.

Compounding the challenge is a lack of standards. Currently, vendors are solving the problems in their own way, so software portability and development consistency are limited.

Software updates and maintenance become more complicated because data has to be synchronized at multiple locations. “If there is a small set of possible functions, say for a toy, updating should be straightforward,” Dahl maintains. “If an application is complex, say inventory, the work becomes more troublesome.”

As data moves from the data center to the edge, companies also need new management tools. Without them, they might not be able to monitor what is occurring at each step in a transaction, identify potential bottlenecks, and ideally fix problems before they negatively impact performance.
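
At its simplest, that kind of visibility means timing each step of a transaction and flagging the slowest one. The sketch below does this with a small context manager from the Python standard library. The stage names and the `time.sleep` calls are placeholders for real work, and a production deployment would need to collect such timings from many devices rather than a single process.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed(stage: str, timings: Dict[str, float]) -> Iterator[None]:
    """Record the wall-clock duration of one pipeline stage, in seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start


def slowest_stage(timings: Dict[str, float]) -> str:
    """Name of the stage with the longest recorded duration."""
    return max(timings, key=timings.get)


timings: Dict[str, float] = {}
with timed("capture", timings):
    time.sleep(0.01)  # placeholder for audio capture
with timed("recognize", timings):
    time.sleep(0.03)  # placeholder for on-device recognition
print(f"bottleneck: {slowest_stage(timings)}")
```

Recording the duration in a `finally` clause means a stage that fails still leaves a timing behind, which is often the stage an operator most needs to see.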

Finally, be aware that this area is new, so the support infrastructure and skill sets needed are largely missing. Few developers understand the new architecture, and best practices are only starting to be developed. In sum, a lot of work needs to be done with the ecosystem.

Edge Computing Finds a Niche

Because of the fledgling nature of edge speech systems, they are the exception rather than the rule. They are found in select use cases, including ones that require the following:

speed, when systems need to be able to process data incredibly fast, such as real-time solutions;
bandwidth conservation, when machines generate vast amounts of data that would be inefficient to send to a distant data center;
autonomy, when solutions need to be able to function without a network connection; and
compliance, when information must remain within a specific area to comply with regulations.

As a result, the number of vendors focused on this area is small. In November 2019, Nuance Communications spun out Cerence, which became an independent automotive software company. The Cerence Drive voice recognition system is used in 350 million cars, and its virtual assistant capabilities perform tasks like turning up the heat and finding the nearest Wi-Fi-enabled coffee shop.

Sensory’s edge solutions are embedded in more than 3 billion products from hundreds of consumer electronics manufacturers, including AT&T, Hasbro, Huawei, Google, Amazon, Samsung, LG, Motorola, GoPro, Sony, Tencent, Garmin, Microsoft, and Lenovo.

So what does the future hold? “I do not see most speech applications using edge technology, but those that require low latency, privacy, and security will find it attractive,” Dahl concludes.

Paul Korzeniowski is a freelance writer who specializes in technology issues. He has been covering speech technology issues for more than two decades, is based in Sudbury, Mass., and can be reached at paulkorzen@aol.com or on Twitter @PaulKorzeniowski.
