Captioning Sees an Impressive Pace of Progress

Real-time captioning, also called computer-assisted real-time translation (CART), is often taken for granted as a readily available way to supplement speech with text as it is spoken, in a medium like video. But translating the spoken word more quickly and precisely in a live context requires a lot of technological innovation.

The problem is that while real-time captioning (RTC) has made strides, the end product often can’t live up to the expectations of content consumers across devices, who increasingly anticipate 100 percent accurate, instantaneous captioning.

But while plenty of challenges remain, many experts agree that the pace of progress has been impressive in recent years, especially considering how far we’ve come from the early and more primitive days of RTC. Indeed, the future appears as bright as a luminous white subtitle emblazoned across the bottom of your screen.

“Real-time captioning is more accurate and smarter than ever, thanks to advancements in artificial intelligence and how people consume video,” says Karen Van Wert, product manager at Qumu, a provider of cloud-based enterprise video technology. “Nowadays, people are no longer just getting captions from their television. Video is now on our phones, planes, computers, and all that content must be accessible to everyone, regardless of hearing ability or what language they speak.”

Consider that decades ago RTC on television or during live events was produced only by carefully trained stenographers.

“A human transcriptionist would be hired to listen to the broadcast and transcribe what is being spoken live, on the fly. Custom keyboards were even invented to make human transcriptionists faster, so that they could keep up with what was being spoken,” says Dylan Fox, CEO of AssemblyAI, a provider of automatic speech recognition (ASR) technology.

These transcriptionists were supported by a broadcast engineering team and the infrastructure responsible for making the connections over POTS (plain old telephone service) lines, which delivered the caption data to the clients’ physical encoder, according to Chris Antunes, co-CEO and cofounder of 3Play Media, a provider of online video captioning services.

“A decoder box connected to your television was required to view the captions. Not surprisingly, only very few broadcast programs were live captioned, as it truly took a village,” Antunes adds.

Eventually, evolving tech allowed for decoder chips to be included in television sets, expanding accessibility beyond just those with a decoder box.

“Speaker-dependent voice recognition technology revealed an entirely new way to create real-time captions, expanding the capacity beyond the limited resources of specially trained stenographers,” Antunes continues. “IP delivery of caption data reduced the need for a large support team, allowing captioners to connect from anywhere and transmit data to encoders directly via the Internet. A larger group of people able to perform real-time closed captions with less infrastructure support and more televisions able to display captions drastically increased accessibility.”

Gradually, RTC moved away from hard-coded analog captions to digital closed captioning. And recent advances in machine learning and ASR technology have made it possible for artificial intelligence (AI) to accurately transcribe speech in real time.

“Nowadays, this technology is reliably used in broadcast television, live streaming, online video, and other industries to provide real-time, automatic closed captioning, which greatly improves the accessibility of audio and video content,” Fox says. “With the enormous amount of audio and video content being created online today across platforms like YouTube, Clubhouse, and TikTok, for example, it would be too costly and time-consuming to have humans provide real-time captioning, and automatic captioning technology is the only possible solution.”

What’s more, the process of RTC can now be done remotely, “meaning the captioner does not have to be in attendance at the live event. Remote real-time captions are created by transmitting the necessary information using microphones, telephone lines, software, and the Internet,” notes David Ciccarelli, CEO of Voices, an audio services company.

As of 2021, the transcription industry is a $30 billion market with immense potential for innovative solutions. Besides tech giants like Google, Amazon, and Microsoft, all of which have built speech recognition engines that can deliver speech-to-text in real time, major players in this space include Nuance Communications, Captions Unlimited, Wordly, EEG, CaptionHub, Rev.com, Verbit, VITAC, AssemblyAI, CaptionMax, NCI, and Ai-Media.

“Most industries still use manual captioning processes that are highly inefficient and costly,” says Ariel Utnik, chief operating officer and general manager of Verbit, “and there is a huge opportunity for us to modernize these verticals with AI-driven solutions.”

Automated RTC, which relies on real-time ASR, has benefited from many of the same advances that have occurred in batch ASR algorithms over the past five years, according to Antunes.

“This includes the use of deep neural networks for both acoustic and language modeling, as well as the ability to use very large amounts of training data to optimize these models,” he says. “Of course, in real-time ASR, there are limitations in how much context can be used when running these algorithms, as this trades off against latency. As such, batch ASR, which can use sophisticated adaptation approaches to essentially use an entire video as context, generally achieves about a 20 percent lower word error rate than would a real-time implementation of the same core engine. Still, we have observed that, on average, real-time ASR can achieve about 85 percent accuracy on long-form video content, up from the 75 percent that was typical as little as five years ago, which translates to a 40 percent reduction in word error rate.”
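The arithmetic behind those figures is easy to check. The short Python sketch below is an illustration only, not 3Play Media’s method: it assumes that “accuracy” means 100 minus the word error rate (WER), and it uses the conventional definition of WER as word-level edit distance divided by the number of reference words. The function names and the sample sentences are made up for the example, and the batch figure simply applies the quoted 20 percent relative improvement to the 15 percent real-time error rate.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(old_wer: float, new_wer: float) -> float:
    """Drop in error rate, as a percentage of the old error rate."""
    return 100.0 * (old_wer - new_wer) / old_wer


# One wrong word out of five reference words is a WER of 0.2 (20 percent).
sample = word_error_rate("the quick brown fox jumps", "the quick brown box jumps")

# Assumption: accuracy = 100 - WER, so 75 percent accuracy means 25 percent WER.
old_wer = 100.0 - 75.0
new_wer = 100.0 - 85.0
reduction = relative_reduction(old_wer, new_wer)

# Assumption: "20 percent lower" is relative to the real-time error rate.
batch_wer = new_wer * (100.0 - 20.0) / 100.0

print(f"Sample WER: {sample:.0%}")
print(f"Real-time WER: {old_wer:.0f}% -> {new_wer:.0f}% ({reduction:.0f}% relative reduction)")
print(f"Implied batch WER: {batch_wer:.0f}%")
```

Read this way, the jump from 75 to 85 percent accuracy is only 10 percentage points, but it removes 40 percent of the errors a viewer would have seen.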

Kjell Carlsson, executive vice president of product strategy at Stratifyd, a provider of AI-powered omnichannel analytics solutions, attributes much of RTC’s recent progress to advances in machine learning and recurrent neural networks, which have become commercially feasible at scale since the early 2010s.

“Transformer networks, a new type of semi-supervised deep learning model, are poised to take natural language processing (NLP) and natural language generation capabilities even further. These transformer networks also have capabilities for machine translation, opening the tantalizing possibility of highly accurate real-time translation into multiple languages,” Carlsson says.

NLP combines computational linguistics, the rule-based modeling of human language, with statistical, machine learning, and deep learning models. Together, these enable computers not only to process text and voice data but also to understand the meaning behind it, including the creator’s intentions, Ciccarelli explains.

“NLP can be found in voice-operated GPS systems, spam identification, translation tools, digital assistants, dictation software, and customer service chatbots, which has led to businesses utilizing its capabilities to streamline operations, increase productivity, and simplify processes,” Ciccarelli says. “The advancements in NLP have allowed businesses to rely on automated, AI-powered tools that do a convincing job of replicating human language and sentiment.”

Ron Jaworski, CEO of Trinity Audio, agrees. “The improvements in machine learning and deep learning serve NLP in this context significantly. We can see it in the voice recognition algorithm and regarding text-to-speech quality, which was stuck in the same place for the last 20 years before making giant leaps in the last three or so years,” Jaworski says. “The option to teach the machine by recurrent feedback and then have the ability for the machine to improve itself with the actual feedback is pushing RTC and other text/voice analysis to new levels.”

Some also credit innovations in emotion detection with helping to further enhance RTC capabilities.

“Emotion recognition technology uses AI systems to scan human faces, again pulling from an immense collection of data against which it matches and evaluates nervousness, empathy, dependability, etc.,” Ciccarelli adds. “The scores for emotion recognition are staggering, with more than 85 percent accuracy for the basic emotions like happiness, anger, disgust, and neutral.”

Despite these advancements and the emergence of new tools, technology-only transcription and captioning platforms remain merely 75 percent to 80 percent accurate, according to Verbit’s Utnik, which is why a hybrid approach—using human transcribers in conjunction with technology—is often needed, especially to grasp speech context, understand heavy accents, and transcribe when audio quality is low.

“Our company has developed an online collaborative captioning platform where multiple transcribers come together to simultaneously review and fine-tune AI-generated transcripts. While the AI component is fast and learns from its mistakes through self-learning algorithms within minutes, it also excels when exposed to repetitive functions, which makes for a strong, near-perfect transcription and captioning end product that delivers 99 percent-plus accuracy,” Utnik says.

Who Stands to Benefit?

The deaf and hard-of-hearing continue to reap the most rewards from RTC’s growth and development. But many others stand to gain from improved technology, too.

“The pandemic has created major changes in how we work and how we consume media. Many more people are using captioning as they multiscreen or watch content at accelerated speeds. People have become used to captioning as a way to better understand and follow what they’re watching, and overall, companies have recognized the importance of making content more accessible and inclusive,” Utnik says.

“This has made captioning an expectation among viewers of all abilities, just like curb cuts and automatic doors—originally accommodations for wheelchair users—have become everyday requirements for everyone,” Antunes says.

Furthermore, RTC can help those who use English as a second language better follow live meetings, classes, and events, and it lets visual learners read along as the words are spoken. Plus, it can be especially valuable to people with cognitive disabilities such as dyslexia and attention disorders.

“Captioning has proven to improve transferability of knowledge, retention, and comprehension for all viewers,” Antunes adds.

RTC is particularly useful as noise pollution increases “and people try to attend virtual events from trains or crowded coffee shops,” Van Wert says. “Advanced real-time captioning, with clear indicators of who’s speaking and when, provides everyone the opportunity to engage with video when they can’t listen in. You’d be hard-pressed at this point to find someone who doesn’t rely on captioning, at least some of the time, and that’s a remarkable change in the last decade.”

Whether you’re a content creator, technology solutions provider, or other enterprise with a stake in RTC, it pays to be aware of the laws and rules in place. RTC is currently a legal requirement under several U.S. accessibility laws, including Section 508 of the Rehabilitation Act, Federal Communications Commission (FCC) regulations for broadcast, the 21st Century Communications & Video Accessibility Act (CVAA), and the Americans with Disabilities Act (ADA) Title II & Title III.

“Thanks to the FCC, the technology to decode captions was required to be built into electronics beginning in 2012. Broadcasters, cable companies, and satellite television service providers are mandated to provide closed captioning for 100 percent of all new, non-exempt, English language video programming,” Ciccarelli says. “And companies, government offices, courtrooms, schools, and others are obligated to provide RTC services for the deaf or hearing-impaired.”

In addition, captioning might be required for services that are open to the public, from concert venues and university lectures to employment, healthcare, and legal services, according to Fox.

“Further, U.S. courts and legislation are moving toward more universal adoption of the Web Content Accessibility Guidelines (WCAG) 2.0 Level AA, which requires captioning, transcription, audio description, and real-time captioning for video,” Antunes notes. “This is considered the gold standard that companies should achieve to be compliant with the law.”

Lingering Challenges

Janice Lintz, CEO of Hearing Access & Innovations, was responsible for creating the caption standards recommended by the Association of National Advertisers and essentially adopted by the FCC. She says that while RTC is now available online and has improved, “the accuracy still isn’t terrific. It appears better for those who are filling in a word or two or using it when they can’t use sound; but for people who are dependent on it, it is still frustrating.”

Ask Van Wert and she’ll tell you that “it’s kind of a chicken-and-egg situation because companies know they need better captions to drive viewer engagement but can only move as fast as the technology allows.”

Part of the problem is that beyond raw word recognition accuracy, both batch and real-time ASR are still limited in their ability to apply higher-level intelligence to the captioning problem.

“In general, the topic of the video is unknown, so even the first-level strategy of having a topic-optimized language model is not feasible. Transformer language model architectures are helping with this, but they are not yet as effective as building a model targeted at a particular domain,” Antunes says. “Even more challenging is acoustic modeling, since the speaker is inherently completely unknown in the captioning application. Thus, while deep neural networks have adapted well to a consistent acoustic field, they are less robust when the acoustic conditions are changing throughout a video, and this is exacerbated in a real-time situation.”

AI captioning may have improved considerably, but it’s never going to be perfect without a human touch, believes Peter Yagecic, executive director of technical projects at Situation Interactive, a digital agency that represents several nonprofit and educational organizations that use RTC.

“AI is getting better at accuracy. But services that allow you to seed machine captions with specific words that machines are never likely to guess correctly, like drug names for a pharma company or unique proper names, are critical for success,” Yagecic says. “Also, in our experience producing virtual events, we’ve found that most AI captions fail horribly when presenters are singing, which is not uncommon for any streaming event that incorporates entertainment. That’s why, for many of the events we work on, we prefer human captioners and provide our captioner with all the prerecorded content and proper nouns prior to the event so they can generate their base captioning script.”
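To make the idea of seeding captions with known terms concrete, here is a minimal Python sketch. It is a simplified stand-in, not how any particular captioning service works: commercial engines typically bias the recognizer itself, whereas this example only post-processes finished caption text, snapping near-miss words to a supplied term list with the standard library’s difflib module. The function name, the 0.8 similarity cutoff, and the sample terms are assumptions chosen for illustration.

```python
import difflib


def apply_custom_vocabulary(caption: str, vocabulary: list[str], cutoff: float = 0.8) -> str:
    """Replace words that closely resemble a seeded term with that term."""
    # Map lowercase spellings back to the preferred capitalization.
    lookup = {term.lower(): term for term in vocabulary}
    corrected = []
    for word in caption.split():
        core = word.strip(".,!?;:")  # compare without surrounding punctuation
        matches = difflib.get_close_matches(core.lower(), list(lookup), n=1, cutoff=cutoff)
        if core and matches:
            # Swap in the seeded term but keep the original punctuation.
            corrected.append(word.replace(core, lookup[matches[0]]))
        else:
            corrected.append(word)
    return " ".join(corrected)


seeded_terms = ["atorvastatin", "Yagecic"]  # hypothetical list supplied before an event
raw_caption = "Take atorvastatine daily, says Yagecik."
print(apply_custom_vocabulary(raw_caption, seeded_terms))
```

A cutoff this strict leaves ordinary words alone in the sample sentence, but a looser one would start overwriting legitimate words that merely resemble a seeded term, which is one reason human review still matters.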

Deborah Dahl, principal of Conversational Technologies, stands firm in her conviction that RTC will improve by leaps and bounds in the coming years.

“Captioning will benefit from better and faster speech recognition. Speaker identification and diarization—separating the speech of different speakers—will be better at tagging captions with the name of the person speaking,” she predicts.
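As a rough illustration of how diarization output could feed speaker labels into captions, the Python sketch below assigns each caption cue to whichever speaker turn overlaps it most in time. It assumes a separate diarization step has already produced timed speaker turns; the Turn and Cue classes, the tag_cues function, and the sample data are all hypothetical and do not reflect any vendor’s pipeline.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    start: float  # seconds
    end: float
    speaker: str


@dataclass
class Cue:
    start: float  # seconds
    end: float
    text: str


def overlap(cue: Cue, turn: Turn) -> float:
    """Number of seconds during which the cue and the speaker turn coincide."""
    return max(0.0, min(cue.end, turn.end) - max(cue.start, turn.start))


def tag_cues(cues: list[Cue], turns: list[Turn]) -> list[str]:
    """Prefix each cue with the speaker whose turn overlaps it the most."""
    tagged = []
    for cue in cues:
        best = max(turns, key=lambda turn: overlap(cue, turn), default=None)
        if best is not None and overlap(cue, best) > 0:
            tagged.append(f"{best.speaker}: {cue.text}")
        else:
            tagged.append(cue.text)  # no speaker was active, so leave it untagged
    return tagged


turns = [Turn(0.0, 4.0, "Host"), Turn(4.0, 9.0, "Guest")]
cues = [
    Cue(0.5, 3.5, "Welcome to the show."),
    Cue(3.8, 7.0, "Thanks for having me."),
    Cue(10.0, 11.0, "[music]"),
]
for line in tag_cues(cues, turns):
    print(line)
```

In a live setting the turns arrive with some delay and speakers talk over one another, so a production system would need to revise labels as new audio comes in; this sketch ignores that.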

Dahl hopes one particular improvement is forthcoming soon: adding captioning in a “second screen” scenario.

“This means that viewers will be able to see captions on a tablet or phone instead of the main screen. This would let several people watch a program with captions in different languages, even potentially in a movie theater,” she says.

Many also eagerly anticipate RTC delivered in sign language instead of text.

“This would be helpful for the many deaf people who are more comfortable with sign language than written language,” Dahl says.

Ciccarelli is encouraged that late last year Google announced that four new languages (French, Spanish, Portuguese, and German) would be offered in addition to English as real-time captions in Google Meet.

“As the evolution of real-time captioning continues, one thing I’d like to see is a continued expansion of real-time captioning languages offered across other companies, platforms, and social media,” Ciccarelli adds. “For that to happen, progress needs to be prioritized and viewed as a win for all: not only for the companies who are pushing forward this technology but for those who are deaf, hearing impaired, and speak other languages as well as those who don’t require additional accessibility.”

Many are impressed by the increased accuracy of RTC but acknowledge that its capabilities still leave plenty of room for improvement.

“That will change as companies continue to see value in investing in caption technology,” Van Wert says. “One of the hurdles is that RTC is often cost-prohibitive. The three main modes of live captioning—machine learning, AI, and live person—all have markedly different price points, inhibiting many from having access to the best captions possible. Fortunately, I think we’ll see costs go down dramatically as smarter technology increases caption accuracy.”

Erik J. Martin is a Chicago-area-based freelance writer and public relations expert whose articles have been featured in AARP The Magazine, Reader’s Digest, The Costco Connection, and other publications. He often writes on topics related to real estate, business, technology, healthcare, insurance, and entertainment. He also publishes several blogs, including martinspiration.com and cineversegroup.com.