January 21, 2019
By Deborah Dahl, Principal, Conversational Technologies
Standards

Integrating TTS in Web Browsers Is Harder Than It Sounds

Webpages are usually displayed on a screen and read by the user, but not always. People with visual disabilities often use screen readers, which take text displayed on a screen and read it aloud using text-to-speech (TTS). But webpage authors have little control over exactly how a screen reader will pronounce their text. You might think this level of control is unnecessary, because screen readers merely read the text that’s there.

But reading text out loud, pronouncing it correctly, sounding natural, and preserving the author’s intended meaning is a lot harder than it seems. A surprising number of differences exist between written and spoken language. Writing leaves out aspects of speech, such as pauses, emphasis, and pitch, that are essential for communicating nuance. Even the pronunciation of individual words isn’t always clear from the way they’re written.

This is especially evident in English, which has a lot of words with irregular spellings and pronunciations. Think of homographs, words that have the same spelling but more than one pronunciation (“bow and arrow” vs. “bow to the audience”; “lead the charge” vs. “the pipe was made from lead”). Take a look at the poem The Chaos by Gerard Nolst Trenité, where you’ll find tongue twisters like these:

Have you ever yet endeavoured
To pronounce revered and severed,
Demon, lemon, ghoul, foul, soul,
Peter, petrol and patrol?

Besides irregular spellings, new words show up all the time and quickly find their way around the internet. Consider all the new internet acronyms: “lol,” “bff,” “imho,” and “yolo,” just to mention a few. How is a screen reader supposed to know how to pronounce these coinages?

The problem gets even worse in multilingual pages, because if the screen reader doesn’t know every language on the page, words not in the page’s main language will be mispronounced.

A better approach would move part of the screen reader’s task to the webpage author, who, after all, is the best authority on how the page should be spoken. Letting developers specify exact pronunciations would go a long way toward improving the user experience with screen readers and, more generally, with any TTS system. Even sighted users who are looking at a small screen, or who just can’t find their reading glasses, would appreciate accurate TTS.

How would we go about integrating pronunciation instructions? This could be complicated. The standard web approach to defining the appearance of a webpage is to add markup, and we could follow this pattern with pronunciations. But what’s the markup for pronunciations? Authors will need to be able to specify where there should be a pause and for how long. How would they define emphasis, a faster or slower speech rate, or a higher or lower pitch? And how would they specify the exact pronunciations of words?

Fortunately, standards exist for all of these tasks—even going back to the late 19th century! That’s when the International Phonetic Alphabet, which is able to precisely define pronunciations in thousands of languages, was first published. But from a web perspective, the most important standards for TTS pronunciation were developed in the early 2000s for spoken user interfaces, as part of the World Wide Web Consortium’s Speech Interface Framework. They include VoiceXML and its supporting languages, such as Speech Synthesis Markup Language (SSML) and the Pronunciation Lexicon Specification (PLS). SSML makes it possible for developers to modify TTS speech with instructions for emphasis, controlling pitch, adding pauses, and defining exact pronunciations of words. PLS enables developers to use a dictionary to define the pronunciations of multiple words.
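To make this concrete, here is a minimal sketch of what SSML and PLS markup can look like, wrapped in a few lines of Python that read the pronunciations back out. The sample sentences, the IPA transcriptions, and the helper names `phoneme_overrides` and `load_lexicon` are illustrative assumptions, not part of either specification; only the element and attribute names (`phoneme`, `break`, `emphasis`, `prosody`, `lexeme`, `grapheme`, `alias`) come from the W3C standards. How a given TTS engine or screen reader actually renders this markup will vary.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by the W3C SSML and PLS recommendations.
SSML_NS = "http://www.w3.org/2001/10/synthesis"
PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"

# SSML: exact pronunciations for homographs, a pause, emphasis,
# and a change in speaking rate and pitch. Sentences are made up.
SSML_TEXT = f"""<speak version="1.1" xmlns="{SSML_NS}" xml:lang="en-US">
  Please <phoneme alphabet="ipa" ph="baʊ">bow</phoneme> to the audience.
  <break time="500ms"/>
  The pipe was made from <phoneme alphabet="ipa" ph="lɛd">lead</phoneme>,
  <emphasis level="strong">not</emphasis> copper.
  <prosody rate="slow" pitch="low">Thank you for listening.</prosody>
</speak>"""

# PLS: a small dictionary covering several words at once.
# The transcription and the expansion are illustrative guesses.
PLS_TEXT = f"""<lexicon version="1.0" xmlns="{PLS_NS}" alphabet="ipa" xml:lang="en-US">
  <lexeme><grapheme>yolo</grapheme><phoneme>ˈjoʊloʊ</phoneme></lexeme>
  <lexeme><grapheme>imho</grapheme><alias>in my humble opinion</alias></lexeme>
</lexicon>"""


def phoneme_overrides(ssml_text):
    """Return {written word: IPA string} for each SSML <phoneme> element."""
    root = ET.fromstring(ssml_text)
    return {el.text: el.get("ph") for el in root.iter(f"{{{SSML_NS}}}phoneme")}


def load_lexicon(pls_text):
    """Return {grapheme: (kind, value)}, where kind is 'phoneme' or 'alias'."""
    root = ET.fromstring(pls_text)
    entries = {}
    for lexeme in root.iter(f"{{{PLS_NS}}}lexeme"):
        word = lexeme.findtext(f"{{{PLS_NS}}}grapheme")
        for kind in ("phoneme", "alias"):
            value = lexeme.findtext(f"{{{PLS_NS}}}{kind}")
            if value is not None:
                entries[word] = (kind, value)
                break
    return entries


if __name__ == "__main__":
    print(phoneme_overrides(SSML_TEXT))
    print(load_lexicon(PLS_TEXT))
```

The SSML fragment handles pronunciation in context, one occurrence at a time, which is what homographs like “bow” and “lead” need. The PLS lexicon handles it in bulk: a word such as “imho” is defined once and applies wherever it appears.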

Integrating these standards in webpage markup should go a long way toward making webpage TTS more intelligible, more natural, and better at conveying the author’s intended meaning. As a bonus, SSML markup can already be used for voice-first applications without a screen at all, including the Amazon Alexa Skills Kit, Google Assistant, and Microsoft Cortana. So text marked up with SSML can be reused for intelligent virtual assistants in addition to its potential use in webpages.

The World Wide Web Consortium is starting to look at publishing standard ways for defining pronunciations in webpages that would be expected to work in every browser. The work is taking place under the Web Accessibility Initiative as part of a potential Task Force on Spoken Presentation. For more information, and to follow the task force’s progress, check out the discussion on the mailing list public-apa@w3.org.

Deborah Dahl, Ph.D., is principal at speech and language consulting firm Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interaction Working Group. She can be reached at dahl@conversational-technologies.com.
