Why Can’t Speech Tech Have a New York Accent?
If you’ve ever heard me speak, it doesn’t take a lot to figure out which part of the country I call home. My accent unmistakably gives me away as an unapologetic New Yorker.
My accent differs markedly from those of some of my friends and former colleagues on Long Island, and it also varies significantly from a Hudson Valley accent, though all three regions within New York State are only a few miles apart.
Even within the confines of New York City, there are subtle differences between speakers in the Bronx and Brooklyn or between people on Staten Island and on Manhattan’s Upper East Side.
I had never really given it much thought in the context of speech application development until reading this month’s feature, “Accents Still Elude Speech Recognition Systems, But for How Long?”, but now I understand why U.S. English text-to-speech systems do not pack a New York accent. Nonetheless, I think they should.
The article points out that the number of people who speak with a particular accent is a subset, often a small one, of everyone using the system. To train for a given accent, it adds, speech technology suppliers need to start by limiting input to individuals who actually speak with it.
With a population of 8.8 million within the confines of New York’s five boroughs, and nearly 19 million if you include the outlying suburbs, one would be hard-pressed to call the New York metropolitan area’s cohort a small subset of any U.S. population group. Add the number of New Yorkers who have moved to other parts of the country but have managed to hold on to some of the native nuances of their speaking style and the case for speech apps with a New York accent just gets stronger.
Instead, most current speech apps default to what is commonly referred to as Standard American English (SAE). SAE is the accent heard on most newscasts throughout the United States; its proponents say it is easily understood by the vast majority of Americans. But SAE is generally distinct from the casual, conversational English spoken by most people across the nation. New York English is far more widely used in everyday speech and, to my mind, would make a far better default.
Changing speech systems to accommodate a different accent or speaking style would have been a far bigger challenge just a few years ago. But, as our cover story, “The Low-Code/No-Code Movement Builds in Speech Technology”, points out, the challenge is far less daunting today. The story makes the point that the availability of no-code/low-code developer platforms means that speech technology vendors and users can “more efficiently offer their products to a wider user base by easily enabling language customization and other capabilities.”
Despite this, the accents feature maintains that “in the end, the work to have [speech engines] account for local dialects is dynamic and excruciatingly complex.” Vendors, it says, “are dabbling with new engine designs that could close the gaps one day.… However, the challenges remain significant, and the quest is a bit quixotic.”
So the challenge I’m issuing for the speech industry is to craft a text-to-speech voice with a truly convincing New York accent. There have been some attempts so far. TikTok had one, but it wasn’t very realistic, in my humble opinion. Narakeet reportedly offers a voice generator with a New York City accent, African-American text-to-speech voices, and several northern and southern U.S. English variants. I haven’t had the chance to try them out yet, but that is on my to-do list for the next few weeks. Overall, though, the options for TTS voices with an authentic New York accent seem to be quite limited. As an industry, I’m sure we can do better. There are certainly enough New York English speakers to feed the models.
Leonard Klie is the editor of Speech Technology magazine. He can be reached at firstname.lastname@example.org.