Speech Technology as Hyperpolyglot? Soon, That Will Be No Hype
Across the globe, many people speak three or four languages out of necessity. A polyglot is generally accepted as one who speaks five or six languages at a high level; a hyperpolyglot, one who speaks more than six.
Having the ability to converse in multiple languages in real time is necessary for everyday living in many parts of the world—and it is a major business advantage. Hyperpolyglots can quickly get to the root of problems and put diverse teams from across the globe on the same path. The key to gainfully using multiple languages in a business context is that everyone is in the same industry, focused on business processes that are understood by all. This reduces cultural considerations that complicate the context of conversations.
An interpreter at the United Nations who fluently spoke well over 15 languages once counseled me that being able to simply “hear the words and speak them in another language” was dangerous in multinational negotiations. To be helpful, “you must understand context!”
Humorous examples exist in most languages of how misunderstandings can occur without proper context. Moving a single word, or stressing the same words differently, can completely change the speaker's intent, even to the point of reversing the intended meaning.
New goals for speech technology are constantly being set, and interpretation is one of the highest-priority targets. Fantastic strides have been made with interpretation from one language to another (and vice versa), with many language pairs rendered astonishingly accurate when constrained to certain accents. To clarify, interpretation is converting spoken language; translation is converting the written word. Therefore, each direction of a bidirectional conversation involves recognizing the input speech as text, translating that text into the second language, and then synthesizing the translated text as audio.
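The three-stage chain just described can be sketched in a few lines. This is a minimal illustration, not any vendor's API; the three stage functions are hypothetical stubs standing in for real recognition, translation, and synthesis engines.

```python
def recognize(audio: bytes, language: str) -> str:
    """Stage 1, speech recognition: input audio -> transcript (stub)."""
    return "hello"  # placeholder transcript

def translate(text: str, source: str, target: str) -> str:
    """Stage 2, machine translation: text -> text in the target language (stub)."""
    return {"hello": "hola"}.get(text, text)

def synthesize(text: str, language: str) -> bytes:
    """Stage 3, speech synthesis: text -> output audio (stub)."""
    return text.encode("utf-8")  # placeholder waveform

def interpret(audio: bytes, source: str, target: str) -> bytes:
    """One direction of interpretation: recognize, translate, synthesize."""
    transcript = recognize(audio, source)
    translated = translate(transcript, source, target)
    return synthesize(translated, target)
```

Running the other direction of the conversation is simply `interpret(audio, target, source)` with the language pair reversed.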
To break down the challenges of polyglot or hyperpolyglot interpretation, a basic understanding of the various accents of the input language is necessary for even simple speech recognition. Companies large and small are working on solving this challenge. An example is Speechmatics, which started with the many dialects and accents of English to create a Global English offering, then followed with a Global Spanish offering. Its system was trained using thousands of hours of spoken data from more than 40 countries and tens of billions of words drawn from global sources. Given that digital voice recordings are abundant and growing for the top 30 most spoken languages, finding sources for training is no longer an impediment.
The middle portion, and arguably the most difficult piece of this puzzle, is context. Even if the recognizer has accurately detected the input language and accent, without context it cannot accurately translate the output speech. Why is this? Primarily because of homonyms, which exist in all of the top 30 languages. True homonyms are both homophones and homographs: the word sounds the same and is spelled the same. Weak homonyms are homophones only, but they present the same challenge of determining which word the speaker intended.
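A toy example makes the homophone problem concrete. "Pair" and "pear" sound identical, so a recognizer must use surrounding words to pick the right one. The word lists and scoring below are illustrative assumptions, a deliberately simplified stand-in for the statistical language models real systems use:

```python
# Hypothetical context words associated with each spelling.
HOMOPHONE_CONTEXTS = {
    "pair": {"shoes", "socks", "two", "couple"},
    "pear": {"fruit", "tree", "ripe", "eat"},
}

def disambiguate(candidates: list[str], context_words: list[str]) -> str:
    """Return the candidate whose context set overlaps the utterance most."""
    def score(word: str) -> int:
        return len(HOMOPHONE_CONTEXTS[word] & set(context_words))
    return max(candidates, key=score)

# "a ripe fruit" shares two words with the "pear" context, none with "pair".
print(disambiguate(["pair", "pear"], ["a", "ripe", "fruit"]))  # pear
```

Strip the context away, as the UN interpreter warned, and the two candidates score identically; there is no principled way to choose.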
This is where artificial intelligence's heavy lifting comes into the picture; AI is where nearly all of the partnerships in machine interpretation are forming. So if you are interested in machine interpretation, watch for these partnership announcements; they're where the greatest strides are being made.
Now add in the need to produce an output in the other language(s), with an accent that can be understood by the listener. Most approaches seem to use a standard accent that is universally understood but can be adjusted upon request of the hearer—for example, starting off with American English but capable of switching to British/Aussie/Kiwi/South African on demand.
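The default-accent-with-override idea above can be sketched as a small voice selector. The voice identifiers here are hypothetical placeholders, not any vendor's actual catalog:

```python
# Hypothetical synthesis-voice identifiers, keyed by output accent.
VOICES = {
    "american": "en-US-standard",
    "british": "en-GB-standard",
    "australian": "en-AU-standard",
    "kiwi": "en-NZ-standard",
    "south_african": "en-ZA-standard",
}

class OutputAccent:
    """Start with a universally understood default; switch on request."""

    def __init__(self, default: str = "american"):
        self.accent = default

    def switch(self, requested: str) -> str:
        """Honor a listener's accent request; keep the current one if unknown."""
        if requested in VOICES:
            self.accent = requested
        return VOICES[self.accent]

tts = OutputAccent()            # begins with American English
print(tts.switch("british"))    # en-GB-standard, switched on demand
```

The design choice mirrors the article's observation: one broadly intelligible default, with the listener, not the system, deciding when to change it.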
As hardcore Star Trek fans know, the "universal translator" was first used in the late 22nd century on Earth for the instant translation of well-known Earth languages. It seems that speech recognition's time for interpretation might come nearly 200 years faster than optimistic sci-fi writers imagined.
Kevin Brown is a customer experience architect. He has more than 25 years of experience designing and delivering speech-enabled solutions. You can reach him at firstname.lastname@example.org.