What’s Next in Speech Technology? I’m Glad You Asked!
About a year ago I decided that the best way to keep track of what might be coming next in the world of speech technology—on the practical side, not the theoretical side—would be to track queries about it. And what better way than through Stack Overflow?
Stack Overflow anchors a sprawling network of question-and-answer sites (the Stack Exchange network) dedicated to everything a developer needs to know. Users log in and post questions that anyone can see and that sufficiently vetted users can answer. Sometimes the quality of a question is low and moderators delete it, but very often you can find a question similar to yours, complete with details and usually one or more solutions. Registered users, not just the original poster, "upvote" and "downvote" both questions and answers. At least once a week, and often many times a day, I peruse Stack Overflow to find answers to obscure questions.
I have learned a few things that I’ll share, even though I find some of the lessons disconcerting.
First, it would seem that CCXML developers are thin on the ground; I counted precisely one CCXML-related question in all of 2017. Perhaps CCXML developers ask questions directly of their system providers; I know I do. With only a handful of CCXML system providers out there, I expect the pool of experts is not all that large. I knew my area of expertise was narrow, but this narrow?
VoiceXML developers, on the other hand, seem a bit more prevalent; I counted about 100 questions in the past year. Pro tip: If you tag your question with "VoiceXML," it will show up in my RSS feed. If you don't tag it, the only way anyone will find your question is by stumbling across it, unless of course they like to comb the site every day for VoiceXML-related questions. I don't, and I missed most of these questions. After all, Stack Overflow receives roughly 8,000 questions on a typical day.
VoiceXML questions reveal a wider range of interests than CCXML questions do. Most relate to telephony: building voice applications of various sorts (answering services, air quality information, travel services). A number pertain to integration: how to connect VoiceXML to Asterisk, or how to tie it to Python or PHP database queries. Some questions are elementary and uninteresting, the developer equivalent of "Where's the on/off switch?"
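To make the flavor of these questions concrete, here is a minimal sketch of a VoiceXML document for one of the scenarios above, an air quality line that collects a city name and hands it to a server-side script. The file names and URL are invented for illustration; a real deployment would point at its own grammar and back end.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="airquality">
    <!-- Ask the caller for a city; the platform's ASR matches against the grammar -->
    <field name="city">
      <prompt>Which city's air quality would you like to hear?</prompt>
      <grammar src="cities.grxml" type="application/srgs+xml"/>
      <filled>
        <!-- Hand the recognized value to a server-side script (PHP, Python, etc.) -->
        <submit next="http://example.com/airquality.php" namelist="city"/>
      </filled>
    </field>
  </form>
</vxml>
```

The `<submit>` element is where the integration questions arise: the voice platform plays the prompts and runs the recognizer, while the application logic lives behind an ordinary HTTP endpoint.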
The most enlightening tag at Stack Overflow is "speech recognition"; I counted more than 1,000 queries for 2017. Over the years speech recognition has consistently accounted for 0.02 percent of all queries, but there's been a slight uptick over the past couple of years.
VoiceXML and CCXML don't even appear in my random samples of general speech recognition queries. A great number discuss integration of automatic speech recognition (ASR): how to integrate with Android or with the Google speech API, the Bing API, Watson, and so on. These questions enlighten and inform me: as I read them, I gain clarity about who offers what in the way of speech recognition services, some of the limitations and pitfalls, and which ASR services work best. Services-based ASR implies (make that "requires") a network-connected device, and while some queries revolve around mobile phones, many imply other devices.
The question that comes to mind: Just what sort of devices are these developers creating? This gets back to what I was most interested in—the practical application of speech technology—but I’m afraid I can’t answer that question. VoiceXML developers sometimes ask specific grammar-related questions, but aside from some hints here and there, device builders tend to query about integration and capabilities rather than grammars or specifics about speech itself.
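For readers who haven't seen the grammar questions VoiceXML developers ask, the W3C's SRGS format (the XML grammar language VoiceXML platforms consume) constrains what the recognizer will accept. A minimal sketch follows; the city list is invented for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" mode="voice" root="city" xml:lang="en-US">
  <!-- The root rule accepts exactly one city name from a fixed list -->
  <rule id="city">
    <one-of>
      <item>Boston</item>
      <item>Chicago</item>
      <item>Seattle</item>
    </one-of>
  </rule>
</grammar>
```

Questions about grammars like this one are what give hints about the application behind them; integration and capability questions, by contrast, reveal almost nothing about the device being built.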
Perhaps the most useful thing gleaned from the speech recognition questions is the discovery of new and interesting projects and services related to ASR and speech. I’ve seen questions about pocketsphinx and cmusphinx; about online services that accept speech and output commands to network-connected devices; about neural net APIs; and more besides. I was perhaps most surprised, in a very pleasant way, to find new projects to provide open-source tools for developers of speech technology itself.
So despite not fulfilling my original goal, my experiment yielded useful results. I discovered a single fount of information containing hints, tips, and answers, along with references to services, tools, and resources, all based on real-world experience. That’s a win in my book.
The Victorian era had butlers and clerks; we have speech recognition and bots.