Gearing up for the Grid: Speech with No Strings Attached!
The Internet and Internet technologies, like Java and XML, have had a profound effect on software architecture. The changes started out modestly with client/server technologies and progressed to distributed object brokerage and object marshalling. Protocols like MRCP have begun to standardize access to speech resources. Speech and IVR application languages have increasingly adopted a declarative approach: VoiceXML and CCXML have drifted away from the traditional procedural paradigm. It is likely that any future Internet technology will also change the way speech is implemented. The questions that need to be asked are: What's the next big thing brewing on the Internet? What is the next major leap forward? And, of course, what is going to be its impact on speech deployments?
Sun is betting that after "The Web," the next big thing is going to be "The Grid." IBM, Oracle, and many research campuses in computer science, genetics and proteomics are also excited about the possibilities of the Grid. Many grid computing products are already available, both commercially and from research labs. Sun has made its N1 Grid Engine 6 available. The Globus Toolkit is an open source toolkit that can be used to build grids. Oracle Grid is advertised to turn 64 PC servers into a mainframe. Google Labs, too, offers Google Compute.
What will be the impact of the Grid on the speech industry?
What is "The Grid"?
The Grid is a collaborative effort to make CPU and storage resources available on demand. It is a step beyond information organization and sharing (the WWW): it is the organization and sharing of computing and storage resources. It attempts to create a supercomputer out of a lot of smaller desktops, only better, more reliable and more available. I found the following definition on the Web at http://www.grid.org/:
"Grid computing is a form of distributed computing that involves coordinating and sharing computing, application, data, storage, or network resources across dynamic and geographically dispersed organizations. Grid technologies promise to change the way organizations tackle complex computational problems."
Does Speech recognition/synthesis/verification really need a super-computer?
No and yes! Almost all commercial speech recognition engines embrace recognition grammars. Recognition grammars, essentially, constrain the possible utterances that are recognized. The problem of recognition is made more tractable still by pruning the Viterbi search space. On one hand, recognition grammars make speech recognition tractable by allowing only a subset of utterances; on the other, they prevent the IVR system from accepting free-form conversational responses. Statistical language models (SLMs) attempt to mitigate this problem but, as we will see, they are almost as limiting. It is quite possible to write grammars that cover all valid English sentences and beyond. Recognition using such grammars may not be tractable on a desktop in real time with acceptable quality while maintaining user independence. There is a definite case for speech recognition and verification on the Grid.
Speech synthesis is a much easier problem in terms of computing resource requirements. With the advent of concatenative approaches to synthesis, synthetic speech has become more human-like. It is less likely to derive tremendous benefit from the Grid.
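To make the idea of pruning concrete, here is a minimal sketch of a Viterbi search with beam pruning over a toy model. The states, the observation symbols ("q", "y", "n") and all probabilities are invented for illustration and do not come from any real recognizer; `viterbi_beam` and `path_log_prob` are hypothetical helper names.

```python
import math

# Toy model (all values are illustrative assumptions).
STATES = ("sil", "yes", "no")
START = {"sil": 0.8, "yes": 0.1, "no": 0.1}
TRANS = {
    "sil": {"sil": 0.4, "yes": 0.3, "no": 0.3},
    "yes": {"sil": 0.3, "yes": 0.6, "no": 0.1},
    "no": {"sil": 0.3, "yes": 0.1, "no": 0.6},
}
EMIT = {
    "sil": {"q": 0.8, "y": 0.1, "n": 0.1},
    "yes": {"q": 0.1, "y": 0.8, "n": 0.1},
    "no": {"q": 0.1, "y": 0.1, "n": 0.8},
}


def path_log_prob(obs, path):
    """Log probability of one complete state path for the observations."""
    lp = math.log(START[path[0]]) + math.log(EMIT[path[0]][obs[0]])
    for prev, cur, o in zip(path, path[1:], obs[1:]):
        lp += math.log(TRANS[prev][cur]) + math.log(EMIT[cur][o])
    return lp


def viterbi_beam(obs, beam_width):
    """Return (log_prob, path), keeping only beam_width hypotheses per step."""
    # One hypothesis per state: (log probability, best path ending there).
    hyps = {s: (math.log(START[s]) + math.log(EMIT[s][obs[0]]), [s])
            for s in STATES}
    for o in obs[1:]:
        # Pruning step: only the best-scoring states are extended.
        survivors = sorted(hyps.items(), key=lambda kv: kv[1][0],
                           reverse=True)[:beam_width]
        hyps = {}
        for s in STATES:
            hyps[s] = max(
                ((lp + math.log(TRANS[prev][s]) + math.log(EMIT[s][o]),
                  path + [s])
                 for prev, (lp, path) in survivors),
                key=lambda h: h[0],
            )
    return max(hyps.values(), key=lambda h: h[0])


if __name__ == "__main__":
    print(viterbi_beam(["q", "y", "y", "q"], beam_width=2))
```

With `beam_width` equal to the number of states this is an exact Viterbi search; a narrower beam does less work per step but may miss the best path, which is the trade-off that makes large grammars hard to search on a single machine.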
Super-computing resources are classically deployed to attempt solutions to non-linear optimization problems. Such problems often take exponential time and are usually posed as a search in an n-dimensional hyperspace. Often the problems are NP-complete or NP-hard. NP-complete problems are a class of problems for which no polynomial-time algorithm is known, so solving them is believed to take exponential time in the worst case; a proposed solution, however, can be verified in polynomial time. Larger instances of such problems may become tractable using the Grid. Genetic algorithms follow a similar approach: a gene pool of candidate answers is generated by mutation or recombination, and a lot of small computers get busy evaluating the candidates.
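The gap between finding and checking a solution can be illustrated with subset sum, a well-known NP-complete problem. The sketch below is only an illustration; the function names `verify` and `brute_force` are hypothetical. Checking a proposed answer is a single pass over it, while the exhaustive search may try up to 2^n subsets. Those candidate subsets are independent of one another, which is what would make it natural to hand them out to many nodes.

```python
from itertools import combinations


def verify(numbers, target, indices):
    """Polynomial-time check: do the chosen positions sum to the target?"""
    distinct = len(set(indices)) == len(indices)
    in_range = all(0 <= i < len(numbers) for i in indices)
    return distinct and in_range and sum(numbers[i] for i in indices) == target


def brute_force(numbers, target):
    """Exhaustive search over all subsets: exponential in len(numbers)."""
    for size in range(len(numbers) + 1):
        for indices in combinations(range(len(numbers)), size):
            if verify(numbers, target, indices):
                return indices
    return None


if __name__ == "__main__":
    numbers = [3, 34, 4, 12, 5, 2]
    solution = brute_force(numbers, 9)
    print(solution, verify(numbers, 9, solution))
```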
Speech recognition is a two-step process. Typically, the first step involves feature extraction from the utterance. The second step is a search and optimization process, performed in the space of all possible utterances allowed by the grammar, to find the best match. Not surprisingly, the grammars are tiny - digits, credit-card numbers and yes/no. Statistical Language Models (SLMs) attempt to identify the likely paths a priori. They chart out the likely responses from data collected in advance.
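As a rough illustration of how a statistical language model charts likely responses from data collected in advance, here is a minimal bigram model with add-one smoothing. The training sentences are invented, and `train_bigram` and `log_prob` are hypothetical names; production SLMs use far larger corpora and better smoothing.

```python
import math
from collections import Counter, defaultdict


def tokenize(sentence):
    """Lower-case words wrapped in sentence start and end markers."""
    return ["<s>"] + sentence.lower().split() + ["</s>"]


def train_bigram(sentences):
    """Count word pairs seen in the collected responses."""
    counts = defaultdict(Counter)
    vocab = set()
    for sentence in sentences:
        tokens = tokenize(sentence)
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += 1
    return counts, vocab


def log_prob(sentence, counts, vocab):
    """Add-one smoothed log probability of a sentence under the model."""
    tokens = tokenize(sentence)
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        seen = counts.get(prev, Counter())
        total += math.log((seen[cur] + 1) / (sum(seen.values()) + len(vocab)))
    return total


if __name__ == "__main__":
    # Invented examples standing in for transcribed caller responses.
    corpus = [
        "i want to check my balance",
        "i want to pay my bill",
        "check my balance",
    ]
    counts, vocab = train_bigram(corpus)
    print(log_prob("i want to check my balance", counts, vocab))
    print(log_prob("balance my check to want i", counts, vocab))
```

Responses that resemble the collected data score higher than unseen word orders, which is also why such a model is only as broad as the data behind it.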
Possible changes to architecture of speech recognition algorithms
Unprocessed speech input is unlikely to be analyzed in parallel. A good approach would first extract features. Extracted features are a more compact representation of an utterance that retains enough information to perform recognition. The utterance in this compact representation can be distributed to different nodes on the Grid. The search and optimization step will reap most of the benefit from the Grid. The following are some of the approaches that are likely to be adopted by speech recognition engines:
- Linguists can attempt to partition the grammar for an entire language, so that each node performs its search in a sub-space.
- Multiple nodes may perform a Viterbi search in the entire space defined by the complete grammar, sharing their partial results often to weed out unlikely alternatives.
- A genetic-algorithm approach.
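The first of these approaches can be sketched in a few lines. This is a toy under stated assumptions: a thread pool stands in for Grid nodes, the "grammar" is a flat list of invented phrases, and a string-similarity ratio from Python's `difflib` stands in for a real acoustic and language score. The names `partition`, `search_partition` and `grid_recognize` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher

# Invented phrase list standing in for a large grammar.
GRAMMAR = [
    "check my balance",
    "pay my bill",
    "transfer funds",
    "speak to an agent",
    "yes",
    "no",
]


def score(hypothesis, observed):
    """Stand-in score; a real engine would score acoustic features."""
    return SequenceMatcher(None, hypothesis, observed).ratio()


def search_partition(phrases, observed):
    """Work done by one node: best (score, phrase) in its sub-grammar."""
    return max((score(p, observed), p) for p in phrases)


def partition(phrases, n_nodes):
    """Split the grammar round-robin into at most n_nodes non-empty shards."""
    shards = [phrases[i::n_nodes] for i in range(n_nodes)]
    return [shard for shard in shards if shard]


def grid_recognize(phrases, observed, n_nodes=3):
    """Search every shard in parallel, then keep the best local winner."""
    shards = partition(phrases, n_nodes)
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        local_bests = list(
            pool.map(lambda shard: search_partition(shard, observed), shards)
        )
    return max(local_bests)


if __name__ == "__main__":
    print(grid_recognize(GRAMMAR, "chek my balanse"))
```

Because each node returns only its local winner, the coordinator's job is a cheap final comparison. The second approach in the list would additionally require nodes to exchange partial scores while the search is still running, which this sketch does not attempt.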
These approaches will open doors for free-running recognition. Voice recognition will no longer be perceived as a convenient alternative to DTMF; it will become the norm - the only natural input. This will eliminate the need for too many custom grammars. Customization will simply require the addition of domain-specific (proper) nouns. Almost anything will be recognized.
What would the Grid mean to companies that host speech applications?
Companies will not have to own computing infrastructure. Computing and storage time will be rented, just like media content from cable companies or electricity from the local utility company. IT infrastructure will cease to be a depreciating asset; it will be an expense that can be written off, with obvious tax benefits. The dynamic availability of computing resources inherent in the Grid will make provisioning simpler and capacity more readily available. Custom grammars, which are a large part of customization costs, will be simpler. Customization will build on top of language models that accept almost all utterances in a language. The costs associated with speech-enabling applications will drop drastically.
The Grid is coming - and so is unconstrained speech recognition!
Sumit Badal is a senior staff software engineer with Nortel and works in the multimedia applications division (formerly Periphonics) in Bohemia. He is a part of the Speech Server/OSCAR team. He has a master's degree in computer and information sciences from the University of Massachusetts at Amherst. His past research has been in the field of computer vision and robotics.