Tips for Adding Speech to Your Metaverse Presence

With more companies already in or considering a presence in the metaverse, speech technology capabilities are top of mind. Speech, of course, is a key element of the kind of realism that will make the metaverse work and feel truly immersive.

While the concept of the metaverse is still relatively new, its key components, including speech technology, are already familiar to many users, says Percy Grunwald, cofounder of Hosting Data, a U.K. web hosting service provider.

“Virtual assistants, such as Amazon’s Alexa and Google Assistant, have been integrated into various virtual environments, allowing users to control virtual objects and access information using natural language commands,” Grunwald says.

Other entities, like Mozilla, the company behind the Firefox web browser, and High Fidelity, a social virtual reality company headquartered in San Francisco, “are developing open-source speech technology platforms that can be used by developers to create more immersive and interactive virtual environments,” he adds.

In addition, a wide range of organizations are experimenting to bring more realistic and immersive experiences to users through their metaverse settings.

Metaverse-style interactions were born in gaming, and game developers today are well along in understanding how to connect with users, engage them, and keep them engaged.

For instance, VRChat, a virtual reality platform, lets users create, customize, and voice-enable their own avatars. Roblox, another social gaming platform, has added voice chat so that users can talk to one another.

Not long after gaming’s first forays into virtual reality, other companies started capitalizing on the gaming world’s popularity and experience with virtual worlds and best practices to enhance their own user experiences. Home Depot, for instance, launched a metaverse on the Roblox platform, specifically appealing to younger users who represent a ready-made audience for metaverse applications given their experiences in the gaming world.

Among top tech players, a speech technology arms race is rapidly evolving. Microsoft has already introduced text-to-speech artificial intelligence, VALL-E, which will allow users to communicate in other languages, even if they don’t actually speak those languages. Not surprisingly, Microsoft has also been identified as the number one investor in the metaverse. Other players competing with Microsoft include Meta, Google, Decentraland, NVIDIA, Shopify, Unity Technologies, and Roblox.

NVIDIA, for instance, is partnering with Mozilla to develop automatic speech recognition that will work for every language speaker around the world, recognizing that standard voice assistants such as Alexa and Google Home support fewer than 1 percent of the world’s spoken languages.

These are only a few examples of the rising interest in speech technology and its applications for virtual worlds like the metaverse. Virtual worlds will not be able to capture and retain user interest without speech capabilities, and developers, and the companies turning to these developers, know that.

Sound and Speech in the Metaverse

“In the metaverse, spatial audio can enhance the sense of presence and immersion by creating a realistic soundscape,” says Allison DeLeone, CEO of TPN+Evia, a creative and production agency based in Seattle. “It can help to localize and differentiate sounds from different sources, creating a more realistic, comfortable, and immersive experience.”

As an example of how this might work, DeLeone suggests imagining a walk through a forest. Spatial audio in the metaverse can help create the sensation of being surrounded by the sounds of leaves rustling, birds chirping, and water flowing, with each sound coming from a specific location in the virtual environment. “With spatial audio, users can pinpoint the location of sounds in virtual environments, making it easier to navigate and interact with digital objects,” she says.
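The idea of placing each sound at a specific location can be illustrated with a minimal sketch. It assumes a flat 2-D world, a listener with a position and a heading, inverse-distance attenuation, and constant-power stereo panning. The function name `stereo_gains` and its parameters are hypothetical and chosen for illustration; production engines typically use far richer models, such as head-related transfer functions.

```python
import math


def stereo_gains(listener_xy, listener_heading, source_xy, ref_distance=1.0):
    """Return (left_gain, right_gain) for a sound source on a 2-D plane.

    listener_heading is in radians, measured counterclockwise from the +x axis.
    """
    dx = source_xy[0] - listener_xy[0]
    dy = source_xy[1] - listener_xy[1]
    distance = math.hypot(dx, dy)

    # Inverse-distance attenuation, clamped so that sources closer than
    # ref_distance never exceed full volume.
    attenuation = ref_distance / max(distance, ref_distance)

    # Angle of the source relative to the direction the listener faces.
    relative = math.atan2(dy, dx) - listener_heading

    # pan runs from -1 (fully left) to +1 (fully right).
    pan = -math.sin(relative)

    # Constant-power panning keeps perceived loudness steady across the arc.
    theta = (pan + 1.0) * math.pi / 4.0
    return attenuation * math.cos(theta), attenuation * math.sin(theta)


if __name__ == "__main__":
    # Listener at the origin facing "north" (+y); a bird chirps to the east.
    print(stereo_gains((0.0, 0.0), math.pi / 2, (1.0, 0.0)))
```

In the forest example, each rustling leaf, bird, or stream would get its own source position, and the gains would be recomputed as the listener moves or turns.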

Sound, of course, can also enhance social interactions in the metaverse. “Users can locate and interact based on their spatialized audio cues, helping to facilitate more natural and immersive conversations,” DeLeone says. This, she says, can be especially useful for people who might have difficulty with traditional text-based chat interfaces.

Sound and speech, while both auditory, are nevertheless two very distinct elements of interactive metaverse experiences. Grunwald maintains that one of the main challenges in implementing speech technology in the metaverse is the need for high-quality, natural-sounding speech. “This requires sophisticated speech synthesis and recognition technologies that can accurately interpret and respond to user input.”

In addition, he points out that companies also need to ensure that their speech technology will work seamlessly with other elements of the metaverse environment, like virtual objects, animations, and sound effects.

One of the biggest risks in the metaverse, DeLeone says, “is the potential for misuse or abuse, including hate speech, cyberbullying, and harassment.” Since interactions are often anonymized, she says, “it can be difficult to identify and hold individuals accountable.”

Nigel Cannings, cofounder and chief technology officer of Intelligent Voice, agrees: “As we have seen already with some of the nonsense and bile thrown out by ChatGPT, all is not rosy in the modern garden of large language models and natural language processing technology.”

It’s relatively easy, he says, “to subvert the technology or even to accidentally offend it.” That’s not only potentially annoying for adults but also “presents real challenges for dealing with interactions with children and teenagers, whose protection is not only required by law but is a moral imperative.”

In addition, Grunwald says, participants in virtual worlds “may share sensitive personal information, such as their voiceprints, which could be used to identify them in other contexts.” He adds: “Companies need to ensure that their speech technology is secure and compliant with relevant data protection regulations, such as [the General Data Protection Regulation in Europe and the California Consumer Privacy Act].”

The technology required to leverage and optimize speech in virtual environments can itself pose challenges.

Alberto Zamora, a metaverse computer-generated imagery artist and interactive developer at TPN+Evia, says that from a technical standpoint, “enabling avatars with speech capabilities can create unnecessary ambient noise that destroys the user experience.”

In addition, Cannings points out, such technology can’t just be used out of the box. There are a lot of words that need to be introduced that might be application-specific—names of characters or areas in the metaverse, for instance.

Another issue is cost, which can be one of the biggest barriers to properly voice-enabling applications. “We still see many companies relying on expensive cloud providers for their speech technology,” Cannings says.

Despite these potential drawbacks, the opportunity for speech technology in metaverse environments “is immensely exciting for anyone who has been working in the speech technology and NLP areas for any length of time,” he says. “This is really the realization of the dream that many of us have had.”

Speech technology, Cannings continues, “makes the metaverse significantly more immersive from an emotional standpoint, not just a visual standpoint, and so it is likely to engage more people for longer.”

Grunwald agrees. Ample opportunities exist for implementing speech technology in the metaverse, he says. “Speech-enabled avatars can provide a more immersive and naturalistic experience for users, allowing them to interact with virtual objects and other users in a more intuitive and engaging way.”

Speech technology can also be used to “gather valuable data about user preferences and behavior, which can be used to optimize products and services,” Grunwald says.

Don’t Be Left Out

With so many companies already creating and introducing metaverse environments for their users, it’s not inconceivable that there will come a time when virtually every company needs to establish its own virtual world, with speech technology that draws users into an immersive experience. That was the case, after all, with the introduction of the internet, when companies decided they needed their own individual web pages.

Making this happen requires considerations that are both technological and practical in nature.

“To voice-enable their avatars, companies need to consider several key factors, including the type of speech technology they want to use, the accuracy and reliability of that technology, and the potential impact on user privacy and security,” Grunwald suggests.

Tech blogger Michael Smith points to additional technology-related factors required to provide a positive user experience. These include the quality of the voice recognition software, the accuracy of speech-to-text conversion, and the integration of natural language processing capabilities.

“One of the biggest challenges for companies integrating speech technology in the metaverse is ensuring that the voice recognition software can accurately distinguish between different accents, languages, and speech patterns,” he says.

The diversity of the metaverse user base, Smith adds, “means that voice recognition technology needs to be capable of understanding and processing a wide range of linguistic variations to ensure effective communication.”

To effectively bring speech capabilities into the metaverse environment, the following technical elements are required, Zamora says:

  • voice over Internet Protocol (VoIP) to transport the user’s voice over the internet;
  • spatial audio positioning; and
  • real-time lip-synchronization software, which analyzes the audio signals to determine which sounds are being produced and how they should be represented in the avatar’s lip movements through defined morphing targets.
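The lip-synchronization element in the list above can be sketched roughly as follows. The sketch assumes the audio analysis has already produced a list of timed phonemes; it then maps each phoneme to a mouth shape (a viseme) and blends linearly between consecutive shapes to produce morph-target weights. The table `PHONEME_TO_VISEME`, the target names, and the function `viseme_weights` are all hypothetical; real systems use larger viseme sets and their own rig-specific target names.

```python
# Hypothetical phoneme-to-viseme table; the target names are illustrative only.
PHONEME_TO_VISEME = {
    "AA": "open", "IY": "wide", "UW": "round",
    "M": "closed", "B": "closed", "P": "closed",
    "F": "teeth_on_lip", "V": "teeth_on_lip",
}
REST = "rest"


def viseme_weights(timed_phonemes, t, blend=0.05):
    """Return morph-target weights at time t (seconds).

    timed_phonemes is a list of (phoneme, start, end) tuples sorted by start.
    Each new mouth shape fades in over `blend` seconds from the previous one.
    """
    previous, previous_end = REST, None
    for phoneme, start, end in timed_phonemes:
        if previous_end is not None and start > previous_end:
            previous = REST  # silence between phonemes relaxes the mouth
        if t < start:
            break  # t falls in a gap before this phoneme
        current = PHONEME_TO_VISEME.get(phoneme, REST)
        if t < end:
            if current == previous:
                return {current: 1.0}
            w = min((t - start) / blend, 1.0) if blend > 0 else 1.0
            return {previous: 1.0 - w, current: w}
        previous, previous_end = current, end
    return {REST: 1.0}


if __name__ == "__main__":
    # The word "ma": lips closed for "M", then open for "AA".
    phonemes = [("M", 0.0, 0.1), ("AA", 0.1, 0.3)]
    for t in (0.0, 0.125, 0.2, 0.5):
        print(t, viseme_weights(phonemes, t))
```

An avatar renderer would call a function like this once per frame and apply the returned weights to the corresponding morph targets on the face mesh.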

In addition, “in the case that the avatar is controlled by the computer as a virtual host, other technologies must be able to make the 3-D character speak by using defined or AI-generated content,” Zamora says. These tools include text-to-speech and finite state machines to simulate a logical conversation using predefined texts.
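A finite state machine driving a computer-controlled host with predefined texts might look something like the sketch below. The states, keywords, and lines of dialogue are invented for illustration, and the class name `VirtualHost` is hypothetical. The text-to-speech step is left out: the returned string is what would be handed to whatever synthesis engine the platform uses.

```python
import re

# Hypothetical conversation script: each state has a predefined line to speak
# and a set of keywords that move the conversation to another state.
STATES = {
    "greeting": {
        "say": "Welcome! Would you like a tour or help finding a product?",
        "next": {"tour": "tour", "product": "product_help"},
    },
    "tour": {
        "say": "Follow me. Say 'product' for product help or 'bye' to leave.",
        "next": {"product": "product_help", "bye": "farewell"},
    },
    "product_help": {
        "say": "Our catalog is on your left. Say 'tour' or 'bye' when ready.",
        "next": {"tour": "tour", "bye": "farewell"},
    },
    "farewell": {"say": "Thanks for visiting!", "next": {}},
}


class VirtualHost:
    """A keyword-driven finite state machine for a scripted avatar."""

    def __init__(self, states, start="greeting"):
        self.states = states
        self.state = start

    def prompt(self):
        return self.states[self.state]["say"]

    def handle(self, utterance):
        """Advance the machine on a recognized utterance; return the reply text."""
        words = re.findall(r"[a-z']+", utterance.lower())
        for keyword, target in self.states[self.state]["next"].items():
            if keyword in words:
                self.state = target
                return self.prompt()
        # No transition matched: stay in place and repeat the current line.
        return "Sorry, I did not catch that. " + self.prompt()


if __name__ == "__main__":
    host = VirtualHost(STATES)
    print(host.prompt())
    print(host.handle("I'd like a tour, please"))
    print(host.handle("bye"))
```

Because every reply is predefined, this approach is predictable and easy to moderate, at the cost of handling only the conversations its authors anticipated.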

In more advanced approaches, he says, developers might use an artificial intelligence language model such as ChatGPT.

Cannings points to NVIDIA as one example of a company that is “doing its best to help power speech in the metaverse.” NVIDIA’s Jarvis platform, he says, “enables conversational AI in a number of languages, and the new Omniverse Cloud platform looks to lead the way in easily enabling a multimodel metaverse.”

From a non-technological standpoint and to address potential security and privacy concerns, Grunwald recommends that companies have clear policies and guidelines in place to ensure that speech capabilities in metaverse environments are not being used in a way that might violate user rights or represent harm to the broader community.

Finally, from a user engagement perspective, if companies want to take their metaverse experiences to the masses, they need to think of ways to effectively engage and educate non-gamer audiences. Gamers likely have an edge over other potential users, like their parents and grandparents. But these latter audiences also represent potential for mainstream companies with large target audiences, like Home Depot, Coca-Cola, Wendy’s, and others, that are already establishing a presence and seeking ways to gain audience share.

Can you imagine interacting with others in a world without speech? Developers, companies, and their users can’t imagine interacting with others in the metaverse without it. The more realistic these interactions can be, and the more companies can ensure security and privacy, the more likely they will be able to gain awareness and build audience in their virtual worlds. 

Linda Pophal is a freelance business journalist and content marketer who writes for various business and trade publications. Pophal does content marketing for Fortune 500 companies, small businesses, and individuals on a wide range of subjects, from human resource management and employee relations to marketing, technology, healthcare industry trends, and more.
