Soul App Launches Full-Duplex Voice Model

Soul App has upgraded its end-to-end full-duplex voice call model, doing away with traditional concepts such as voice activity detection (commonly used to detect where speech starts and ends) and response latency, and breaking from the industry-standard turn-by-turn interaction pattern.

Instead, it empowers the AI to decide autonomously when to speak: proactively breaking silences, interrupting users when appropriate, listening while it talks, perceiving time semantics, carrying on parallel discussions, and more. The model also supports multi-dimensional perception (including time, environment, and event awareness) and natural speech features (e.g., filler words, stammering, and noticeable emotional fluctuations).
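To make the contrast with turn-by-turn pipelines concrete, here is a minimal, hypothetical sketch of a per-frame decision loop in which a policy, rather than an endpointing rule, chooses among listening, speaking, and proactively breaking a silence. Nothing here reflects Soul's actual implementation: the class names, frame size, threshold, and the crude energy counter (which is itself just the kind of hand-built heuristic the end-to-end model is said to replace) are all assumptions for illustration.

```python
from enum import Enum, auto

FRAME_MS = 80          # assumed streaming granularity (not from the article)
SILENCE_FRAMES = 25    # ~2 s of quiet before proactively speaking (assumption)

class Action(Enum):
    LISTEN = auto()         # stay silent, keep ingesting user audio
    SPEAK = auto()          # emit the next chunk of synthesized speech
    BREAK_SILENCE = auto()  # proactively open a topic after prolonged quiet

class ToyDuplexPolicy:
    """Stand-in for the end-to-end model's per-frame decision head."""

    def __init__(self) -> None:
        self.quiet = 0  # consecutive near-silent frames observed

    def step(self, user_energy: float, model_speaking: bool) -> Action:
        # A real model would predict this from interleaved audio/text tokens;
        # the energy counter below is exactly the kind of hand-built heuristic
        # the end-to-end approach replaces.
        if user_energy < 0.05:
            self.quiet += 1
        else:
            self.quiet = 0
        if self.quiet >= SILENCE_FRAMES and not model_speaking:
            self.quiet = 0
            return Action.BREAK_SILENCE
        return Action.SPEAK if model_speaking else Action.LISTEN

if __name__ == "__main__":
    policy = ToyDuplexPolicy()
    frames = [0.4] * 10 + [0.0] * 30  # user talks, then goes quiet
    for i, energy in enumerate(frames):
        if policy.step(energy, model_speaking=False) is Action.BREAK_SILENCE:
            print(f"frame {i} (~{i * FRAME_MS} ms): model starts a topic")
```

Run as-is, the toy prints the frame at which it decides to break the silence; in an end-to-end model, that decision would instead be one more prediction made by the network itself at every frame.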

"Social interaction is an exchange of emotional and informational value. Soul remains committed to leveraging innovative technology and product solutions to deliver smarter, more immersive, and higher-quality interactive experiences, making loneliness go away for all," said Tao Ming, chief technology officer of Soul App, in a statement.

Soul has now upgraded the model with the following capabilities:

  • Full-Duplex Interaction -- The new model performs streaming prediction of responses, listening, and interruptions, with the AI autonomously deciding when to speak. This achieves true end-to-end full-duplex interaction, in which AI and users can talk simultaneously (e.g., debating, arguing, singing), appropriately interrupt each other, or proactively break a silence to start a topic.
  • Colloquial & Emotional Expression -- The model achieves comprehensive enhancements across multiple dimensions, including emotional expression, vocal characteristics, and conversational content. In emotional expression, beyond foundational capabilities like laughter, crying, or anger, the upgraded model delivers more pronounced vocal fluctuations that evolve naturally with the conversation. Its pronunciation now incorporates organic speech elements such as filler words, occasional stammering, common catchphrases, coughs, and other everyday vocal nuances. Furthermore, AI-generated dialogue leans into colloquial and socially fluid language rather than rigid, written-language patterns.
  • Contextual Awareness -- Built on a pure autoregressive architecture with unified text/audio generation (Unified Model), the model leverages strong large language model capabilities to weave persona, time, environment, and dialogue context into AI responses (a toy sketch of this unified sequence follows this list). This lets a perceptive, understanding AI better shape digital personalities, create rich storylines, and turn interactions into genuine exchanges of emotion and information.
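
As a rough illustration of what "unified text/audio generation" can mean, the sketch below flattens persona, time, environment, and two audio streams into a single token sequence for one autoregressive decoder. The special tokens and codec-token notation are invented here for exposition and are not Soul's published format.

```python
def build_sequence(persona: str, local_time: str, environment: str,
                   user_audio_tokens: list[int],
                   reply_audio_tokens: list[int]) -> list[str]:
    """Flatten context plus both audio streams into one decoder sequence."""
    seq = ["<persona>", persona, "<time>", local_time, "<env>", environment]
    seq.append("<user_audio>")
    seq += [f"<a{t}>" for t in user_audio_tokens]   # assumed neural-codec tokens
    seq.append("<reply_audio>")
    seq += [f"<a{t}>" for t in reply_audio_tokens]  # the part the model predicts
    return seq

print(build_sequence("warm, playful companion", "23:40 local", "quiet room",
                     [17, 302, 88], [45, 911]))
```

A true full-duplex model would presumably interleave the user and reply audio tokens frame by frame rather than in blocks, so the decoder can decide mid-utterance whether to respond, interrupt, or stay quiet.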

Soul's AI team is exploring how to extend its full-duplex voice call model to multi-person scenarios. In group voice conversations, for example, the AI would use its autonomous decision-making capability to identify good moments to speak, help start and extend topics, and join authentic social dynamics as an active participant.
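Because the multi-person work is still exploratory, the following is purely speculative: one simple way to frame "identify optimal speaking moments" is a per-frame score of how open the conversational floor is across all participants. The function names, the energy inputs, and the 0.8 threshold are illustrative assumptions, not anything Soul has described.

```python
from collections.abc import Mapping

def score_speaking_moment(frame_energies: Mapping[str, float]) -> float:
    """Stub for a model head estimating 'is this a good moment to speak?'.

    frame_energies maps participant id -> current frame energy; a real model
    would consume raw audio and conversation history, not energies.
    """
    # Toy rule: the floor looks open when every participant is near-silent.
    return 1.0 - max(frame_energies.values(), default=0.0)

def maybe_join(frame_energies: Mapping[str, float], threshold: float = 0.8) -> bool:
    return score_speaking_moment(frame_energies) >= threshold

print(maybe_join({"alice": 0.02, "bob": 0.05}))  # True: the floor is open
print(maybe_join({"alice": 0.60, "bob": 0.03}))  # False: alice is talking
```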
