Speech-to-Speech (Amazon Nova 2 Sonic)
Amazon Nova models provide powerful speech capabilities, including speech understanding and real-time speech-to-speech conversations with Amazon Nova 2 Sonic.
Amazon Nova 2 Sonic enables real-time conversational AI with speech input and output. This section covers advanced capabilities for building interactive voice assistants, customer service automation and conversational applications.
Key features
Amazon Nova 2 Sonic provides the following capabilities:
- State-of-the-art streaming speech understanding with a bidirectional streaming API that enables real-time, low-latency, multi-turn conversations.
- Multilingual support with automatic language detection and switching. Expressive voices, both masculine-sounding and feminine-sounding, are offered in the following languages:
  - English (US, UK, India, Australia)
  - French
  - Italian
  - German
  - Spanish
  - Portuguese
  - Hindi
- Polyglot voices that can speak any of the supported languages, giving a consistent user experience even when the user switches languages within the same session.
- Robustness to background noise for real-world deployment scenarios.
- Robustness to different accents for supported languages.
- Natural, human-like conversational AI experiences with contextual richness across all supported languages.
- Adaptive speech response that dynamically adjusts delivery based on the prosody of the input speech.
- Intelligent turn-taking that detects when users finish speaking and when the assistant should respond, creating a natural dialogue rhythm.
- Graceful handling of user interruptions without dropping conversational context.
- Knowledge grounding with enterprise data using Retrieval Augmented Generation (RAG).
- Function calling and agentic workflow support for building complex AI applications.
- Asynchronous tool handling that executes tool calls while maintaining conversation flow, allowing the assistant to continue speaking while tools process in the background.
- Cross-modal input support for both audio and text inputs within the same conversation, enabling flexible interaction patterns.
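
The asynchronous tool handling capability listed above can be pictured with a minimal Python sketch. It does not call Amazon Nova 2 Sonic or any AWS API: `slow_tool`, `speak`, and `converse` are hypothetical names, and the sleeps stand in for real tool latency and audio streaming. The sketch only illustrates the general pattern of starting a tool call in the background, continuing to speak while it runs, and folding the result into the conversation once it completes.

```python
import asyncio


async def slow_tool(query: str) -> str:
    """Hypothetical tool: stands in for a slow lookup (database, HTTP call)."""
    await asyncio.sleep(0.05)  # simulated tool latency
    return f"result for {query!r}"


async def speak(transcript: list[str], text: str) -> None:
    """Hypothetical stand-in for streaming one chunk of assistant speech."""
    await asyncio.sleep(0.01)  # simulated time to emit the chunk
    transcript.append(text)


async def converse(query: str) -> list[str]:
    """Run the tool in the background while the assistant keeps talking."""
    transcript: list[str] = []
    # Start the tool call without awaiting it, so speech is not blocked.
    tool_task = asyncio.create_task(slow_tool(query))

    await speak(transcript, "Let me look that up for you.")
    # Keep the conversation going until the tool finishes.
    while not tool_task.done():
        await speak(transcript, "Still checking...")

    # Fold the tool result into the next assistant turn.
    await speak(transcript, f"Here is what I found: {tool_task.result()}")
    return transcript


if __name__ == "__main__":
    for line in asyncio.run(converse("order status")):
        print(line)
```

The number of "Still checking..." lines depends on timing, so the sketch makes no promise about it; the point is that the tool call and the spoken output overlap rather than run one after the other.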