Speech-to-Speech (Amazon Nova 2 Sonic)

Amazon Nova models provide speech capabilities, including speech understanding and real-time speech-to-speech conversations through Amazon Nova 2 Sonic.

Amazon Nova 2 Sonic enables real-time conversational AI with speech input and output. This section covers advanced capabilities for building interactive voice assistants, customer service automation and conversational applications.

Key features

Amazon Nova 2 Sonic provides the following capabilities:

  • State-of-the-art streaming speech understanding with a bidirectional streaming API that enables real-time, low-latency, multi-turn conversations.

  • Multilingual support with automatic language detection and switching. Expressive masculine-sounding and feminine-sounding voices are available in the following languages:

    • English (US, UK, India, Australia)

    • French

    • Italian

    • German

    • Spanish

    • Portuguese

    • Hindi

  • Polyglot voices that can speak any of the supported languages to enable a consistent user experience even when the user switches languages within the same session.

  • Robustness to background noise for real-world deployment scenarios.

  • Robustness to different accents for supported languages.

  • Natural, human-like conversational AI experiences with contextual richness across all supported languages.

  • Adaptive speech response that dynamically adjusts delivery based on the prosody of the input speech.

  • Intelligent turn-taking that detects when users finish speaking and when the assistant should respond, creating natural dialogue rhythm.

  • Graceful handling of user interruptions without dropping conversational context.

  • Knowledge grounding with enterprise data using Retrieval Augmented Generation (RAG).

  • Function calling and agentic workflow support for building complex AI applications.

  • Asynchronous tool handling that executes tool calls while maintaining conversation flow, allowing the assistant to continue speaking while tools process in the background.

  • Cross-modal input support for both audio and text inputs within the same conversation, enabling flexible interaction patterns.
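
The asynchronous tool handling described in the list above can be illustrated with a minimal, self-contained sketch. It does not call Amazon Nova 2 Sonic or any AWS API. It only simulates the pattern with Python's asyncio: a tool call starts in the background while the assistant keeps emitting speech chunks, and the result is folded in once it is ready. The names slow_tool and converse, the "order status" query, and the sleep durations are all hypothetical and chosen for illustration.

```python
import asyncio


async def slow_tool(query: str) -> str:
    """Hypothetical tool: stands in for a database or API lookup."""
    await asyncio.sleep(0.05)  # simulated tool latency
    return f"result for {query!r}"


async def converse(filler_chunks: list[str]) -> list[str]:
    """Emit speech chunks while a tool call runs in the background."""
    events: list[str] = []
    # Start the tool call without awaiting it, so speech is not blocked.
    tool_task = asyncio.create_task(slow_tool("order status"))
    for chunk in filler_chunks:
        events.append(f"speak: {chunk}")
        await asyncio.sleep(0.01)  # simulated time to stream one audio chunk
    # Wait for the tool result only after the speech chunks have been sent.
    result = await tool_task
    events.append(f"tool: {result}")
    return events


if __name__ == "__main__":
    for event in asyncio.run(converse(["Let me check that.", "One moment."])):
        print(event)
```

In this sketch the speech chunks are recorded before the tool result because the task is awaited only after the loop ends. A real integration would instead react to tool-use and tool-result events on the bidirectional stream, whose event names and payloads are not shown here.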