Neural voices - Amazon Polly

Neural voices

Amazon Polly has a neural text-to-speech (NTTS) engine that can produce higher-quality voices than its standard voices. Standard TTS voices use concatenative synthesis: the standard engine concatenates phonemes of recorded speech, producing very natural-sounding synthesized speech. However, the inevitable variations in speech and the techniques used to segment the waveforms limit the quality of the result. The Amazon Polly NTTS engine doesn't use standard concatenative synthesis to produce speech. It has two parts:

  • A neural network that converts a sequence of phonemes (the most basic units of language) into a sequence of spectrograms. (Spectrograms are snapshots of the energy levels in different frequency bands.)

  • A vocoder that converts spectrograms into a nearly continuous audio signal.

The first component of the neural TTS system is a sequence-to-sequence model. This model doesn't derive each output element solely from the corresponding input element; it also considers how the elements of the input sequence work together. The model chooses the spectrograms that it outputs so that their frequency bands emphasize acoustic features that the human brain uses when processing speech.

The output of this model then passes to a neural vocoder, which converts the spectrograms into speech waveforms. When trained on the large datasets used to build general-purpose concatenative-synthesis systems, this sequence-to-sequence approach yields higher-quality, more natural-sounding voices.
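The two-stage data flow described above can be illustrated with a deliberately tiny sketch. This is a toy and not how Amazon Polly is implemented: the phoneme symbols, the band energies, the band frequencies, and both function names are hypothetical, and simple averaging and sine waves stand in for the neural networks. It only shows that each spectrogram frame depends on neighbouring phonemes and that a second stage turns frames into waveform samples.

```python
import math

# Hypothetical per-phoneme energies in three frequency bands (made-up numbers).
BAND_ENERGY = {
    "HH": [0.1, 0.2, 0.7],
    "AY": [0.8, 0.5, 0.1],
}
BAND_HZ = [200.0, 1000.0, 3000.0]  # assumed centre frequency of each band


def acoustic_model(phonemes):
    """Stage 1 stand-in: phonemes -> spectrogram frames.

    Each frame averages a phoneme's band energies with those of its
    neighbours, so the output depends on context, not on one input alone.
    """
    frames = []
    for i in range(len(phonemes)):
        window = phonemes[max(0, i - 1): i + 2]
        frames.append([
            sum(BAND_ENERGY[p][b] for p in window) / len(window)
            for b in range(len(BAND_HZ))
        ])
    return frames


def vocoder(frames, sample_rate=8000, samples_per_frame=80):
    """Stage 2 stand-in: spectrogram frames -> waveform samples."""
    samples = []
    for n, frame in enumerate(frames):
        for k in range(samples_per_frame):
            t = (n * samples_per_frame + k) / sample_rate
            samples.append(sum(
                energy * math.sin(2 * math.pi * hz * t)
                for energy, hz in zip(frame, BAND_HZ)
            ))
    return samples


frames = acoustic_model(["HH", "AY"])
waveform = vocoder(frames)
```

Note that the frame produced for "HH" changes when "AY" follows it, which is the context dependence the sequence-to-sequence description refers to.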

Available neural voices

Neural voices are available in 33 languages and language variants. The following table lists the voices.

#   Language and language variants  Language code  Name/ID          Gender
--  ------------------------------  -------------  ---------------  --------------
1   Arabic (Gulf)                   ar-AE          Hala             Female
                                                   Zayd             Male
2   Belgian Dutch (Flemish)         nl-BE          Lisa             Female
3   Catalan                         ca-ES          Arlet            Female
4   Chinese (Cantonese)             yue-CN         Hiujin           Female
5   Chinese (Mandarin)              cmn-CN         Zhiyu            Female
6   Danish                          da-DK          Sofie            Female
7   Dutch                           nl-NL          Laura            Female
8   English (Australian)            en-AU          Olivia           Female
9   English (British)               en-GB          Amy*             Female
                                                   Emma             Female
                                                   Brian            Male
                                                   Arthur           Male
10  English (Indian)                en-IN          Kajal            Female
11  English (Irish)                 en-IE          Niamh            Female
12  English (New Zealand)           en-NZ          Aria             Female
13  English (South African)         en-ZA          Ayanda           Female
14  English (US)                    en-US          Danielle         Female
                                                   Gregory          Male
                                                   Ivy              Female (child)
                                                   Joanna*          Female
                                                   Kendra           Female
                                                   Kimberly         Female
                                                   Salli            Female
                                                   Joey             Male
                                                   Justin           Male (child)
                                                   Kevin            Male (child)
                                                   Matthew*         Male
                                                   Ruth             Female
                                                   Stephen          Male
15  Finnish                         fi-FI          Suvi             Female
16  French (Belgian)                fr-BE          Isabelle         Female
17  French (Canadian)               fr-CA          Gabrielle        Female
                                                   Liam             Male
18  French                          fr-FR          Léa              Female
                                                   Rémi             Male
19  German                          de-DE          Vicki            Female
                                                   Daniel           Male
20  German (Austrian)               de-AT          Hannah           Female
21  Hindi                           hi-IN          Kajal            Female
22  Italian                         it-IT          Bianca           Female
                                                   Adriano          Male
23  Japanese                        ja-JP          Takumi           Male
                                                   Kazuha           Female
                                                   Tomoko           Female
24  Korean                          ko-KR          Seoyeon          Female
25  Norwegian                       nb-NO          Ida              Female
26  Polish                          pl-PL          Ola              Female
27  Portuguese (Brazilian)          pt-BR          Camila           Female
                                                   Vitória/Vitoria  Female
                                                   Thiago           Male
28  Portuguese (European)           pt-PT          Inês/Ines        Female
29  Spanish (European)              es-ES          Lucia            Female
                                                   Sergio           Male
30  Spanish (Mexican)               es-MX          Mia              Female
                                                   Andrés           Male
31  Spanish (US)                    es-US          Lupe*            Female
                                                   Pedro            Male
32  Swedish                         sv-SE          Elin             Female
33  Turkish                         tr-TR          Burcu            Female

*The Amy, Joanna, Lupe, and Matthew voices can be used with the Newscaster speaking style. For more information, see Newscaster voices.
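Because the set of voices changes over time, it can be useful to query it rather than rely on a static list. The following is a minimal sketch using the AWS SDK for Python (boto3) and the DescribeVoices operation. The function names and the default Region are assumptions chosen for illustration, the SDK and credentials are assumed to be configured, and pagination (NextToken) is ignored for brevity.

```python
def neural_voices(voices):
    """Keep only voice records that list 'neural' among their supported engines."""
    return [v for v in voices if "neural" in v.get("SupportedEngines", [])]


def fetch_neural_voices(language_code, region_name="us-east-1"):
    """Call DescribeVoices for one language and return (Id, Gender) pairs.

    Assumes AWS credentials are already configured; reads only the first
    page of results.
    """
    import boto3  # imported lazily so the filter above works without the SDK

    polly = boto3.client("polly", region_name=region_name)
    response = polly.describe_voices(Engine="neural", LanguageCode=language_code)
    return [(v["Id"], v["Gender"]) for v in neural_voices(response["Voices"])]
```

For example, `fetch_neural_voices("en-GB")` would be expected to return the British English voices from the table, provided the chosen Region supports neural voices.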

Feature and region compatibility

Neural voices aren't available in all AWS Regions, nor do they support all Amazon Polly features.

Neural voices are supported in the following Regions:

  • US East (N. Virginia): us-east-1

  • US West (Oregon): us-west-2

  • Africa (Cape Town): af-south-1

  • Asia Pacific (Tokyo): ap-northeast-1

  • Asia Pacific (Seoul): ap-northeast-2

  • Asia Pacific (Osaka): ap-northeast-3

  • Asia Pacific (Mumbai): ap-south-1

  • Asia Pacific (Singapore): ap-southeast-1

  • Asia Pacific (Sydney): ap-southeast-2

  • Canada (Central): ca-central-1

  • Europe (Frankfurt): eu-central-1

  • Europe (Ireland): eu-west-1

  • Europe (London): eu-west-2

  • Europe (Paris): eu-west-3

  • AWS GovCloud (US-West): us-gov-west-1

Endpoints and protocols for these Regions are identical to those used for standard voices. For more information, see Amazon Polly endpoints and quotas.
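An application that lets users pick a Region might guard against unsupported choices before requesting a neural voice. The sketch below simply encodes the Region list above as a set; the names `NEURAL_REGIONS` and `supports_neural` are hypothetical, and the list reflects this page only, so it may fall out of date.

```python
# Regions copied from the list above; update if AWS adds or removes Regions.
NEURAL_REGIONS = frozenset({
    "us-east-1", "us-west-2", "af-south-1",
    "ap-northeast-1", "ap-northeast-2", "ap-northeast-3",
    "ap-south-1", "ap-southeast-1", "ap-southeast-2",
    "ca-central-1", "eu-central-1", "eu-west-1",
    "eu-west-2", "eu-west-3", "us-gov-west-1",
})


def supports_neural(region_name):
    """Return True if the Region appears in the neural-voice list above."""
    return region_name in NEURAL_REGIONS
```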

The following features are supported for neural voices:

  • Real-time and asynchronous speech synthesis operations.

  • Newscaster speaking style. For more information about the speaking styles, see Newscaster voices.

  • All speech marks.

  • Many (but not all) of the SSML tags that are supported by Amazon Polly. For more information about NTTS-supported SSML tags, see Supported Tags.
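As an illustration of the Newscaster speaking style listed above, the following sketch builds the parameters for a SynthesizeSpeech request. It assumes the style is selected with the `<amazon:domain name="news">` SSML tag, which is described in the Newscaster voices topic rather than on this page; the function name and the choice of MP3 output are illustrative.

```python
from xml.sax.saxutils import escape

# Voices marked with * in the table of neural voices.
NEWSCASTER_VOICES = {"Amy", "Joanna", "Lupe", "Matthew"}


def newscaster_request(text, voice_id):
    """Build SynthesizeSpeech parameters for the Newscaster speaking style."""
    if voice_id not in NEWSCASTER_VOICES:
        raise ValueError(f"{voice_id} does not support the Newscaster style")
    # Escape &, <, and > so that plain text is valid inside SSML.
    ssml = (
        '<speak><amazon:domain name="news">'
        + escape(text)
        + "</amazon:domain></speak>"
    )
    return {
        "Engine": "neural",
        "VoiceId": voice_id,
        "TextType": "ssml",
        "Text": ssml,
        "OutputFormat": "mp3",
    }
```

The returned dictionary can be passed to a boto3 Polly client as `polly.synthesize_speech(**newscaster_request("Top stories this hour.", "Joanna"))`.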

As with standard voices, you can choose from various sampling rates to optimize the bandwidth and audio quality for your application. Valid sampling rates for standard and neural voices are 8 kHz, 16 kHz, 22 kHz, or 24 kHz. The default for standard voices is 22 kHz. The default for neural voices is 24 kHz. Amazon Polly supports MP3, OGG (Vorbis), and raw PCM audio stream formats.
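The sampling-rate defaults in the preceding paragraph can be captured in a small helper. Two details below come from the SynthesizeSpeech API reference as best understood, not from this page, and should be treated as assumptions: the API takes rates as strings in Hz (with "22050" for 22 kHz), and PCM output accepts only "8000" and "16000", defaulting to "16000". The function name is hypothetical.

```python
# Assumed API values; verify against the SynthesizeSpeech reference.
VALID_RATES = {
    "mp3": {"8000", "16000", "22050", "24000"},
    "ogg_vorbis": {"8000", "16000", "22050", "24000"},
    "pcm": {"8000", "16000"},
}
DEFAULT_RATE = {"standard": "22050", "neural": "24000"}


def resolve_sample_rate(engine, output_format, sample_rate=None):
    """Return the SampleRate string to send, applying the documented defaults."""
    if sample_rate is None:
        # PCM has its own default; otherwise the default depends on the engine.
        return "16000" if output_format == "pcm" else DEFAULT_RATE[engine]
    if sample_rate not in VALID_RATES[output_format]:
        raise ValueError(f"{sample_rate} Hz is not valid for {output_format}")
    return sample_rate
```

Lower rates reduce bandwidth at the cost of audio quality, so a telephony application might pass "8000" explicitly while a media application keeps the neural default.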

Using the Neural engine on the console

You can access Amazon Polly Neural voices through the Amazon Polly console or AWS CLI.

To use the neural engine on the console
  1. Open the Amazon Polly console at https://console.aws.amazon.com/polly/.

  2. From the console, choose the Neural engine.

  3. Choose the desired voice from the voice dropdown menu.

  4. Generate TTS audio with text of your choice.
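The same steps can be carried out programmatically. The following is a minimal sketch using the AWS SDK for Python (boto3), assuming the SDK is installed and credentials are configured; the function names, the default Region, and the MP3 output format are illustrative choices, not requirements.

```python
def synthesize_request(text, voice_id):
    """Mirror the console steps: neural engine, chosen voice, chosen text."""
    return {
        "Engine": "neural",      # step 2: choose the Neural engine
        "VoiceId": voice_id,     # step 3: choose the voice
        "Text": text,            # step 4: the text to synthesize
        "OutputFormat": "mp3",
    }


def synthesize_to_file(text, voice_id, path, region_name="us-east-1"):
    """Send the request to Amazon Polly and save the returned audio stream."""
    import boto3  # imported lazily so the request builder works without the SDK

    polly = boto3.client("polly", region_name=region_name)
    response = polly.synthesize_speech(**synthesize_request(text, voice_id))
    with open(path, "wb") as audio_file:
        audio_file.write(response["AudioStream"].read())
    return path
```

Calling `synthesize_to_file("Hello from the neural engine.", "Joanna", "hello.mp3")` would write the audio to `hello.mp3` in the current directory.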