Creating a custom vocabulary using a table - Amazon Transcribe

Using a table format is the most robust way to create your custom vocabulary. Vocabulary tables consist of four columns, and these columns can be in any order:

Phrase

Required. Every row in your table must contain an entry in this column.

If your entry contains multiple words, separate each word with a hyphen (-); do not use spaces. For example, Andorra-la-Vella or Los-Angeles.

For acronyms, any pronounced letters must be separated by a period. If your acronym is plural, you must use a hyphen between the acronym and the 's'. For example, 'CLI' is C.L.I. and 'ABCs' is A.B.C.-s.

If your phrase consists of both a word and an acronym, these two components must be separated by a hyphen. For example, 'DynamoDB' is Dynamo-D.B..

Do not use spaces in this column.

SoundsLike

Optional. Rows in this column can be left empty.

Only add content to this column if your entry includes a non-standard word, such as a brand name, or to correct a word that is being incorrectly transcribed.

Break your entry down into hyphen-separated syllables that mimic how the word sounds. For example, Andorra-la-Vella sounds like ann-do-rah-la-bay-ah and Los-Angeles sounds like loss-ann-gel-es. Small, common words work better than phonetic representations: for 'Los Angeles', 'loss-ann-gel-es' is preferable to 'lahs-ahn-jul-ees'.

Do not use spaces in this column.

If you have an entry in this column, your IPA column must be empty.

IPA

Optional. Rows in this column can be left empty.

This column is intended for phonetic spellings using only characters in the International Phonetic Alphabet (IPA). For example, 'Los Angeles' is l ɔ s æ n ʤ ə l ə s and 'CLI' is s ɪ ɛ l aɪ.

You must add a single space between every IPA character (single-byte) or valid IPA character pair (double-byte).

If you have an entry in this column, your SoundsLike column must be empty.

DisplayAs

Optional. Rows in this column can be left empty.

Defines how you want your entry to look in your transcription output. For example, Andorra-la-Vella in the Phrase column is Andorra la Vella in the DisplayAs column.

If a row in this column is empty, Amazon Transcribe uses the contents of the Phrase column to determine output.

You can use spaces in this column.
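The per-column rules above can be checked mechanically before a table is uploaded. The following is a minimal sketch, not part of Amazon Transcribe: the function name validate_row and its messages are hypothetical, and it only covers the rules stated in this section (it does not check the language's character set or IPA validity).

```python
def validate_row(phrase, sounds_like="", ipa="", display_as=""):
    """Return a list of problems found in one vocabulary table row."""
    problems = []
    if not phrase:
        problems.append("Phrase is required")
    if " " in phrase:
        problems.append("Phrase must not contain spaces")
    if " " in sounds_like:
        problems.append("SoundsLike must not contain spaces")
    if sounds_like and ipa:
        problems.append("SoundsLike and IPA cannot both be set")
    # DisplayAs may contain spaces, so it needs no space check here.
    return problems


print(validate_row("Los-Angeles", display_as="Los Angeles"))
print(validate_row("Los Angeles"))
```

An empty list means the row passed these particular checks; it does not guarantee that Amazon Transcribe accepts the vocabulary.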

Things to note when creating your table:

  • Your table must contain all four columns (Phrase, SoundsLike, IPA, and DisplayAs), but the Phrase column is the only one that must contain an entry on each row. All other columns can be left empty.

  • In a given row, you cannot have entries for both IPA and SoundsLike fields. You must choose one or the other, or leave both blank.

  • You can only use characters that are supported for your language. Refer to your language's character set for details.

  • Only use spaces within the IPA and DisplayAs columns. Separate columns with a TAB character. Do not use spaces to separate columns; doing so results in an error.

  • You must save your table as a text (*.txt) file and upload it to an Amazon S3 bucket before you can convert it into a usable vocabulary.

  • Your text file must be in LF format. If you use any other format, such as CRLF, your custom vocabulary is not accepted by Amazon Transcribe.
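The notes above (four columns, TAB separators, LF line endings) can be satisfied by generating the file from a script rather than a text editor. This is a sketch under assumptions: the helper write_vocabulary_table and the temporary output path are hypothetical, and writing the file as UTF-8 with a trailing newline is an assumption rather than something this section states.

```python
import os
import tempfile

COLUMNS = ["Phrase", "SoundsLike", "IPA", "DisplayAs"]


def write_vocabulary_table(rows, path):
    """Write rows (dicts keyed by column name) as a TAB-separated, LF-terminated text file."""
    lines = ["\t".join(COLUMNS)]
    for row in rows:
        # Missing columns become empty fields so every line keeps four columns.
        lines.append("\t".join(row.get(col, "") for col in COLUMNS))
    # newline="\n" stops Python from translating LF into CRLF on Windows.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write("\n".join(lines) + "\n")


rows = [
    {"Phrase": "Los-Angeles", "IPA": "l ɔ s æ n ʤ ə l ə s", "DisplayAs": "Los Angeles"},
    {"Phrase": "Dynamo-D.B.", "DisplayAs": "DynamoDB"},
]
path = os.path.join(tempfile.gettempdir(), "my-vocabulary-table.txt")
write_vocabulary_table(rows, path)
```

Passing newline="\n" to open() is the important design choice here: without it, Python on Windows would write CRLF line endings, which this section says are not accepted.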

Note

Enter acronyms, or other words whose letters should be pronounced individually, as single letters separated by periods (A.B.C.). To enter the plural form of an acronym, such as 'ABCs', separate the 's' from the acronym with a hyphen (A.B.C.-s). You can use upper or lower case letters to define an acronym. Acronyms are not supported in all languages; refer to Supported languages and language-specific features.
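The acronym convention in this note is simple enough to express as a small helper. This is an illustrative sketch only; acronym_to_phrase is a hypothetical name, and it assumes every character passed in is a letter that is pronounced individually.

```python
def acronym_to_phrase(acronym, plural=False):
    """Format pronounced letters as 'A.B.C.' and a plural as 'A.B.C.-s'."""
    letters = "".join(letter + "." for letter in acronym)
    return letters + "-s" if plural else letters


print(acronym_to_phrase("CLI"))
print(acronym_to_phrase("ABC", plural=True))
# A word combined with an acronym is joined with a hyphen.
print("Dynamo-" + acronym_to_phrase("DB"))
```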

Here is a sample custom vocabulary table (where [TAB] represents a tab character):

Phrase[TAB]SoundsLike[TAB]IPA[TAB]DisplayAs
Los-Angeles[TAB][TAB]l ɔ s æ n ʤ ə l ə s[TAB]Los Angeles
Eva-Maria[TAB]ay-va-ma-ree-ah[TAB][TAB]
A.B.C.-s[TAB]ay-bee-sees[TAB][TAB]ABCs
Amazon-dot-com[TAB][TAB][TAB]Amazon.com
C.L.I.[TAB][TAB]s ɪ ɛ l aɪ[TAB]CLI
Andorra-la-Vella[TAB]ann-do-rah-la-bay-ah[TAB][TAB]Andorra la Vella
Dynamo-D.B.[TAB][TAB][TAB]DynamoDB

Here is the same table with aligned columns for visual clarity. Do not add spaces between columns in your vocabulary table; your table should look misaligned like the preceding example.

Phrase          [TAB]SoundsLike          [TAB]IPA                [TAB]DisplayAs
Los-Angeles     [TAB]                    [TAB]l ɔ s æ n ʤ ə l ə s[TAB]Los Angeles
Eva-Maria       [TAB]ay-va-ma-ree-ah     [TAB]                   [TAB]
A.B.C.-s        [TAB]ay-bee-sees         [TAB]                   [TAB]ABCs
Amazon-dot-com  [TAB]                    [TAB]                   [TAB]Amazon.com
C.L.I.          [TAB]                    [TAB]s ɪ ɛ l aɪ         [TAB]CLI
Andorra-la-Vella[TAB]ann-do-rah-la-bay-ah[TAB]                   [TAB]Andorra la Vella
Dynamo-D.B.     [TAB]                    [TAB]                   [TAB]DynamoDB
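To see how the four columns line up in a real file, a table can be read back with Python's csv module. This sketch embeds three rows of the sample as a string with actual tab characters; the fallback from DisplayAs to Phrase is only a rough stand-in for the behavior described earlier, not a reproduction of what Amazon Transcribe outputs.

```python
import csv
import io

sample = (
    "Phrase\tSoundsLike\tIPA\tDisplayAs\n"
    "Los-Angeles\t\tl ɔ s æ n ʤ ə l ə s\tLos Angeles\n"
    "Andorra-la-Vella\tann-do-rah-la-bay-ah\t\tAndorra la Vella\n"
    "Dynamo-D.B.\t\t\tDynamoDB\n"
)

# DictReader keys each field by the header row, so column order does not matter.
reader = csv.DictReader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)

for row in rows:
    # Fall back to Phrase when DisplayAs is empty (an approximation for illustration).
    shown = row["DisplayAs"] or row["Phrase"]
    print(row["Phrase"], "->", shown)
```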

Creating vocabulary tables

To process a custom vocabulary table for use with Amazon Transcribe, see the following examples:

AWS Management Console

Before continuing, save your vocabulary as a text (*.txt) file, then upload it into an Amazon S3 bucket.

  1. Sign in to the AWS Management Console.

  2. In the navigation pane, choose Custom vocabulary. This opens the Custom vocabulary page where you can view existing vocabularies or create a new one.

  3. Select the Create vocabulary button.

    This takes you to the Create vocabulary page. Enter a name for your new vocabulary.

    Select the S3 location option under Vocabulary input source. Then, either manually enter the Amazon S3 path or select Browse S3 to locate your vocabulary.

  4. Optionally, add tags to your vocabulary. Once you have all fields completed, click the Create vocabulary button at the bottom of the page. This takes you back to the Custom vocabulary page where you can view the status of your custom vocabulary. When the status changes from 'Pending' to 'Ready' your vocabulary can be used with a transcription.

  5. If the status changes to 'Failed', click on the name of your vocabulary to go to its information page.

    There is a Failure reason banner at the top of this page that provides information on why your vocabulary failed. Correct the error in your text file and try again.
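The upload that must happen before the console procedure can also be scripted. The sketch below assumes a Boto3 S3 client is passed in; upload_vocabulary is a hypothetical helper name, and the bucket and key in the usage note are the placeholder values used elsewhere on this page.

```python
def upload_vocabulary(s3_client, local_path, bucket, key):
    """Upload the table file and return the S3 URI to use as the vocabulary source."""
    # upload_file(Filename, Bucket, Key) is the Boto3 S3 client transfer method.
    s3_client.upload_file(local_path, bucket, key)
    return "s3://{}/{}".format(bucket, key)
```

With Boto3 installed and credentials configured, this might be called as upload_vocabulary(boto3.client('s3'), 'my-vocabulary-table.txt', 'DOC-EXAMPLE-BUCKET', 'my-vocabularies/my-vocabulary-table.txt'). The returned URI is the value to enter as the S3 location.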

AWS CLI

This example uses the create-vocabulary command with a table-formatted vocabulary file. For more information, see CreateVocabulary.

To use an existing custom vocabulary in a transcription job, set the VocabularyName in the Settings field when you call the StartTranscriptionJob operation or, from the AWS Management Console, choose the vocabulary from the drop-down list.

aws transcribe create-vocabulary \
    --vocabulary-name my-first-vocabulary \
    --vocabulary-file-uri s3://DOC-EXAMPLE-BUCKET/my-vocabularies/my-vocabulary-file.txt \
    --language-code en-US

Here's another example using the create-vocabulary command, and a request body that creates your vocabulary.

aws transcribe create-vocabulary \
    --cli-input-json file://filepath/my-first-vocab-table.json

The file my-first-vocab-table.json contains the following request body.

{
  "VocabularyName": "my-first-vocabulary",
  "VocabularyFileUri": "s3://DOC-EXAMPLE-BUCKET/my-vocabularies/my-vocabulary-table.txt",
  "LanguageCode": "en-US"
}

Once VocabularyState changes from PENDING to READY, your vocabulary is ready to use with a transcription. To view the current status of your vocabulary, run:

aws transcribe get-vocabulary \
    --vocabulary-name my-first-vocabulary
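The same status check can be done from Python, which makes it easier to surface the failure reason when processing does not succeed. This is a sketch that assumes a Boto3 Transcribe client is passed in; describe_vocabulary_state is a hypothetical helper, and it relies on the VocabularyState and FailureReason fields of the GetVocabulary response.

```python
def describe_vocabulary_state(transcribe_client, vocabulary_name):
    """Return a one-line summary of the vocabulary's processing state."""
    response = transcribe_client.get_vocabulary(VocabularyName=vocabulary_name)
    state = response["VocabularyState"]
    if state == "FAILED":
        # FailureReason is only expected when processing failed.
        reason = response.get("FailureReason", "no reason given")
        return "{}: FAILED ({})".format(vocabulary_name, reason)
    return "{}: {}".format(vocabulary_name, state)
```

With Boto3 installed, this might be called as describe_vocabulary_state(boto3.client('transcribe'), 'my-first-vocabulary').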

AWS SDK for Python (Boto3)

This example uses the AWS SDK for Python (Boto3) to create a custom vocabulary from a table using the create_vocabulary method. For more information, see CreateVocabulary.

To use an existing custom vocabulary in a transcription job, set the VocabularyName in the Settings field when you call the StartTranscriptionJob operation or, from the AWS Management Console, choose the vocabulary from the drop-down list.

For additional examples using the AWS SDKs, including feature-specific, scenario, and cross-service examples, refer to the Code examples for Amazon Transcribe using AWS SDKs chapter.

import time
import boto3

# Create a client for the Region where the vocabulary should live.
transcribe = boto3.client('transcribe', 'us-west-2')
vocab_name = "my-first-vocabulary"

# Start building the custom vocabulary from the table stored in Amazon S3.
response = transcribe.create_vocabulary(
    LanguageCode = 'en-US',
    VocabularyName = vocab_name,
    VocabularyFileUri = 's3://DOC-EXAMPLE-BUCKET/my-vocabularies/my-vocabulary-table.txt'
)

# Poll until processing ends in either READY or FAILED.
while True:
    status = transcribe.get_vocabulary(VocabularyName = vocab_name)
    if status['VocabularyState'] in ['READY', 'FAILED']:
        break
    print("Not ready yet...")
    time.sleep(5)
print(status)
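Once the vocabulary is ready, it is applied by setting VocabularyName in the Settings field of StartTranscriptionJob, as described above. The following sketch shows one way that call might look; start_job_with_vocabulary is a hypothetical wrapper, it assumes a Boto3 Transcribe client is passed in, and the job name and media URI in the usage note are made-up placeholders.

```python
def start_job_with_vocabulary(transcribe_client, job_name, media_uri,
                              vocabulary_name, language_code="en-US"):
    """Start a transcription job that applies an existing custom vocabulary."""
    return transcribe_client.start_transcription_job(
        TranscriptionJobName=job_name,
        LanguageCode=language_code,
        Media={"MediaFileUri": media_uri},
        # The custom vocabulary is selected through Settings.VocabularyName.
        Settings={"VocabularyName": vocabulary_name},
    )
```

A call might look like start_job_with_vocabulary(boto3.client('transcribe', 'us-west-2'), 'my-first-transcription-job', 's3://DOC-EXAMPLE-BUCKET/my-input-files/my-media-file.flac', 'my-first-vocabulary'). The language code of the job should match the language code of the vocabulary.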