Amazon Transcribe
Developer Guide

Amazon Transcribe Streaming Format

Amazon Transcribe uses a format called event stream encoding for streaming transcription. This format encoded binary data with header information that describes the contents of each event. You can use this information for applications that call the Amazon Transcribe endpoint without using the Amazon Transcribe SDK.

Amazon Transcribe uses the HTTP/2 protocol for streaming transcriptions. The key components for a streaming request are:

  • A header frame. This contains the HTTP headers for the request, and a signature in the authorization header that Amazon Transcribe uses as a seed signature to sign the following data frames.

  • One or more message frames in event stream encoding. The frame contains metadata and the raw audio bytes.

  • An end frame. This is a signed message in event stream encoding with an empty body.

Event Stream Encoding

Event stream encoding provides bidirectional communication using messages between a client and a server. Data frames sent to the Amazon Transcribe streaming service are encoded in this format., The response from Amazon Transcribe also uses this encoding.

Each message consists of two sections: the prelude and the data. The prelude consists of:

  1. The total byte length of the message

  2. The combined byte length of all of the headers

The data section consists of:

  1. The headers

  2. A payload

Each section ends with a 4-byte big-endian integer CRC checksum. Amazon Transcribe uses CRC32 (often referred to as GZIP CRC32) to calculate both CRCs. For more information about CRC32, see GZIP file format specification version 4.3.

Total message overhead, including the prelude and both checksums, is 16 bytes.

The following diagram shows the components that make up a message and a header. There are multiple headers per message.

Each message contains the following components:

  • Prelude: Always a fixed size of 8 bytes, two fields of 4 bytes each.

    • First 4 bytes: The total byte-length. This is the big-endian integer byte-length of the entire message, including the 4-byte length field itself.

    • Second 4 bytes: The headers byte-length. This is the big-endian integer byte-length of the headers portion of the message, excluding the headers length field itself.

  • Prelude CRC: The 4-byte CRC checksum for the prelude portion of the message, excluding the CRC itself. The prelude has a separate CRC from the message CRC to ensure that Amazon Transcribe can detect corrupted byte-length information immediately without causing errors such as buffer overruns.

  • Headers: Metadata annotating the message, such as the message type, content type, and so on. Messages have multiple headers. Headers are key-value pairs where the key is a UTF-8 string. Headers can appear in any order in the headers portion of the message and any given header can appear only once. For the required header types, see the following sections.

  • Payload:The audio content to be transcribed.

  • Message CRC: The 4-byte CRC checksum from the start of the message to the start of the checksum. That is, everything in the message except the CRC itself.

Each header contains the following components. There are multiple headers per frame.

  • Header name byte-length: The byte-length of the header name.

  • Header name: The name of the header indicating the header type. For valid values, see the following frame descriptions.

  • Header value type: An enumeration indicating the header value type.

  • Value string byte length: The byte-length of the header value string.

  • Header value: The value of the header string. Valid values for this field depend on the type of header. For valid values, see the following frame descriptions.

Streaming Request

To make a streaming request, you use the StartStreamTranscription operation.

Header Frame

The header frame is the authorization frame for the streaming transcription. Amazon Transcribe uses the value of the authorization header as the seed for generating a chain of authorization headers for the data frames in the request.

Required Headers

The header frame of a request to Amazon Transcribe requires the following HTTP/2 headers:

POST /stream-transcription HTTP/2.0 host: transcribestreaming.region.amazonaws.com authorization: Generated value content-type: application/vnd.amazon.eventstream x-amz-target: com.amazonaws.transcribe.Transcribe.StartStreamTranscription x-amz-content-sha256: STREAMING-AWS4-HMAC-SHA256-EVENTS x-amz-date: Date x-amz-transcribe-language-code: en-US x-amz-transcribe-media-encoding: pcm x-amz-transcribe-sample-rate: Sample rate transfer-encoding: chunked

In the request, use the following values for the host, authorization, and x-amz-date headers:

For more information about the headers specific to Amazon Transcribe, see the StartStreamTranscription operation.

Data Frames

Each request contains one or more data frames. The data frames use event stream encoding. The encoding supports bidirectional data transmission between a client and a server.

There are two steps to creating a data frame:

  1. Combine the raw audio data with metadata to create the payload of the request.

  2. Combine the payload with a signature to form the event message that is sent to Amazon Transcribe.

The following diagram shows how this works.

Create the Audio Event

To create the message to send to Amazon Transcribe, create the audio event. Combine the headers described in the following table with a chunk of audio bytes into an event-encoded message.

To create the payload for the event message, use a buffer in raw-byte format.

Create the Message

Create a data frame using the audio event payload to send to Amazon Transcribe. The data frame contains event-encoding headers that include the current date and a signature for the audio chunk and the audio event. To indicate to Amazon Transcribe that the audio stream is complete, send an empty data frame that contains only the date and signature.

To create the signature for the data frame, first create a string to sign, and then calculate the signature for the event. Construct the string to sign as follows:

String stringToSign = "AWS4-HMAC-SHA256-PAYLOAD" + "\n" + DATE + "\n" + KEYPATH + "\n" + Hex(priorSignature) + "\n" + HexHash(nonSignatureHeaders) + "\n" + HexHash(payload);
  • DATE: The current date and time in Universal Time Coordinated (UTC) and using the ISO 8601 format. Don't include milliseconds in the date. For example, 20190127T223754Z is 22:37:54 on 1/27/2019.

  • KEYPATH: The signature scope in the format date/region/service/aws4_request. For example, 20190127/us-east-1/transcribe/aws4_request.

  • priorSignature: The signature for the previous frame. For the first data frame, use the signature of the header frame.

  • nonSignatureHeaders: The DATE header encoded as a string.

  • payload: The byte buffer containing the audio event data.

  • Hex: A function that encodes its input into a hexadecimal representation.

  • HexHash: A function that first creates a SHA-256 hash of its input and then uses the Hex function to encode the hash.

After you have constructed the string to sign, sign it using the key that you derived for Signature Version 4, as follows. For details, see Examples of How to Derive a Signing Key for Signature Version 4 in the Amazon Web Services General Reference.

String signature = HMACSHA256(derivedSigningKey, stringToSign);
  • HMACSHA256: A function that creates a signature using the SHA-256 hash function.

  • derivedSigningKey: The Signature Version 4 signing key.

  • stringToSign: The string that you calculated for the data frame.

After you have calculated the signature for the data frame, construct a byte buffer containing the date, the signature, and the audio event payload. Send the byte array to Amazon Transcribe for transcription.

End Frame

To indicate that the audio stream is complete, send an end frame to Amazon Transcribe. The end frame is a data frame with an empty payload. You construct the end frame the same way that you construct a data frame.

Streaming Response

Responses from Amazon Transcribe are also sent using event stream encoding. Use this information to decode a response from the StartStreamTranscription operation.

Transcription Response

A transcription response is event encoded. It contains the standard prelude and the following headers:

For details, see Event Stream Encoding.

When the response is decoded, it contains the following information:

:content-type: "application/json" :event-type: "TranscriptEvent" :message-type: "event" JSON transcription information

For an example of the JSON structure returned by Amazon Transcribe, see Using Amazon Transcribe Streaming.

Exception Response

If there is an error in processing your transcription stream, Amazon Transcribe sends an exception response. The response is event encoded. For details, see Event Stream Encoding.

The response contains the standard prelude and the following headers:

When the exception response is decoded, it contains the following information:

:content-type: "application/json" :event-type: "BadRequestException" :message-type: "exception" Exception message

Example Request and Response

The following is an end-to-end example of a streaming transcription request. In this example, binary data is represented as base64-encoded strings. In an actual response, the data are raw bytes.

Step 1 - Start the Session With Amazon Transcribe

To start the session, send an HTTP/2 request to Amazon Transcribe:

POST /stream-transcription HTTP/2.0 host: transcribestreaming.region.amazonaws.com authorization: Generated value content-type: application/vnd.amazon.eventstream x-amz-content-sha256: STREAMING-AWS4-HMAC-SHA256-EVENTS x-amz-date: Date x-amzn-transcribe-language-code: en-US x-amzn-transcribe-media-encoding: pcm x-amzn-transcribe-sample-rate: Sample rate transfer-encoding: chunked

Step 2 - Send Authentication Information to Amazon Transcribe

Amazon Transcribe sends the following response:

HTTP/2.0 200 x-amzn-transcribe-language-code: en-US x-amzn-transcribe-sample-rate: Sample rate x-amzn-request-id: 8a08df7d-5998-48bf-a303-484355b4ab4e x-amzn-transcribe-session-id: b4526fcf-5eee-4361-8192-d1cb9e9d6887 x-amzn-transcribe-media-encoding: pcm x-amzn-RequestId: 8a08df7d-5998-48bf-a303-484355b4ab4e content-type: application/vnd.amazon.eventstream

Step 3 - Create an Audio Event

Create an audio event containing the audio data to send. For details, see Event Stream Encoding. The binary data in this request is base64-encoded. In an actual request, the data is raw bytes.

:content-type: "application/octet-stream" :event-type: "AudioEvent" :message-type: "event" UklGRjzxPQBXQVZFZm10IBAAAAABAAEAgD4AAAB9AAACABAAZGF0YVTwPQAAAAAAAAAAAAAAAAD//wIA/f8EAA==

Step 4 - Create an Audio Event Message

Create an audio message that contains the audio data to send to Amazon Transcribe. For details, see Event Stream Encoding. The audio event data in this example is base64-encoded. In an actual request, the data is raw bytes.

:date: 2019-01-29T01:56:17.291Z :chunk-signature: signature AAAA0gAAAIKVoRFcTTcjb250ZW50LXR5cGUHABhhcHBsaWNhdGlvbi9vY3RldC1zdHJlYW0LOmV2ZW50LXR5 cGUHAApBdWRpb0V2ZW50DTptZXNzYWdlLXR5cGUHAAVldmVudAxDb256ZW50LVR5cGUHABphcHBsaWNhdGlv bi94LWFtei1qc29uLTEuMVJJRkY88T0AV0FWRWZtdCAQAAAAAQABAIA+AAAAfQAAAgAQAGRhdGFU8D0AAAAA AAAAAAAAAAAA//8CAP3/BAC7QLFf

Step 5 - Use the Response from Amazon Transcribe

Amazon Transcribe creates a stream of transcription events that it sends to your application. The events are sent in raw-byte format. In this example, the bytes are base64-encoded.

The response from Amazon Transcribe is:

AAAAUwAAAEP1RHpYBTpkYXRlCAAAAWiXUkMLEDpjaHVuay1zaWduYXR1cmUGACCt6Zy+uymwEK2SrLp/zVBI 5eGn83jdBwCaRUBJA+eaDafqjqI=

To see the transcription results, decode the raw bytes using event-stream encoding:

:event-type: "TranscriptEvent" :content-type: "application/json" :message-type: "event" {"Transcript":{"Results":[results]}}

For an example of the JSON structure returned by Amazon Transcribe, see Using Amazon Transcribe Streaming.

Step 6 - End the Transcription Stream

Finally, send an empty audio event to Amazon Transcribe to end the transcription stream. Create the audio event exactly like any other, except with an empty payload. Sign the event and include the signature in the :chunk-signature header, as follows:

:date: 2019-01-29T01:56:17.291Z :chunk-signature: signature