Amazon SageMaker
Developer Guide

Sequence2Sequence

Amazon SageMaker seq2seq is a supervised learning algorithm where the input is a sequence of tokens (for example, text, audio) and the output generated is another sequence of tokens. Example applications include: machine translation (input a sentence from one language and predict what that sentence would be in another language), text summarization (input a longer string of words and predict a shorter string of words that is a summary), and speech-to-text (audio clips transcribed into output sequences of tokens). Recently, problems in this domain have been successfully modeled with deep neural networks that show a significant performance boost over previous methodologies. Amazon SageMaker seq2seq is based on the Sockeye package, which uses Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN) models with attention as encoder-decoder architectures.

Note

Although Amazon SageMaker seq2seq is based on the Sockeye package, it uses a different input data format and renames some hyperparameters to work more effectively in Amazon SageMaker.

Input/Output Interface

Training

Although the Amazon SageMaker seq2seq algorithm relies on the Sockeye package, there are certain notable differences.

  • It expects data in recordio-protobuf format similar to other Amazon SageMaker algorithms, whereas Sockeye expects it in a tokenized text format.

  • It renames certain hyperparameters to work more effectively in Amazon SageMaker.

  • It supports a subset of training and inference options that Sockeye currently offers.

A script to convert data from tokenized text files to the protobuf format is included in the seq2seq example notebook. In general, it packs the data into 32-bit integer tensors and generates the necessary vocabulary files, which are needed for metric calculation and inference.
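
The following is a minimal sketch of this packing step, not the bundled script itself. It assumes the Record protobuf message from the SageMaker Python SDK and MXNet's recordio writer, and the "source"/"target" feature names are illustrative; consult the script in the example notebook for the authoritative implementation.

# Illustrative sketch (assumptions noted above): pack one tokenized sentence pair
# into a protobuf Record of 32-bit integer tensors and append it to a recordio file.
import mxnet as mx
from sagemaker.amazon.record_pb2 import Record

def write_pair(writer, source_ids, target_ids):
    record = Record()
    record.features["source"].int32_tensor.values.extend(source_ids)
    record.features["target"].int32_tensor.values.extend(target_ids)
    writer.write(record.SerializeToString())

# Token IDs come from looking up tokens in vocab.src.json / vocab.trg.json.
writer = mx.recordio.MXRecordIO("train.rec", "w")
write_pair(writer, source_ids=[4, 17, 9, 2], target_ids=[5, 21, 8, 2])
writer.close()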

After preprocessing is done, the algorithm can be invoked for training. The algorithm expects three channels:

  • train: It should contain the training data (for example, the train.rec file generated by the preprocessing script).

  • validation: It should contain the validation data (for example, the val.rec file generated by the preprocessing script).

  • vocab: It should contain two vocabulary files (vocab.src.json and vocab.trg.json).

If the algorithm doesn't find data in any of these three channels, training results in an error.
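
The following is a minimal sketch of launching a training job with these three channels using the SageMaker Python SDK. The S3 paths, IAM role, and instance settings are placeholders, and retrieving the container with image_uris.retrieve("seq2seq", ...) is an assumption to verify against your SDK version.

# Hedged sketch: configure the train, validation, and vocab channels and start training.
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
container = image_uris.retrieve("seq2seq", session.boto_region_name)  # assumed algorithm identifier

estimator = Estimator(
    image_uri=container,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    instance_count=1,                    # seq2seq trains on a single machine
    instance_type="ml.p3.8xlarge",       # placeholder GPU instance
    output_path="s3://my-bucket/seq2seq/output",  # placeholder bucket
    sagemaker_session=session,
)

estimator.fit({
    "train": TrainingInput("s3://my-bucket/seq2seq/train"),            # contains train.rec
    "validation": TrainingInput("s3://my-bucket/seq2seq/validation"),  # contains val.rec
    "vocab": TrainingInput("s3://my-bucket/seq2seq/vocab"),            # vocab.src.json, vocab.trg.json
})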

Inference

For hosted endpoints, inference supports two data formats. To perform inference using space-separated text tokens, use the application/json format. Otherwise, use the recordio-protobuf format to work with the integer-encoded data. Both modes support batching of input data. The application/json format also allows you to visualize the attention matrix.

  • application/json: Expects the input in JSON format and returns the output in JSON format. Both content and accept types should be application/json. Each sequence is expected to be a string with whitespace separated tokens. This format is recommended when the number of source sequences in the batch is small. It also supports the following additional configuration options:

    configuration: {attention_matrix: true}: Returns the attention matrix for the particular input sequence (see the invocation sketch after this list).

  • application/x-recordio-protobuf: Expects the input in recordio-protobuf format and returns the output in recordio-protobuf format. Both content and accept types should be application/x-recordio-protobuf. For this format, the source sequences must be converted into a list of integers for subsequent protobuf encoding. This format is recommended for bulk inference.
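
The following is a hedged sketch of invoking a hosted endpoint with the application/json format. The endpoint name is a placeholder, and the request envelope ({"instances": [...], "configuration": {...}}) follows the seq2seq example notebook rather than this guide, so verify it against the notebook you use.

# Hedged sketch: call a deployed seq2seq endpoint with whitespace-tokenized text.
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

payload = {
    "instances": [
        {"data": "hello how are you"},   # whitespace-separated source tokens
        {"data": "good morning"},
    ],
    "configuration": {"attention_matrix": True},  # optional: return the attention matrix
}

response = runtime.invoke_endpoint(
    EndpointName="my-seq2seq-endpoint",   # placeholder endpoint name
    ContentType="application/json",
    Accept="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read()))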

For batch transform, inference supports JSON Lines format. Batch transform expects the input in JSON Lines format and returns the output in JSON Lines format. Both content and accept types should be application/jsonlines. The format for input is as follows:

content-type: application/jsonlines

{"source": "source_sequence_0"}
{"source": "source_sequence_1"}

The format for response is as follows:

accept: application/jsonlines

{"target": "predicted_sequence_0"}
{"target": "predicted_sequence_1"}
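
The following is a minimal sketch of running a batch transform job against input in this JSON Lines format. The model name, S3 paths, and instance settings are placeholders.

# Hedged sketch: batch transform over a JSON Lines input file, one {"source": ...} per line.
from sagemaker.transformer import Transformer

transformer = Transformer(
    model_name="my-seq2seq-model",    # placeholder: model created from the training job
    instance_count=1,
    instance_type="ml.p3.2xlarge",    # placeholder GPU instance
    output_path="s3://my-bucket/seq2seq/batch-output",
    accept="application/jsonlines",
)

transformer.transform(
    "s3://my-bucket/seq2seq/batch-input/input.jsonl",
    content_type="application/jsonlines",
    split_type="Line",                # split the input file by line
)
transformer.wait()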

Refer to the example notebook for additional details on how to serialize and deserialize the inputs and outputs into these formats for inference.

EC2 Instance Recommendation

Currently, Amazon SageMaker seq2seq is set up to train only on a single machine, but it does offer support for multiple GPUs.