Asynchronous Batch Processing

To translate large collections of documents (up to 5 GB in size), use the Amazon Translate asynchronous batch processing operation, StartTextTranslationJob. This is best for collections of short documents, such as social media postings or user reviews, or any situation in which instantaneous translation is not required.

To perform an asynchronous batch translation, you typically perform the following steps:

  1. Store a set of documents in an input folder inside of an Amazon S3 bucket.

  2. Start a batch translation job.

  3. As part of your request, provide Amazon Translate with an IAM role that has read access to the input Amazon S3 folder. The role must also have read and write access to an output Amazon S3 bucket.

  4. Monitor the progress of the batch translation job.

  5. Retrieve the results of the batch translation job from the specified output bucket.

Region Availability

Batch translation is supported in the following AWS Regions:

  • US East (N. Virginia)

  • US East (Ohio)

  • US West (Oregon)

  • Asia Pacific (Seoul)

  • Europe (Frankfurt)

  • Europe (Ireland)

  • Europe (London)

Prerequisites

The following prerequisites must be met in order for Amazon Translate to perform a successful batch translation job:

  • The Amazon S3 buckets that contain your input and output documents must be in the same AWS Region as the API endpoint you are calling.

  • Documents must be UTF-8 formatted .txt, .html, .docx, .pptx or .xlsx files.

  • The collection of batch input documents must be 5 GB or less in size.

  • A batch translation job can contain at most one million documents.

  • Each input document must be 20 MB or less and must contain fewer than 1 million characters.

  • Your input files must be in a folder in an Amazon S3 bucket. If your files are not in a folder, and they reside at the top level of a bucket, Amazon Translate throws an error when you attempt to run a batch translation job. This requirement applies only to input files. No folder is necessary for the output files, and Amazon Translate can place them at the top level of an Amazon S3 bucket.
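The size, count, file type, and folder prerequisites above can be checked locally before a job is submitted. The following is a minimal sketch, not part of Amazon Translate: the function names are hypothetical, the limits are assumed to use binary units (1 GB = 1024^3 bytes), and the per-document character limit is not checked.

```python
from urllib.parse import urlparse

MAX_BATCH_BYTES = 5 * 1024**3        # 5 GB per batch (binary units are an assumption)
MAX_DOCUMENTS = 1_000_000            # at most one million documents per job
MAX_DOCUMENT_BYTES = 20 * 1024**2    # 20 MB per document (binary units are an assumption)
ALLOWED_EXTENSIONS = (".txt", ".html", ".docx", ".pptx", ".xlsx")


def input_uri_has_folder(s3_uri):
    """Return True if the S3 URI points inside a folder, not at the bucket root."""
    parsed = urlparse(s3_uri)
    return (
        parsed.scheme == "s3"
        and bool(parsed.netloc)
        and parsed.path.strip("/") != ""
    )


def check_batch(documents):
    """Return a list of problems for a {file_name: size_in_bytes} mapping."""
    problems = []
    if len(documents) > MAX_DOCUMENTS:
        problems.append("too many documents")
    if sum(documents.values()) > MAX_BATCH_BYTES:
        problems.append("batch larger than 5 GB")
    for name, size in sorted(documents.items()):
        if not name.lower().endswith(ALLOWED_EXTENSIONS):
            problems.append(f"{name}: unsupported file type")
        if size > MAX_DOCUMENT_BYTES:
            problems.append(f"{name}: larger than 20 MB")
    return problems
```

An empty list from check_batch means that none of the checked prerequisites were violated. It does not guarantee that the job will succeed.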

Prerequisite Permissions

Before you can run a batch translation job, your AWS account must have a service role in IAM. This role must have a permissions policy that grants Amazon Translate:

  • Read access to your input folder in Amazon S3.

  • Read and write access to your output bucket.

It must also include a trust policy that allows Amazon Translate to assume the role and gain its permissions. This trust policy must allow the translate.amazonaws.com service principal to perform the sts:AssumeRole action.

You provide the Amazon Resource Name (ARN) of the role when you submit a batch translation job to Amazon Translate.

For more information, see Creating a Role to Delegate Permissions to an AWS Service in the IAM User Guide.

Example Permissions Policy

The following example permissions policy grants read access to an input folder in an Amazon S3 bucket. It grants read and write access to an output bucket.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": [
                "arn:aws:s3:::input-bucket-name/*",
                "arn:aws:s3:::output-bucket-name/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": [
                "arn:aws:s3:::input-bucket-name",
                "arn:aws:s3:::output-bucket-name"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject"
            ],
            "Resource": "arn:aws:s3:::output-bucket-name/*"
        }
    ]
}

Example Trust Policy

The following trust policy allows Amazon Translate to assume the IAM role that the policy belongs to.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "translate.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
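
If you create roles for several bucket pairs, it can help to generate both policy documents from the bucket names. The following sketch builds the same two policies shown above as Python dictionaries. The function names are hypothetical, and the sketch only produces JSON text; it does not create the role in IAM.

```python
import json


def build_permissions_policy(input_bucket, output_bucket):
    """Read access to both buckets, write access to the output bucket only."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": [
                    f"arn:aws:s3:::{input_bucket}/*",
                    f"arn:aws:s3:::{output_bucket}/*",
                ],
            },
            {
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": [
                    f"arn:aws:s3:::{input_bucket}",
                    f"arn:aws:s3:::{output_bucket}",
                ],
            },
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": f"arn:aws:s3:::{output_bucket}/*",
            },
        ],
    }


def build_trust_policy():
    """Allow the Amazon Translate service principal to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "translate.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    # Print both documents so that they can be pasted into the IAM console.
    print(json.dumps(build_permissions_policy("input-bucket-name", "output-bucket-name"), indent=4))
    print(json.dumps(build_trust_policy(), indent=4))
```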

Starting a Batch Translation Job

To submit a batch translation job, use either the Amazon Translate console or the StartTextTranslationJob operation.

Your request must include an InputDataConfig object. This object includes the ContentType parameter, where you specify the format of your input documents with one of the following values:

If your input documents are        Use this value for the ContentType parameter
Plain text (.txt)                  text/plain
HTML (.html or .xml)               text/html
Word document (.docx)              application/vnd.openxmlformats-officedocument.wordprocessingml.document
PowerPoint presentation (.pptx)    application/vnd.openxmlformats-officedocument.presentationml.presentation
Excel workbook (.xlsx)             application/vnd.openxmlformats-officedocument.spreadsheetml.sheet

Important

Amazon Translate does not automatically detect a source language during batch translation jobs.
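
The ContentType values listed above can be looked up from a file extension. The following is a small sketch with a hypothetical helper name. It covers only the extensions listed in the prerequisites and raises an error for anything else.

```python
from pathlib import PurePosixPath

# ContentType values taken from the list above, keyed by file extension.
CONTENT_TYPES = {
    ".txt": "text/plain",
    ".html": "text/html",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


def content_type_for(file_name):
    """Return the ContentType value for a file name, based on its extension."""
    suffix = PurePosixPath(file_name).suffix.lower()
    if suffix not in CONTENT_TYPES:
        raise ValueError(f"unsupported input document type: {file_name}")
    return CONTENT_TYPES[suffix]
```

Because the InputDataConfig object carries a single ContentType value, one job translates documents of one format.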

Example start-text-translation-job command

The following example starts a translation job by using the AWS CLI to run the start-text-translation-job command:

$ aws translate start-text-translation-job --job-name batch-test \
    --source-language-code en \
    --target-language-codes fr \
    --input-data-config S3Uri=s3://input-bucket-name/folder,ContentType=text/plain \
    --output-data-config S3Uri=s3://output-bucket-name/ \
    --data-access-role-arn arn:aws:iam::012345678901:role/service-role/AmazonTranslateInputOutputAccess

This command returns the following output:

{
    "JobStatus": "SUBMITTED",
    "JobId": "1c1838f470806ab9c3e0057f14717bed"
}

Note

Batch translation jobs are long-running operations and can take significant time to complete. For example, batch translation on a small dataset might take a few minutes, while very large datasets might take up to 2 days. Completion time also depends on the availability of resources.
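
The same request can be made from code. The following sketch assumes the AWS SDK for Python (Boto3) and a client created with boto3.client("translate"). The helper name start_batch_job is hypothetical. The client is passed in as an argument so that the function can be exercised with a stand-in object.

```python
def start_batch_job(client, job_name, input_uri, output_uri, role_arn,
                    source_lang, target_langs, content_type="text/plain"):
    """Submit a batch translation job and return (job_id, job_status).

    `client` is assumed to be a Boto3 Amazon Translate client, for example:
        import boto3
        client = boto3.client("translate", region_name="us-east-1")
    """
    # Batch jobs do not detect the source language automatically.
    if source_lang == "auto":
        raise ValueError("batch translation requires an explicit source language")
    response = client.start_text_translation_job(
        JobName=job_name,
        InputDataConfig={"S3Uri": input_uri, "ContentType": content_type},
        OutputDataConfig={"S3Uri": output_uri},
        DataAccessRoleArn=role_arn,
        SourceLanguageCode=source_lang,
        TargetLanguageCodes=list(target_langs),
    )
    return response["JobId"], response["JobStatus"]
```

Keep the returned job ID. You need it to monitor or stop the job.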

Monitoring and Analyzing Batch Translation Jobs

You can use a job's ID to monitor its progress and get the Amazon S3 location of its output documents. To monitor a specific job, use the DescribeTextTranslationJob operation. You can also use the ListTextTranslationJobs operation to retrieve information on all of the translation jobs in your account. To restrict results to jobs that match certain criteria, use the ListTextTranslationJobs operation's filter parameter. You can filter results by job name, job status, or the date and time that the job was submitted.
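
The following sketch shows one way to list jobs with a filter from code. It assumes a Boto3 Amazon Translate client and the Filter keys JobName, JobStatus, SubmittedAfterTime, and SubmittedBeforeTime. The helper name list_jobs is hypothetical.

```python
def list_jobs(client, job_name=None, job_status=None,
              submitted_after=None, submitted_before=None):
    """Yield the properties of each job that matches the filter.

    `client` is assumed to be a Boto3 Amazon Translate client. The time
    arguments are expected to be datetime objects.
    """
    job_filter = {}
    if job_name:
        job_filter["JobName"] = job_name
    if job_status:
        job_filter["JobStatus"] = job_status
    if submitted_after:
        job_filter["SubmittedAfterTime"] = submitted_after
    if submitted_before:
        job_filter["SubmittedBeforeTime"] = submitted_before

    kwargs = {"Filter": job_filter} if job_filter else {}
    while True:
        page = client.list_text_translation_jobs(**kwargs)
        yield from page.get("TextTranslationJobPropertiesList", [])
        token = page.get("NextToken")
        if not token:
            break
        # Request the next page of results.
        kwargs["NextToken"] = token
```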

Example describe-text-translation-job command

The following example checks a job's status by using the AWS CLI to run the describe-text-translation-job command:

$ aws translate describe-text-translation-job --job-id 1c1838f470806ab9c3e0057f14717bed

This command returns the following output:

{
    "TextTranslationJobProperties": {
        "InputDataConfig": {
            "ContentType": "text/plain",
            "S3Uri": "s3://input-bucket-name/folder"
        },
        "EndTime": 1576551359.483,
        "SourceLanguageCode": "en",
        "DataAccessRoleArn": "arn:aws:iam::012345678901:role/service-role/AmazonTranslateInputOutputAccess",
        "JobId": "1c1838f470806ab9c3e0057f14717bed",
        "TargetLanguageCodes": [
            "fr"
        ],
        "JobName": "batch-test",
        "SubmittedTime": 1576544017.357,
        "JobStatus": "COMPLETED",
        "Message": "Your job has completed successfully.",
        "JobDetails": {
            "InputDocumentsCount": 77,
            "DocumentsWithErrorsCount": 0,
            "TranslatedDocumentsCount": 77
        },
        "OutputDataConfig": {
            "S3Uri": "s3://bucket-name/output/012345678901-TranslateText-1c1838f470806ab9c3e0057f14717bed/"
        }
    }
}
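
Because jobs can run for a long time, monitoring usually means polling DescribeTextTranslationJob at an interval. The following sketch assumes a Boto3 Amazon Translate client. The helper name wait_for_job, the 60-second default interval, and the set of final statuses (including STOPPED, which does not appear in the examples above) are assumptions.

```python
import time

# Statuses after which the job no longer changes (assumed set).
FINAL_STATUSES = {"COMPLETED", "COMPLETED_WITH_ERROR", "FAILED", "STOPPED"}


def wait_for_job(client, job_id, poll_seconds=60, max_polls=None, sleep=time.sleep):
    """Poll a job until it reaches a final status and return its properties.

    `client` is assumed to be a Boto3 Amazon Translate client. `sleep` can
    be replaced, for example in tests, to avoid real waiting.
    """
    polls = 0
    while True:
        response = client.describe_text_translation_job(JobId=job_id)
        properties = response["TextTranslationJobProperties"]
        if properties["JobStatus"] in FINAL_STATUSES:
            return properties
        polls += 1
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError(f"job {job_id} is still {properties['JobStatus']}")
        sleep(poll_seconds)
```

The returned properties include the OutputDataConfig S3Uri value, which is where the translated documents are written.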

You can stop a batch translation job while its status is IN_PROGRESS by using the StopTextTranslationJob operation.

Example stop-text-translation-job command

The following example stops a batch translation job by using the AWS CLI to run the stop-text-translation-job command:

$ aws translate stop-text-translation-job --job-id 5236d36ce5192abdb3e2519f3ab8b065

This command returns the following output:

{
    "TextTranslationJobProperties": {
        "InputDataConfig": {
            "ContentType": "text/plain",
            "S3Uri": "s3://input-bucket-name/folder"
        },
        "SourceLanguageCode": "en",
        "DataAccessRoleArn": "arn:aws:iam::012345678901:role/service-role/AmazonTranslateInputOutputAccess",
        "TargetLanguageCodes": [
            "fr"
        ],
        "JobName": "canceled-test",
        "SubmittedTime": 1576558958.167,
        "JobStatus": "STOP_REQUESTED",
        "JobId": "5236d36ce5192abdb3e2519f3ab8b065",
        "OutputDataConfig": {
            "S3Uri": "s3://output-bucket-name/012345678901-TranslateText-5236d36ce5192abdb3e2519f3ab8b065/"
        }
    }
}
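
Because a job can be stopped only while its status is IN_PROGRESS, code that stops jobs might check the status first. The following sketch assumes a Boto3 Amazon Translate client. The helper name stop_if_in_progress is hypothetical. It accepts a stop response shaped like the sample output above, and, as an assumption, also one that returns JobStatus at the top level.

```python
def stop_if_in_progress(client, job_id):
    """Stop the job if it is running; return the resulting job status.

    `client` is assumed to be a Boto3 Amazon Translate client.
    """
    described = client.describe_text_translation_job(JobId=job_id)
    status = described["TextTranslationJobProperties"]["JobStatus"]
    if status != "IN_PROGRESS":
        # Nothing to stop; report the current status unchanged.
        return status
    stopped = client.stop_text_translation_job(JobId=job_id)
    # Read JobStatus from the nested properties if present, else top level.
    return stopped.get("TextTranslationJobProperties", stopped)["JobStatus"]
```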

Getting Batch Translation Results

Once the job's status is COMPLETED or COMPLETED_WITH_ERROR, your output documents are available in the Amazon S3 folder you specified. The output document names match the input document names, with the addition of the target language code as a prefix. For instance, if you translated a document called mySourceText.txt into French, the output document will be called fr.mySourceText.txt.
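
The naming rule above can be expressed as a short helper, for example to work out which output object corresponds to a given input document. The function names are hypothetical, and the sketch assumes that the output documents sit directly under the job's output location.

```python
def output_document_name(input_name, target_language_code):
    """Prefix the input document name with the target language code."""
    return f"{target_language_code}.{input_name}"


def output_document_uri(output_s3_uri, input_name, target_language_code):
    """Join the job's output location and the output document name."""
    name = output_document_name(input_name, target_language_code)
    return output_s3_uri.rstrip("/") + "/" + name


print(output_document_name("mySourceText.txt", "fr"))  # prints fr.mySourceText.txt
```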

If the status of a batch translation job is FAILED, the DescribeTextTranslationJob operation response includes a Message field that describes the reason why the job didn't complete successfully.

Each batch translation job also generates an auxiliary file that contains information on the translations performed, such as the total number of characters translated and the number of errors encountered. This file, called target-language-code.auxiliary-translation-details.json, is generated in the details subfolder of your output folder.

The following is an example of a batch translation auxiliary file.

{
    "sourceLanguageCode": "en",
    "targetLanguageCode": "fr",
    "charactersTranslated": "105",
    "documentCountWithCustomerError": "0",
    "documentCountWithServerError": "0",
    "inputDataPrefix": "s3://input-bucket-name/folder",
    "outputDataPrefix": "s3://output-bucket-name/012345678901-TranslateText-1c1838f470806ab9c3e0057f14717bed/",
    "details": [
        {
            "sourceFile": "mySourceText.txt",
            "targetFile": "fr.mySourceText.txt",
            "auxiliaryData": {
                "appliedTerminologies": [
                    {
                        "name": "TestTerminology",
                        "terms": [
                            {
                                "sourceText": "Amazon",
                                "targetText": "Amazon"
                            }
                        ]
                    }
                ]
            }
        },
        {
            "sourceFile": "batchText.txt",
            "targetFile": "fr.batchText.txt",
            "auxiliaryData": {
                "appliedTerminologies": [
                    {
                        "name": "TestTerminology",
                        "terms": [
                            {
                                "sourceText": "Amazon",
                                "targetText": "Amazon"
                            }
                        ]
                    }
                ]
            }
        }
    ]
}
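
After you download the auxiliary file, you can summarize it with a few lines of code. The following sketch assumes the file has the structure shown in the example above, in which the counts are stored as strings. The function names are hypothetical.

```python
import json


def auxiliary_file_name(target_language_code):
    """Name of the auxiliary file in the details subfolder for one language."""
    return f"{target_language_code}.auxiliary-translation-details.json"


def summarize_auxiliary(text):
    """Return the counts and the source-to-target file mapping from the file."""
    data = json.loads(text)
    return {
        # The counts are strings in the file, so convert them to integers.
        "characters_translated": int(data["charactersTranslated"]),
        "customer_errors": int(data["documentCountWithCustomerError"]),
        "server_errors": int(data["documentCountWithServerError"]),
        "files": {d["sourceFile"]: d["targetFile"] for d in data["details"]},
    }
```

For the example file above, the summary reports 105 translated characters, no errors, and two translated files.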