Amazon Comprehend
Developer Guide

Batch APIs

Use Amazon Comprehend Medical to analyze medical text stored in an Amazon S3 bucket. You can analyze up to 10 GB of documents in one batch. You can create and manage batch analysis jobs in the console, or use the batch APIs to detect medical entities and protected health information (PHI). The APIs start, stop, list, and describe analysis jobs.

To run a batch analysis job, do the following:

  1. If you are using the Amazon Comprehend Medical API, create an IAM policy and attach it to a role. For more information, see IAM Policies for Batch Operations. If you are using the console, an IAM role is created for you.

  2. Upload your medical text to an Amazon S3 bucket.

  3. Start a new analysis job from the console, or call either the StartEntitiesDetectionV2Job operation or the StartPHIDetectionJob operation. When you start the job, you specify the S3 bucket that contains the input files and the S3 bucket where Amazon Comprehend Medical writes the output files.

  4. Monitor the progress of the job by using the console, the DescribeEntitiesDetectionV2Job operation, or the DescribePHIDetectionJob operation.

  5. Get the results of your analysis job from the output S3 bucket that you configured when you started the job.
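The API path through the steps above can be sketched in Python. This is a minimal sketch, not a definitive implementation. It assumes the AWS SDK for Python (boto3), whose comprehendmedical client provides start_entities_detection_v2_job and describe_entities_detection_v2_job. The helper names build_start_request and wait_for_job, the polling interval, and the retry limit are hypothetical. The client is passed in as an argument, so the block itself does not import boto3.

```python
import time

# States after which the job no longer changes. These match the Status
# values listed in the batch manifest example.
TERMINAL_STATES = {"COMPLETED", "PARTIAL_SUCCESS", "FAILED", "STOPPED"}


def build_start_request(input_bucket, output_bucket, role_arn, job_name):
    """Build keyword arguments for the StartEntitiesDetectionV2Job operation."""
    return {
        "InputDataConfig": {"S3Bucket": input_bucket},
        "OutputDataConfig": {"S3Bucket": output_bucket},
        "DataAccessRoleArn": role_arn,
        "JobName": job_name,
        "LanguageCode": "en",
    }


def wait_for_job(client, job_id, poll_seconds=30, max_polls=120):
    """Poll DescribeEntitiesDetectionV2Job until the job reaches a terminal state."""
    for _ in range(max_polls):
        response = client.describe_entities_detection_v2_job(JobId=job_id)
        status = response["ComprehendMedicalAsyncJobProperties"]["JobStatus"]
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

With boto3 installed and credentials configured, you might create the client with boto3.client("comprehendmedical"), pass build_start_request(...) as keyword arguments to client.start_entities_detection_v2_job, read the "JobId" field of the response, and then call wait_for_job(client, job_id). For PHI jobs, the corresponding boto3 methods are start_phi_detection_job and describe_phi_detection_job.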

Output Files

Amazon Comprehend Medical writes one output file for each input file in the batch. Each output file has the extension .out. Amazon Comprehend Medical first creates a directory named AwsAccountId-JobType-JobId in the output S3 bucket, and then writes all of the output files for the batch to this directory. Creating a new directory for each job ensures that output from one job does not overwrite the output of another.
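To make the naming rule concrete, the following sketch builds the directory name and the key of one output file. The function names are hypothetical, and two details are assumptions rather than statements from this guide: that .out is appended to the full input file name, and that the job directory sits directly under the output path that you configured.

```python
def job_output_directory(account_id, job_type, job_id):
    """Return the directory name in the AwsAccountId-JobType-JobId form."""
    return f"{account_id}-{job_type}-{job_id}"


def output_key(output_prefix, account_id, job_type, job_id, input_file_name):
    """Return the assumed S3 key of the output file for one input file."""
    directory = job_output_directory(account_id, job_type, job_id)
    # Assumption: ".out" is appended to the input name (notes.txt -> notes.txt.out).
    parts = [output_prefix.strip("/"), directory, input_file_name + ".out"]
    # Skip an empty prefix so that the key never starts with "/".
    return "/".join(part for part in parts if part)
```

Check the actual keys in your output bucket before relying on this layout.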

A batch operation produces the same output as the corresponding synchronous operation. For examples of the output generated by Amazon Comprehend Medical, see Detect Entities.

Each batch operation produces three manifest files that contain information about the job.

  • Manifest – Summarizes the job. Provides information about the parameters used for the job, the total size of the job, and the number of files processed.

  • Success – Provides information about the files that were successfully processed. Includes the input and output file name and the size of the input file.

  • Unprocessed – Lists files that the batch job did not process. Typically this is because the file was added to the input directory after the batch job was started.

Amazon Comprehend Medical writes the files to the output directory that you specified for the batch job. The following are examples of the manifest files.

Batch Manifest File

The following is an example of the batch manifest file.

{
  "Summary" : {
    "Status" : "COMPLETED | FAILED | PARTIAL_SUCCESS | STOPPED",
    "JobType" : "DetectEntitiesJob | PHIDetection",
    "InputDataConfiguration" : {
      "Bucket" : "input bucket",
      "Path" : "path to files/account ID-job type-job ID"
    },
    "OutputDataConfiguration" : {
      "Bucket" : "output bucket",
      "Path" : "path to files"
    },
    "InputFileCount" : number of files in input bucket,
    "TotalMeteredCharacters" : total characters processed from all files,
    "UnprocessedFilesCount" : number of files not processed,
    "SuccessFilesCount" : total number of files processed,
    "TotalDurationSeconds" : time required for processing,
    "SuccessfulFilesListLocation" : "path to file",
    "UnprocessedFilesListLocation" : "path to file"
  }
}
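After you download the batch manifest, you might summarize it as in the following sketch. The function name summarize_manifest and the needs_attention flag are hypothetical, and the sample values are made up. Only the field names come from the example above.

```python
import json


def summarize_manifest(manifest_text):
    """Extract the status and file counts from a batch manifest document."""
    summary = json.loads(manifest_text)["Summary"]
    unprocessed = summary["UnprocessedFilesCount"]
    return {
        "status": summary["Status"],
        "input_files": summary["InputFileCount"],
        "succeeded": summary["SuccessFilesCount"],
        "unprocessed": unprocessed,
        # Flag anything other than a clean, complete run for follow-up.
        "needs_attention": summary["Status"] != "COMPLETED" or unprocessed > 0,
    }


# Made-up sample values for illustration only.
sample = json.dumps({
    "Summary": {
        "Status": "PARTIAL_SUCCESS",
        "InputFileCount": 3,
        "SuccessFilesCount": 2,
        "UnprocessedFilesCount": 1,
    }
})
print(summarize_manifest(sample))
```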

Success Manifest File

The following is an example of the file that contains information about successfully processed files.

{
  "Files": [
    {
      "Input": "input path/input file name",
      "Output": "output path/output file name",
      "InputSize": size in bytes of input file
    },
    {
      "Input": "input path/input file name",
      "Output": "output path/output file name",
      "InputSize": size in bytes of input file
    }
  ]
}

Unprocessed Manifest File

The following is an example of the manifest file that contains information about unprocessed files.

{
  "Files": [
    "input path/input file name",
    "input path/input file name"
  ]
}
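The two file lists can be read together to find where each output landed and which inputs to submit again. The following is a sketch under stated assumptions: the function name read_file_lists and the keys of the returned dictionary are hypothetical, and the sample paths and sizes are made up.

```python
import json


def read_file_lists(success_text, unprocessed_text):
    """Combine the success and unprocessed manifests into one summary."""
    success = json.loads(success_text)["Files"]
    unprocessed = json.loads(unprocessed_text)["Files"]
    return {
        # Map each processed input file to its output file.
        "outputs": {entry["Input"]: entry["Output"] for entry in success},
        # Total size in bytes of the inputs that were processed.
        "processed_bytes": sum(entry["InputSize"] for entry in success),
        # Inputs to include in a follow-up job, without duplicates.
        "retry": sorted(set(unprocessed)),
    }


# Made-up sample documents for illustration only.
success_sample = json.dumps({"Files": [
    {"Input": "in/a.txt", "Output": "out/a.txt.out", "InputSize": 120},
    {"Input": "in/b.txt", "Output": "out/b.txt.out", "InputSize": 80},
]})
unprocessed_sample = json.dumps({"Files": ["in/c.txt"]})
print(read_file_lists(success_sample, unprocessed_sample))
```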

IAM Policies for Batch Operations

The IAM role that calls the Amazon Comprehend Medical batch APIs must have a policy that enables access to the Amazon S3 buckets that contain the input and output files. It must also be assigned a trust relationship that enables the Amazon Comprehend Medical service to assume the role.

The role must have the following policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [ "s3:GetObject" ],
      "Resource": [ "arn:aws:s3:::input-bucket/*" ],
      "Effect": "Allow"
    },
    {
      "Action": [ "s3:ListBucket" ],
      "Resource": [
        "arn:aws:s3:::input-bucket",
        "arn:aws:s3:::output-bucket"
      ],
      "Effect": "Allow"
    },
    {
      "Action": [ "s3:PutObject" ],
      "Resource": [ "arn:aws:s3:::output-bucket/*" ],
      "Effect": "Allow"
    }
  ]
}
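If you manage several bucket pairs, you might generate this policy instead of editing it by hand. The following sketch uses a hypothetical helper named build_batch_policy. Serializing the result with json.dumps guarantees that the document is syntactically valid JSON.

```python
import json


def build_batch_policy(input_bucket, output_bucket):
    """Return the S3 access policy for a batch job as a dictionary."""
    input_arn = f"arn:aws:s3:::{input_bucket}"
    output_arn = f"arn:aws:s3:::{output_bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Read the input files.
            {"Action": ["s3:GetObject"], "Resource": [f"{input_arn}/*"], "Effect": "Allow"},
            # List the contents of both buckets.
            {"Action": ["s3:ListBucket"], "Resource": [input_arn, output_arn], "Effect": "Allow"},
            # Write the output files.
            {"Action": ["s3:PutObject"], "Resource": [f"{output_arn}/*"], "Effect": "Allow"},
        ],
    }


policy_json = json.dumps(build_batch_policy("input-bucket", "output-bucket"), indent=2)
print(policy_json)
```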

The role must have the following trust relationship:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": [ "" ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
