Starting a custom entity detection job (API)
You can use the API to start and monitor an asynchronous analysis job for custom entity recognition.
To start a custom entity detection job with the StartEntitiesDetectionJob operation, you provide the EntityRecognizerArn, which is the Amazon Resource Name (ARN) of the trained model. You can find this ARN in the response to the CreateEntityRecognizer operation.
Detecting custom entities using the AWS Command Line Interface
Use the following example for Unix, Linux, and macOS environments. For Windows, replace the backslash (\) Unix continuation character at the end of each line with a caret (^). To detect custom entities in a document set, use the following request syntax:
aws comprehend start-entities-detection-job \
    --entity-recognizer-arn "arn:aws:comprehend:region:account number:entity-recognizer/test-6" \
    --job-name infer-1 \
    --data-access-role-arn "arn:aws:iam::account number:role/service-role/AmazonComprehendServiceRole-role" \
    --language-code en \
    --input-data-config "S3Uri=s3://Bucket Name/Bucket Path" \
    --output-data-config "S3Uri=s3://Bucket Name/Bucket Path/" \
    --region region
Amazon Comprehend responds with the JobId and JobStatus, and writes the output from the job to the S3 bucket that you specified in the request.
Detecting custom entities using the AWS SDK for Java
For Amazon Comprehend examples that use Java, see Amazon Comprehend Java examples.
Detecting custom entities using the AWS SDK for Python (Boto3)
This example creates a custom entity recognizer, trains the model, and then runs it in an entity recognizer job using the AWS SDK for Python (Boto3).
Instantiate the SDK for Python.
import sys
import time
import uuid

import boto3

comprehend = boto3.client("comprehend", region_name="region")
Create an entity recognizer:
response = comprehend.create_entity_recognizer(
    RecognizerName="Recognizer-Name-Goes-Here-{}".format(str(uuid.uuid4())),
    LanguageCode="en",
    DataAccessRoleArn="Role ARN",
    InputDataConfig={
        "EntityTypes": [
            {
                "Type": "ENTITY_TYPE"
            }
        ],
        "Documents": {
            "S3Uri": "s3://Bucket Name/Bucket Path/documents"
        },
        "Annotations": {
            "S3Uri": "s3://Bucket Name/Bucket Path/annotations"
        }
    }
)
recognizer_arn = response["EntityRecognizerArn"]
List all recognizers:
response = comprehend.list_entity_recognizers()
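The list call above returns its results in pages. As a minimal sketch of reading them, the helper below follows NextToken until every page has been collected. The function name list_all_recognizers is hypothetical, and the sketch assumes the response carries the recognizers under EntityRecognizerPropertiesList; it takes the client as an argument so it can be tried with a stand-in object.

```python
def list_all_recognizers(client):
    """Collect every recognizer summary, following NextToken across pages.

    `client` is assumed to be a boto3 Comprehend client (or any object
    exposing a compatible list_entity_recognizers method).
    """
    recognizers = []
    kwargs = {}
    while True:
        page = client.list_entity_recognizers(**kwargs)
        recognizers.extend(page.get("EntityRecognizerPropertiesList", []))
        token = page.get("NextToken")
        if not token:
            # No further pages to fetch.
            return recognizers
        kwargs["NextToken"] = token


# Usage with the client created earlier (not executed here):
# for item in list_all_recognizers(comprehend):
#     print(item["EntityRecognizerArn"], item["Status"])
```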
Wait for the entity recognizer to reach TRAINED status:
while True:
    response = comprehend.describe_entity_recognizer(
        EntityRecognizerArn=recognizer_arn
    )
    status = response["EntityRecognizerProperties"]["Status"]
    # Stop with a non-zero exit code if training failed.
    if status == "IN_ERROR":
        sys.exit(1)
    # The model is ready to use.
    if status == "TRAINED":
        break
    # Wait before polling again.
    time.sleep(10)
Start a custom entities detection job:
response = comprehend.start_entities_detection_job(
    EntityRecognizerArn=recognizer_arn,
    JobName="Detection-Job-Name-{}".format(str(uuid.uuid4())),
    LanguageCode="en",
    DataAccessRoleArn="Role ARN",
    InputDataConfig={
        "InputFormat": "ONE_DOC_PER_LINE",
        "S3Uri": "s3://Bucket Name/Bucket Path/documents"
    },
    OutputDataConfig={
        "S3Uri": "s3://Bucket Name/Bucket Path/output"
    }
)
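Because the detection job runs asynchronously, the start call returns before any results exist. The following is a minimal sketch of monitoring the job, assuming you poll describe_entities_detection_job until JobStatus reaches COMPLETED, FAILED, or STOPPED. The function name wait_for_entities_job, the polling interval, and the timeout are illustrative choices, not part of the API.

```python
import time

# Statuses after which the job will not change again.
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "STOPPED"}


def wait_for_entities_job(client, job_id, poll_seconds=30,
                          timeout_seconds=3600, sleep=time.sleep):
    """Poll a detection job until it finishes; return its properties.

    `client` is assumed to be a boto3 Comprehend client. `sleep` is
    injectable so the loop can be exercised without real delays.
    """
    waited = 0
    while True:
        response = client.describe_entities_detection_job(JobId=job_id)
        props = response["EntitiesDetectionJobProperties"]
        if props["JobStatus"] in TERMINAL_STATUSES:
            return props
        if waited >= timeout_seconds:
            raise TimeoutError(
                "Job {} still {} after {} seconds".format(
                    job_id, props["JobStatus"], waited
                )
            )
        sleep(poll_seconds)
        waited += poll_seconds


# Usage with the response from start_entities_detection_job (not executed here):
# props = wait_for_entities_job(comprehend, response["JobId"])
# print(props["JobStatus"], props["OutputDataConfig"]["S3Uri"])
```

Returning the full properties dictionary rather than only the status lets the caller check for FAILED and read the output location in one step.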
Overriding API actions for PDF files
For image files and PDF files, you can override the default extraction actions using the DocumentReaderConfig parameter in InputDataConfig.
The following example defines a JSON file named myInputDataConfig.json to set the InputDataConfig values. It sets DocumentReaderConfig to use the Amazon Textract DetectDocumentText API for all PDF files.
"InputDataConfig": {
    "S3Uri": "s3://Bucket Name/Bucket Path",
    "InputFormat": "ONE_DOC_PER_FILE",
    "DocumentReaderConfig": {
        "DocumentReadAction": "TEXTRACT_DETECT_DOCUMENT_TEXT",
        "DocumentReadMode": "FORCE_DOCUMENT_READ_ACTION"
    }
}
In the StartEntitiesDetectionJob operation, specify the myInputDataConfig.json file as the InputDataConfig parameter:
--input-data-config file://myInputDataConfig.json
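If you generate the configuration file from a script, a sketch along these lines may help. It assumes the file passed with file:// holds the value of the InputDataConfig parameter, that is, the object with S3Uri, InputFormat, and DocumentReaderConfig. The helper names build_input_data_config and write_config are hypothetical, as is the bucket path in the usage comment.

```python
import json


def build_input_data_config(s3_uri,
                            read_action="TEXTRACT_DETECT_DOCUMENT_TEXT",
                            read_mode="FORCE_DOCUMENT_READ_ACTION"):
    """Return an InputDataConfig value that overrides text extraction."""
    return {
        "S3Uri": s3_uri,
        "InputFormat": "ONE_DOC_PER_FILE",
        "DocumentReaderConfig": {
            "DocumentReadAction": read_action,
            "DocumentReadMode": read_mode,
        },
    }


def write_config(path, config):
    """Write the configuration as JSON for use with file:// in the CLI."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=2)


# Usage (placeholder bucket and path; not executed here):
# config = build_input_data_config("s3://Bucket Name/Bucket Path")
# write_config("myInputDataConfig.json", config)
```

The same dictionary can also be passed directly as the InputDataConfig argument of start_entities_detection_job in Boto3, which avoids the intermediate file.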
For more information about the DocumentReaderConfig parameters, see Setting text extraction options.