ClassifyDocument
Creates a classification request to analyze a single document in real time. ClassifyDocument supports the following model types:
-
Custom classifier - a custom model that you have created and trained. For input, you can provide plain text, a single-page document (PDF, Word, or image), or Amazon Textract API output. For more information, see Custom classification in the Amazon Comprehend Developer Guide.
-
Prompt safety classifier - Amazon Comprehend provides a pre-trained model for classifying input prompts for generative AI applications. For input, you provide English plain text. For prompt safety classification, the response includes only the Classes field. For more information about prompt safety classifiers, see Prompt safety classification in the Amazon Comprehend Developer Guide.
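As a minimal sketch of consuming a prompt safety result, the helper below reads the Classes array of a response that has already been parsed into a Python dict. The function name, the 0.5 threshold, and the sample scores are assumptions made for illustration; the class names SAFE_PROMPT and UNSAFE_PROMPT are the ones this reference lists under Response Elements.

```python
def is_prompt_safe(response, threshold=0.5):
    """Return True when the SAFE_PROMPT score meets the (assumed) threshold."""
    # Map each class name to its confidence score (0 to 1).
    scores = {c["Name"]: c["Score"] for c in response.get("Classes", [])}
    # A missing SAFE_PROMPT entry is treated as unsafe in this sketch.
    return scores.get("SAFE_PROMPT", 0.0) >= threshold


# Illustrative response shape; the scores are made up.
sample_response = {
    "Classes": [
        {"Name": "SAFE_PROMPT", "Score": 0.9},
        {"Name": "UNSAFE_PROMPT", "Score": 0.1},
    ]
}
```

Choosing a stricter threshold trades more rejected prompts for fewer unsafe prompts passed through; the right value depends on the application.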
If the system detects errors while processing a page in the input document, the API response includes an Errors field that describes the errors.
If the system detects a document-level error in your input document, the API returns an InvalidRequestException error response. For details about this exception, see Errors in semi-structured documents in the Comprehend Developer Guide.
Request Syntax
{
"Bytes": blob,
"DocumentReaderConfig": {
"DocumentReadAction": "string",
"DocumentReadMode": "string",
"FeatureTypes": [ "string" ]
},
"EndpointArn": "string",
"Text": "string"
}
Request Parameters
For information about the parameters that are common to all actions, see Common Parameters.
The request accepts the following data in JSON format.
- Bytes
-
Use the Bytes parameter to input a text, PDF, Word, or image file.
When you classify a document using a custom model, you can also use the Bytes parameter to input an Amazon Textract DetectDocumentText or AnalyzeDocument output file.
To classify a document using the prompt safety classifier, use the Text parameter for input.
Provide the input document as a sequence of base64-encoded bytes. If your code uses an AWS SDK to classify documents, the SDK may encode the document file bytes for you.
The maximum length of this field depends on the input document type. For details, see Inputs for real-time custom analysis in the Comprehend Developer Guide.
If you use the Bytes parameter, do not use the Text parameter.
Type: Base64-encoded binary data object
Length Constraints: Minimum length of 1.
Required: No
- DocumentReaderConfig
-
Provides configuration parameters to override the default actions for extracting text from PDF documents and image files.
Type: DocumentReaderConfig object
Required: No
- EndpointArn
-
The Amazon Resource Name (ARN) of the endpoint.
For prompt safety classification, Amazon Comprehend provides the endpoint ARN. For more information about prompt safety classifiers, see Prompt safety classification in the Amazon Comprehend Developer Guide.
For custom classification, you create an endpoint for your custom model. For more information, see Using Amazon Comprehend endpoints.
Type: String
Length Constraints: Maximum length of 256.
Pattern:
arn:aws(-[^:]+)?:comprehend:[a-zA-Z0-9-]*:([0-9]{12}|aws):document-classifier-endpoint/[a-zA-Z0-9](-*[a-zA-Z0-9])*
Required: Yes
- Text
-
The document text to be analyzed. If you enter text using this parameter, do not use the Bytes parameter.
Type: String
Length Constraints: Minimum length of 1.
Required: No
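The constraints above can be checked on the client before a request is sent. The following is a sketch only: build_classify_request is a hypothetical helper, not part of any SDK. The ARN pattern and the 256-character limit are copied from the EndpointArn entry, and the Bytes/Text exclusivity follows the parameter descriptions. Requiring that one of the two inputs is present is an assumption of this sketch, since both are marked "Required: No".

```python
import re

# Pattern and length limit copied from the EndpointArn entry in this reference.
ENDPOINT_ARN_PATTERN = re.compile(
    r"arn:aws(-[^:]+)?:comprehend:[a-zA-Z0-9-]*:([0-9]{12}|aws):"
    r"document-classifier-endpoint/[a-zA-Z0-9](-*[a-zA-Z0-9])*"
)
MAX_ARN_LENGTH = 256


def build_classify_request(endpoint_arn, text=None, document_bytes=None,
                           reader_config=None):
    """Build a ClassifyDocument parameter dict (hypothetical helper)."""
    if (len(endpoint_arn) > MAX_ARN_LENGTH
            or not ENDPOINT_ARN_PATTERN.fullmatch(endpoint_arn)):
        raise ValueError("EndpointArn does not match the documented pattern")
    if text is not None and document_bytes is not None:
        raise ValueError("Use either Text or Bytes, not both")
    if text is None and document_bytes is None:
        # Assumption of this sketch: a request needs one input source.
        raise ValueError("Provide Text or Bytes")

    request = {"EndpointArn": endpoint_arn}
    if text is not None:
        if len(text) < 1:
            raise ValueError("Text has a minimum length of 1")
        request["Text"] = text
    else:
        if len(document_bytes) < 1:
            raise ValueError("Bytes has a minimum length of 1")
        request["Bytes"] = document_bytes
    if reader_config is not None:
        request["DocumentReaderConfig"] = reader_config
    return request
```

With the AWS SDK for Python, a dict built this way could be passed as keyword arguments to the Comprehend client's classify_document method. As noted under Bytes, the SDK may handle the base64 encoding, so the helper leaves the bytes unencoded.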
Response Syntax
{
"Classes": [
{
"Name": "string",
"Page": number,
"Score": number
}
],
"DocumentMetadata": {
"ExtractedCharacters": [
{
"Count": number,
"Page": number
}
],
"Pages": number
},
"DocumentType": [
{
"Page": number,
"Type": "string"
}
],
"Errors": [
{
"ErrorCode": "string",
"ErrorMessage": "string",
"Page": number
}
],
"Labels": [
{
"Name": "string",
"Page": number,
"Score": number
}
],
"Warnings": [
{
"Page": number,
"WarnCode": "string",
"WarnMessage": "string"
}
]
}
Response Elements
If the action is successful, the service sends back an HTTP 200 response.
The following data is returned in JSON format by the service.
- Classes
-
The classes used by the document being analyzed. These are used for models trained in multi-class mode. Individual classes are mutually exclusive and each document is expected to have only a single class assigned to it. For example, an animal can be a dog or a cat, but not both at the same time.
For prompt safety classification, the response includes only two classes (SAFE_PROMPT and UNSAFE_PROMPT), along with a confidence score for each class. The value range of the score is zero to one, where one is the highest confidence.
Type: Array of DocumentClass objects
- DocumentMetadata
-
Extraction information about the document. This field is present in the response only if your request includes the Bytes parameter.
Type: DocumentMetadata object
- DocumentType
-
The document type for each page in the input document. This field is present in the response only if your request includes the Bytes parameter.
Type: Array of DocumentTypeListItem objects
- Errors
-
Page-level errors that the system detected while processing the input document. The field is empty if the system encountered no errors.
Type: Array of ErrorsListItem objects
- Labels
-
The labels used in the document being analyzed. These are used for multi-label trained models. Individual labels represent different categories that are related in some manner and are not mutually exclusive. For example, a movie can be just an action movie, or it can be an action movie, a science fiction movie, and a comedy, all at the same time.
Type: Array of DocumentLabel objects
- Warnings
-
Warnings detected while processing the input document. The response includes a warning if there is a mismatch between the input document type and the model type associated with the endpoint that you specified. The response can also include warnings for individual pages that have a mismatch.
The field is empty if the system generated no warnings.
Type: Array of WarningsListItem objects
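To illustrate how the response elements fit together, the sketch below condenses a parsed response into a per-page summary. The helper name summarize_response, the 0.5 label threshold, and the shape of the returned dict are assumptions made for this example. Only the field names (Classes, Labels, Errors, Warnings, Name, Score, Page, ErrorCode, WarnCode) come from the response syntax above.

```python
def summarize_response(response, label_threshold=0.5):
    """Summarize a parsed ClassifyDocument response (hypothetical helper)."""
    # Multi-class models: keep the highest-scoring class for each page.
    best = {}
    for cls in response.get("Classes", []):
        page = cls.get("Page")  # may be absent for plain-text input
        if page not in best or cls["Score"] > best[page]["Score"]:
            best[page] = cls
    top_class_by_page = {page: cls["Name"] for page, cls in best.items()}

    # Multi-label models: labels are not mutually exclusive, so keep
    # every label whose score reaches the (assumed) threshold.
    labels = [
        label["Name"]
        for label in response.get("Labels", [])
        if label["Score"] >= label_threshold
    ]

    # Group page-level error and warning codes by page number.
    errors_by_page = {}
    for err in response.get("Errors", []):
        errors_by_page.setdefault(err.get("Page"), []).append(err["ErrorCode"])
    warnings_by_page = {}
    for warn in response.get("Warnings", []):
        warnings_by_page.setdefault(warn.get("Page"), []).append(warn["WarnCode"])

    return {
        "top_class_by_page": top_class_by_page,
        "labels": labels,
        "errors_by_page": errors_by_page,
        "warnings_by_page": warnings_by_page,
    }
```

Grouping by Page keeps results for multi-page inputs separate, so a page that produced an error can be handled without discarding the classifications for the other pages.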
Errors
For information about the errors that are common to all actions, see Common Errors.
- InternalServerException
-
An internal server error occurred. Retry your request.
HTTP Status Code: 500
- InvalidRequestException
-
The request is invalid.
HTTP Status Code: 400
- ResourceUnavailableException
-
The specified resource is not available. Check the resource and try your request again.
HTTP Status Code: 400
- TextSizeLimitExceededException
-
The size of the input text exceeds the limit. Use a smaller document.
HTTP Status Code: 400
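One way to act on these errors is to separate the one the list explicitly says to retry from those that need a change to the request or resource first. The sketch below encodes that as a lookup. The table is copied from the list above, while the function name and the policy of retrying automatically only on a 5xx status are assumptions, not documented service behavior.

```python
# Error names and HTTP status codes copied from the Errors list above.
DOCUMENTED_ERRORS = {
    "InternalServerException": 500,
    "InvalidRequestException": 400,
    "ResourceUnavailableException": 400,
    "TextSizeLimitExceededException": 400,
}


def should_retry(error_code):
    """Return True for errors worth retrying unchanged (assumed policy)."""
    status = DOCUMENTED_ERRORS.get(error_code)
    # Only server-side (5xx) failures are retried automatically. 4xx errors
    # call for fixing the request or checking the resource first.
    return status is not None and status >= 500
```

With the AWS SDK for Python, the error name is typically available from a caught botocore ClientError as err.response["Error"]["Code"], which can be passed to this function.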
See Also
For more information about using this API in one of the language-specific AWS SDKs, see the following: