Data format - Text Analysis with Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) and Amazon Comprehend

Data format

This solution includes built-in preprocessing logic that invokes Amazon Comprehend to help customers extract insights from indexed documents. Before a document is indexed in Amazon OpenSearch Service, it is extended with the Amazon Comprehend results. This section describes the extended document format in detail.

For example, an original uploaded document may have the following JSON structure:

{ "timestamp": string, "transcript": string }

By default, the following Amazon Comprehend operations are used to preprocess the original document. You can modify this list based on your specific business needs.

  1. DetectDominantLanguage

  2. DetectSentiment

  3. DetectEntities

  4. DetectSyntax

  5. DetectKeyPhrases

When the original document is preprocessed with the Amazon Comprehend operations, the document will be extended into the following structure:

{ "timestamp": string, "transcript": string, "transcript_DetectDominantLanguage": nested, "transcript_DetectSentiment": object, "transcript_DetectEntities": nested, "transcript_DetectKeyPhrases": nested, "transcript_DetectSyntax": nested }
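The naming convention of the extended fields can be sketched in Python as follows. This is a minimal illustration, not the solution's actual preprocessing code: the function name `extend_document` and its `responses` argument are hypothetical, and the sample responses are made-up values that only follow the shape of the Amazon Comprehend API output.

```python
from typing import Any, Dict


def extend_document(document: Dict[str, Any], field: str,
                    responses: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `document` with one extra key per Comprehend operation.

    Each key is named "<field>_<operation>", as in the structure shown above.
    `responses` (hypothetical) maps an operation name to its API response.
    """
    extended = dict(document)  # shallow copy; the input document is not modified
    for operation, response in responses.items():
        extended[f"{field}_{operation}"] = response
    return extended


# Illustrative input only; real responses come from Amazon Comprehend.
doc = {"timestamp": "2021-01-01T00:00:00Z", "transcript": "Hello from Seattle."}
responses = {
    "DetectDominantLanguage": {"Languages": [{"LanguageCode": "en", "Score": 0.99}]},
    "DetectSentiment": {"Sentiment": "NEUTRAL"},
}
extended = extend_document(doc, "transcript", responses)
print(sorted(extended))
```

The original keys (`timestamp`, `transcript`) are kept unchanged, and each operation contributes exactly one additional key.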

The value of each extended field retains the original Amazon Comprehend API response. For example, the transcript_DetectEntities field looks as follows:

"transcript_DetectEntities": { "Entities": [ { "BeginOffset": number, "EndOffset": number, "Score": number, "Text": "string", "Type": "string" } ] }

For more detailed information about the Amazon Comprehend API response, refer to API Reference in the Amazon Comprehend Developer Guide.
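Because the extended field keeps the API response as is, a consumer can read the entities straight from the indexed document. The following Python sketch filters entities by confidence score; the helper name `entities_above`, the threshold, and the sample document are assumptions made for illustration.

```python
from typing import Any, Dict, List


def entities_above(document: Dict[str, Any], min_score: float,
                   field: str = "transcript") -> List[Dict[str, Any]]:
    """Return the entities whose Score is at least `min_score`."""
    response = document.get(f"{field}_DetectEntities", {})
    return [e for e in response.get("Entities", []) if e["Score"] >= min_score]


# Illustrative extended document; the values are not real Comprehend output.
extended_doc = {
    "transcript": "Jane flew to Seattle.",
    "transcript_DetectEntities": {
        "Entities": [
            {"BeginOffset": 0, "EndOffset": 4, "Score": 0.98,
             "Text": "Jane", "Type": "PERSON"},
            {"BeginOffset": 13, "EndOffset": 20, "Score": 0.61,
             "Text": "Seattle", "Type": "LOCATION"},
        ]
    },
}
confident = entities_above(extended_doc, 0.9)
print([e["Text"] for e in confident])
```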

Note that for every Amazon Comprehend operation except DetectSentiment, the original response returns an array of objects. The solution therefore maps these fields as the nested datatype so that each array object is maintained independently. For more details, refer to OpenSearch nested datatype.
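As a rough sketch of what such a mapping could look like, the following Python function builds an index mapping body in which every extended field is `nested` except the DetectSentiment field, which is a plain `object`. The function name `build_mapping` is hypothetical, and the mapping that the solution actually creates may contain more detail (for example, sub-properties), so treat this only as an illustration of the rule described above.

```python
from typing import Any, Dict, List

DEFAULT_OPERATIONS = [
    "DetectDominantLanguage",
    "DetectSentiment",
    "DetectEntities",
    "DetectSyntax",
    "DetectKeyPhrases",
]


def build_mapping(field: str, operations: List[str]) -> Dict[str, Any]:
    """Build a mapping body: nested for array responses, object for sentiment."""
    properties: Dict[str, Any] = {}
    for operation in operations:
        # DetectSentiment is the only default operation without an array result.
        datatype = "object" if operation == "DetectSentiment" else "nested"
        properties[f"{field}_{operation}"] = {"type": datatype}
    return {"mappings": {"properties": properties}}


mapping = build_mapping("transcript", DEFAULT_OPERATIONS)
print(mapping["mappings"]["properties"]["transcript_DetectEntities"])
```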

If an Amazon Comprehend operation fails, the original document is extended with the error information instead:

{ "timestamp": string, "transcript": string, "transcript_DetectDominantLanguage_Error": object, "transcript_DetectSentiment_Error": object, "transcript_DetectEntities_Error": object, "transcript_DetectKeyPhrases_Error": object, "transcript_DetectSyntax_Error": object }

The following shows the syntax of an error object:

"transcript_DetectDominantLanguage_Error": { "statusCode": string, "errorCode": string, "errorMessage": string, "requestId": string }
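To illustrate how such an error object might be produced, the following Python sketch converts an error response shaped like the `response` attribute of a botocore `ClientError` into the error field. How the solution actually populates these four keys is not documented here, so the key mapping below and the function name `error_field` are assumptions.

```python
from typing import Any, Dict


def error_field(field: str, operation: str,
                error_response: Dict[str, Any]) -> Dict[str, Any]:
    """Build the "<field>_<operation>_Error" entry from an error response.

    `error_response` is assumed to follow the botocore ClientError.response
    layout: {"Error": {...}, "ResponseMetadata": {...}}.
    """
    error = error_response.get("Error", {})
    metadata = error_response.get("ResponseMetadata", {})
    return {
        f"{field}_{operation}_Error": {
            # All four values are strings, as in the syntax shown above.
            "statusCode": str(metadata.get("HTTPStatusCode", "")),
            "errorCode": str(error.get("Code", "")),
            "errorMessage": str(error.get("Message", "")),
            "requestId": str(metadata.get("RequestId", "")),
        }
    }


# Illustrative error response; the values are invented for this example.
sample_error = {
    "Error": {"Code": "TextSizeLimitExceededException",
              "Message": "Input text size exceeds limit."},
    "ResponseMetadata": {"HTTPStatusCode": 400, "RequestId": "example-request-id"},
}
result = error_field("transcript", "DetectDominantLanguage", sample_error)
print(result)
```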