Using Lambda functions for Amazon Q Business document enrichment
You can use Lambda functions to prepare your document attributes for advanced data manipulation. For example, you could use Optical Character Recognition (OCR), which interprets text from images and treats each image as a textual document. Or, you could retrieve the current date-time in a specific time zone and then insert the date-time where there's an empty value for a date field.
You can apply a basic operation first and then use a Lambda function to manipulate your data, or the reverse.
Note
Amazon Q Business can't create a target document attribute field if it isn't already created as an index field.
Lambda functions using the Amazon Q Business API
To apply a Lambda function, you specify your advanced data manipulation logic using the DocumentEnrichmentConfiguration object when you use either the BatchPutDocument API operation or the CreateDataSource operation.
Your Lambda functions must follow the mandatory request and response structures. For more information, see Data contracts for Lambda functions.
Use the following parameters to create your configuration:
- InlineDocumentEnrichmentConfiguration – Configuration information to alter document attributes during ingestion.
- PostExtractionHookConfiguration – Configuration information to invoke a Lambda function on structured documents with their metadata and text already extracted.
- PreExtractionHookConfiguration – Configuration information to invoke a Lambda function on raw documents before metadata and text have been extracted from them.
- PreExtractionHookConfiguration RoleArn – The Amazon Resource Name (ARN) of a role under PreExtractionHookConfiguration with permissions to run PreExtractionHookConfiguration and to access the Amazon S3 bucket when you use PreExtractionHookConfiguration.
- PostExtractionHookConfiguration RoleArn – The Amazon Resource Name (ARN) of a role under PostExtractionHookConfiguration with permissions to run PostExtractionHookConfiguration and to access the Amazon S3 bucket when you use PostExtractionHookConfiguration.
You can configure only one Lambda function for PreExtractionHookConfiguration and only one Lambda function for PostExtractionHookConfiguration. However, your Lambda function can invoke other functions that it requires. You can configure both PreExtractionHookConfiguration and PostExtractionHookConfiguration, or either one. Your Lambda function for PreExtractionHookConfiguration must not exceed a run time of 5 minutes. Your Lambda function for PostExtractionHookConfiguration must not exceed a run time of 1 minute.
You can configure Amazon Q Business to invoke a Lambda function only if a condition is met. For example, you can specify a condition that, if there are empty date-time values, then Amazon Q Business invokes a function that inserts the current date-time.
For more information, see the Amazon Q Business API Reference.
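The conditional invocation described above can be sketched as a plain Python dictionary. This is a minimal sketch, not a definitive request shape: the camelCase key names (`lambdaArn`, `roleArn`, `s3BucketName`, `invocationCondition`, `key`, `operator`) and the `NOT_EXISTS` operator are assumptions about how the SDK spells these fields, and the helper `build_hook_configuration` is hypothetical. Check the names against the API reference before you use them.

```python
from typing import Any, Dict, Optional


def build_hook_configuration(
    lambda_arn: str,
    role_arn: str,
    s3_bucket_name: str,
    condition_key: Optional[str] = None,
) -> Dict[str, Any]:
    """Build one hook configuration, optionally guarded by a condition.

    Hypothetical helper: the key names are assumed, not confirmed by this page.
    """
    hook: Dict[str, Any] = {
        "lambdaArn": lambda_arn,
        "roleArn": role_arn,
        "s3BucketName": s3_bucket_name,
    }
    if condition_key is not None:
        # Invoke the function only when the attribute has no value,
        # for example an empty date-time field.
        hook["invocationCondition"] = {
            "key": condition_key,
            "operator": "NOT_EXISTS",  # assumed operator name
        }
    return hook


# Only a post-extraction hook is configured here; configuring both hooks
# or only the pre-extraction hook follows the same pattern.
document_enrichment_configuration = {
    "postExtractionHookConfiguration": build_hook_configuration(
        lambda_arn="arn:aws:lambda:your-region:your-account-id:function:post-extraction-lambda-function",
        role_arn="arn:aws:iam::your-account-id:role/your-post-extraction-role",
        s3_bucket_name="bucket-name",
        condition_key="_last_updated_at",
    )
}
```

The placeholder ARNs and the role name are illustrative only. Replace them with your own values.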
Lambda functions using the Amazon Q Business console
To configure a Lambda function using the console
- Select your index, and then select Document enrichments from the navigation menu.
- To configure Lambda functions, go to Configure Lambda functions.
IAM roles for Lambda functions
When you use Lambda functions for custom document enrichment (CDE), you need an IAM role for each of the following:
- A role for PreExtractionHookConfiguration with permissions to run PreExtractionHookConfiguration and to access the Amazon S3 bucket when you use PreExtractionHookConfiguration.
- A role for PostExtractionHookConfiguration with permissions to run PostExtractionHookConfiguration and to access the Amazon S3 bucket when you use PostExtractionHookConfiguration.
Important
IAM roles for custom document enrichment (CDE) Lambda functions should belong to the same account as the account that uses the BatchPutDocument API operation or the CreateDataSource operation to configure CDE.
Both AWS Identity and Access Management (IAM) roles must have the permissions to:
- Run PreExtractionHookConfiguration and/or PostExtractionHookConfiguration. To apply advanced alterations of your document metadata and content during the ingestion process, configure a Lambda function for PreExtractionHookConfiguration and/or PostExtractionHookConfiguration.
- (Optional) If you choose to activate Server Side Encryption for your Amazon S3 bucket, you must provide permissions to use the AWS KMS key to encrypt and decrypt the objects stored in your Amazon S3 bucket.
A role policy to allow Amazon Q Business to run PreExtractionHookConfiguration with encryption for your Amazon S3 bucket.

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [
                    "arn:aws:s3:::bucket-name",
                    "arn:aws:s3:::bucket-name/*"
                ],
                "Effect": "Allow"
            },
            {
                "Action": ["s3:ListBucket"],
                "Resource": ["arn:aws:s3:::bucket-name"],
                "Effect": "Allow"
            },
            {
                "Effect": "Allow",
                "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
                "Resource": ["arn:aws:kms:your-region:your-account-id:key/key-id"]
            },
            {
                "Effect": "Allow",
                "Action": ["lambda:InvokeFunction"],
                "Resource": "arn:aws:lambda:your-region:your-account-id:function:pre-extraction-lambda-function"
            }
        ]
    }
A role policy to allow Amazon Q Business to run PreExtractionHookConfiguration without encryption.

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [
                    "arn:aws:s3:::bucket-name",
                    "arn:aws:s3:::bucket-name/*"
                ],
                "Effect": "Allow"
            },
            {
                "Action": ["s3:ListBucket"],
                "Resource": ["arn:aws:s3:::bucket-name"],
                "Effect": "Allow"
            },
            {
                "Effect": "Allow",
                "Action": ["lambda:InvokeFunction"],
                "Resource": "arn:aws:lambda:your-region:your-account-id:function:pre-extraction-lambda-function"
            }
        ]
    }
A role policy to allow Amazon Q Business to run PostExtractionHookConfiguration with encryption for your Amazon S3 bucket.

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [
                    "arn:aws:s3:::bucket-name",
                    "arn:aws:s3:::bucket-name/*"
                ],
                "Effect": "Allow"
            },
            {
                "Action": ["s3:ListBucket"],
                "Resource": ["arn:aws:s3:::bucket-name"],
                "Effect": "Allow"
            },
            {
                "Effect": "Allow",
                "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
                "Resource": ["arn:aws:kms:your-region:your-account-id:key/key-id"]
            },
            {
                "Effect": "Allow",
                "Action": ["lambda:InvokeFunction"],
                "Resource": "arn:aws:lambda:your-region:your-account-id:function:post-extraction-lambda-function"
            }
        ]
    }
A role policy to allow Amazon Q Business to run PostExtractionHookConfiguration without encryption.

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [
                    "arn:aws:s3:::bucket-name",
                    "arn:aws:s3:::bucket-name/*"
                ],
                "Effect": "Allow"
            },
            {
                "Action": ["s3:ListBucket"],
                "Resource": ["arn:aws:s3:::bucket-name"],
                "Effect": "Allow"
            },
            {
                "Effect": "Allow",
                "Action": ["lambda:InvokeFunction"],
                "Resource": "arn:aws:lambda:your-region:your-account-id:function:post-extraction-lambda-function"
            }
        ]
    }
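The four role policies above differ only in the Lambda function ARN and in whether the AWS KMS statement is present. The following is a minimal sketch of generating them from one function, which may help keep the variants consistent. The helper name `build_hook_role_policy` and its parameters are hypothetical. The statements mirror the policies shown above.

```python
import json
from typing import Any, Dict, List, Optional


def build_hook_role_policy(
    bucket_name: str,
    lambda_function_arn: str,
    kms_key_arn: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a role policy for a pre- or post-extraction hook.

    Pass kms_key_arn only when the S3 bucket uses server-side encryption.
    """
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    statements: List[Dict[str, Any]] = [
        {
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": [bucket_arn, f"{bucket_arn}/*"],
            "Effect": "Allow",
        },
        {
            "Action": ["s3:ListBucket"],
            "Resource": [bucket_arn],
            "Effect": "Allow",
        },
    ]
    if kms_key_arn is not None:
        # Needed to encrypt and decrypt objects in an encrypted bucket.
        statements.append(
            {
                "Effect": "Allow",
                "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
                "Resource": [kms_key_arn],
            }
        )
    statements.append(
        {
            "Effect": "Allow",
            "Action": ["lambda:InvokeFunction"],
            "Resource": lambda_function_arn,
        }
    )
    return {"Version": "2012-10-17", "Statement": statements}


pre_policy_with_encryption = build_hook_role_policy(
    bucket_name="bucket-name",
    lambda_function_arn="arn:aws:lambda:your-region:your-account-id:function:pre-extraction-lambda-function",
    kms_key_arn="arn:aws:kms:your-region:your-account-id:key/key-id",
)
policy_document = json.dumps(pre_policy_with_encryption, indent=4)
```

The resulting `policy_document` string is the JSON text that you would attach to the role.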
We recommend that you include aws:SourceAccount and aws:SourceArn in the trust policy. Their inclusion limits permissions and securely checks that aws:SourceAccount and aws:SourceArn are the same values as provided in the IAM role policy for the sts:AssumeRole action. This approach prevents unauthorized entities from accessing your IAM roles and their permissions. For more information, see confused deputy problem in the IAM User Guide.
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": ["qbusiness.amazonaws.com"]
                },
                "Action": "sts:AssumeRole",
                "Condition": {
                    "StringEquals": {
                        "aws:SourceAccount": "your-account-id"
                    },
                    "StringLike": {
                        "aws:SourceArn": "arn:aws:qbusiness:your-region:your-account-id:application/<application-id>/index/<index-id>"
                    }
                }
            }
        ]
    }
Use cases for Lambda functions
This section outlines two examples of using Lambda functions.
Example 1: Extracting text from images to create textual documents
The following is an example of using a Lambda function to run OCR to interpret text from images and store this text in a field called document_image_text.
The following table shows data before advanced manipulation is applied.
_document_id | document_image |
---|---|
1 | image_1.png |
2 | image_2.png |
3 | image_3.png |
The following table shows data after advanced manipulation is applied.
_document_id | document_image | document_image_text |
---|---|---|
1 | image_1.png | Mailed survey response |
2 | image_2.png | Mailed survey response |
3 | image_3.png | Mailed survey response |
Example 2: Replacing empty values in the _last_updated_at field with the current date-time
The following is an example of using a Lambda function to insert the current date-time for empty date values. This example uses the condition that, if a date field value is null, then the value is replaced with the current date-time.
The following table shows data before advanced manipulation is applied.
_document_id | _document_body | _last_updated_at |
---|---|---|
1 | Example text | January 1, 2020 |
2 | Example text | |
3 | Example text | July 1, 2020 |
The following table shows data after advanced manipulation is applied.
_document_id | _document_body | _last_updated_at |
---|---|---|
1 | Example text | January 1, 2020 |
2 | Example text | December 1, 2021 |
3 | Example text | July 1, 2020 |
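The replacement shown in the tables above could be implemented along the following lines. This is a minimal sketch of a post-extraction handler under several assumptions: the date is written in UTC as an ISO 8601 string (the data contract only says that `dateValue` is a string), the document body is left unchanged by returning the incoming `s3ObjectKey`, and a missing attribute is treated the same as an empty one.

```python
from datetime import datetime, timezone

DATE_FIELD = "_last_updated_at"


def lambda_handler(event, context):
    # Collect the document attributes from the incoming metadata, if any.
    metadata = event.get("metadata") or {}
    attributes = metadata.get("attributes") or []

    # The date counts as present only if the attribute carries a dateValue.
    has_date = any(
        attr.get("name") == DATE_FIELD
        and (attr.get("value") or {}).get("dateValue")
        for attr in attributes
    )

    metadata_updates = []
    if not has_date:
        # Assumed format: ISO 8601 in UTC, for example 2021-12-01T00:00:00Z.
        now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        metadata_updates.append(
            {"name": DATE_FIELD, "value": {"dateValue": now}}
        )

    return {
        "version": "v0",
        # The body is not modified, so the original object key is returned.
        "s3ObjectKey": event.get("s3ObjectKey"),
        "metadataUpdates": metadata_updates,
    }
```

If you also configure an invocation condition on the date field, the `has_date` check is redundant but harmless, because it keeps the function safe to run on every document.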
Code examples of Lambda functions
The following code is an example of configuring a Lambda function for advanced data manipulation on the raw, original data.
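As one possible illustration, the sketch below assembles a BatchPutDocument request that attaches a pre-extraction hook to a raw document. It is not the original example for this section. The helper `build_batch_put_document_request` is hypothetical, and the request field names (`applicationId`, `indexId`, `documents`, `content`, `blob`, `documentEnrichmentConfiguration`, `preExtractionHookConfiguration`, `lambdaArn`, `roleArn`, `s3BucketName`) are assumptions about the SDK request shape that you should verify against the API reference. The block only builds the request and doesn't call AWS.

```python
from typing import Any, Dict


def build_batch_put_document_request(
    application_id: str,
    index_id: str,
    document_id: str,
    raw_bytes: bytes,
    pre_hook: Dict[str, Any],
) -> Dict[str, Any]:
    """Assemble keyword arguments for a BatchPutDocument call (assumed shape)."""
    return {
        "applicationId": application_id,
        "indexId": index_id,
        "documents": [
            {
                "id": document_id,
                # Raw, original content that the pre-extraction hook receives.
                "content": {"blob": raw_bytes},
                "documentEnrichmentConfiguration": {
                    "preExtractionHookConfiguration": pre_hook,
                },
            }
        ],
    }


request = build_batch_put_document_request(
    application_id="<application-id>",
    index_id="<index-id>",
    document_id="doc-1",
    raw_bytes=b"Example text",
    pre_hook={
        "lambdaArn": "arn:aws:lambda:your-region:your-account-id:function:pre-extraction-lambda-function",
        "roleArn": "arn:aws:iam::your-account-id:role/your-pre-extraction-role",
        "s3BucketName": "bucket-name",
    },
)

# To send the request, you would pass it to the SDK client, for example:
#   import boto3
#   boto3.client("qbusiness").batch_put_document(**request)
```

Keeping the request construction separate from the SDK call makes the configuration easy to inspect and test before any document is ingested.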
Data contracts for Lambda functions
Lambda functions for advanced data manipulation interact with Amazon Q Business data contracts. The contracts are the mandatory request and response structures of your Lambda functions. If your Lambda functions don't follow these structures, then Amazon Q Business produces an error. Your Lambda function for PreExtractionHookConfiguration should expect the following request structure:

    {
        "version": <str>,
        "dataBlobStringEncodedInBase64": <str>, // In the case of a data blob
        "s3Bucket": <str>, // In the case of an S3 bucket
        "s3ObjectKey": <str>, // In the case of an S3 bucket
        "metadata": <Metadata>
    }
The metadata structure, which includes the DocumentAttribute structure, is as follows:

    {
        "attributes": [<DocumentAttribute>]
    }

    DocumentAttribute
    {
        "name": <str>,
        "value": <DocumentAttributeValue>
    }

    DocumentAttributeValue
    {
        "stringValue": <str>,
        "integerValue": <int>,
        "longValue": <long>,
        "stringListValue": list<str>,
        "dateValue": <str>
    }
Your Lambda function for PreExtractionHookConfiguration must adhere to the following response structure:

    {
        "version": <str>,
        "dataBlobStringEncodedInBase64": <str>, // In the case of a data blob
        "s3ObjectKey": <str>, // In the case of an S3 bucket
        "metadataUpdates": [<DocumentAttribute>]
    }
Your Lambda function for PostExtractionHookConfiguration should expect the following request structure:

    {
        "version": <str>,
        "s3Bucket": <str>,
        "s3ObjectKey": <str>,
        "metadata": <Metadata>
    }
Your Lambda function for PostExtractionHookConfiguration must adhere to the following response structure:

    PostExtractionHookConfiguration Lambda Response
    {
        "version": <str>,
        "s3ObjectKey": <str>,
        "metadataUpdates": [<DocumentAttribute>]
    }
Amazon Q Business uploads your structured document to the specified Amazon S3 bucket. The structured document follows this format:

    QBusiness document
    {
        "textContent": <TextContent>
    }

    TextContent
    {
        "documentBodyText": <str>
    }
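Because a response that doesn't follow the contract produces an error, it can help to check the shape locally before you deploy. The following is a minimal sketch of such a check. The function `validate_hook_response` is hypothetical, and two of its rules are assumptions rather than documented behavior: a missing `metadataUpdates` list is treated as empty, and each update must carry at least one of the value keys listed in DocumentAttributeValue.

```python
from typing import Any, Dict, List

VALUE_KEYS = {
    "stringValue",
    "integerValue",
    "longValue",
    "stringListValue",
    "dateValue",
}


def validate_hook_response(response: Dict[str, Any], hook: str) -> List[str]:
    """Return a list of problems; an empty list means the shape looks valid.

    hook is "pre" for PreExtractionHookConfiguration responses and
    "post" for PostExtractionHookConfiguration responses.
    """
    problems: List[str] = []
    if not isinstance(response.get("version"), str):
        problems.append("version must be a string")

    has_blob = isinstance(response.get("dataBlobStringEncodedInBase64"), str)
    has_key = isinstance(response.get("s3ObjectKey"), str)
    if hook == "pre" and not (has_blob or has_key):
        problems.append(
            "pre-extraction response needs dataBlobStringEncodedInBase64 or s3ObjectKey"
        )
    if hook == "post" and not has_key:
        problems.append("post-extraction response needs s3ObjectKey")

    for index, update in enumerate(response.get("metadataUpdates", [])):
        if not isinstance(update.get("name"), str):
            problems.append(f"metadataUpdates[{index}].name must be a string")
        value = update.get("value")
        if not isinstance(value, dict) or not VALUE_KEYS.intersection(value):
            problems.append(
                f"metadataUpdates[{index}].value needs a recognized value key"
            )
    return problems
```

A unit test for your Lambda function could call the handler with a sample event and assert that `validate_hook_response` returns an empty list.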
Examples of Lambda functions that adhere to data contracts
This section provides examples of how to structure your Lambda functions that adhere to Amazon Q Business data contracts.
Example 1: A Lambda function that applies advanced manipulation to raw documents
The following Python code is an example of a Lambda function that applies advanced manipulation of the metadata fields _authors and _document_title, and of the body content, on the raw or original documents.
The following code example shows the case of the body content residing in an Amazon S3 bucket.

    import json
    import boto3

    s3 = boto3.client("s3")

    # Lambda function for advanced data manipulation
    def lambda_handler(event, context):
        # Get the value of the "s3Bucket" key from the given event input
        s3_bucket = event.get("s3Bucket")
        # Get the value of the "s3ObjectKey" key from the given event input
        s3_object_key = event.get("s3ObjectKey")

        # Read the raw document body from Amazon S3 and prepend a marker
        content_object_before_DE = s3.get_object(Bucket=s3_bucket, Key=s3_object_key)
        content_before_DE = content_object_before_DE["Body"].read().decode("utf-8")
        content_after_DE = "DEInvolved " + content_before_DE

        # Get the value of the "metadata" key from the given event input
        metadata = event.get("metadata")
        # Get the document "attributes" from the metadata
        document_attributes = metadata.get("attributes")

        # Write the modified body to a new object in the same bucket
        s3.put_object(Bucket=s3_bucket, Key="dummy_updated_qbusiness_document", Body=json.dumps(content_after_DE))

        return {
            "version": "v0",
            "s3ObjectKey": "dummy_updated_qbusiness_document",
            "metadataUpdates": [
                {"name": "_document_title", "value": {"stringValue": "title_from_pre_extraction_lambda"}},
                {"name": "_authors", "value": {"stringListValue": ["author1", "author2"]}}
            ]
        }
Example 2: A Lambda function that applies advanced manipulation to structured or parsed documents
The following Python code is an example of a Lambda function that applies advanced manipulation of the metadata fields _authors and _document_title, and of the body content, on the structured or parsed documents.

    import json
    import boto3

    s3 = boto3.client("s3")

    # Lambda function for advanced data manipulation
    def lambda_handler(event, context):
        # Get the value of the "s3Bucket" key from the given event input
        s3_bucket = event.get("s3Bucket")
        # Get the value of the "s3ObjectKey" key from the given event input
        s3_key = event.get("s3ObjectKey")
        # Get the value of the "metadata" key from the given event input
        metadata = event.get("metadata")
        # Get the document "attributes" from the metadata
        document_attributes = metadata.get("attributes")

        # Load the structured document that Amazon Q Business uploaded to Amazon S3
        qbusiness_document_object = s3.get_object(Bucket=s3_bucket, Key=s3_key)
        qbusiness_document_string = qbusiness_document_object["Body"].read().decode("utf-8")
        qbusiness_document = json.loads(qbusiness_document_string)

        # Replace the document body and write the result to a new object
        qbusiness_document["textContent"]["documentBodyText"] = "Changing document body to a short sentence."
        s3.put_object(Bucket=s3_bucket, Key="dummy_updated_qbusiness_document", Body=json.dumps(qbusiness_document))

        return {
            "version": "v0",
            "s3ObjectKey": "dummy_updated_qbusiness_document",
            "metadataUpdates": [
                {"name": "_document_title", "value": {"stringValue": "title_from_post_extraction_lambda"}},
                {"name": "_authors", "value": {"stringListValue": ["author1", "author2"]}}
            ]
        }
Example 3: Body content residing in a data blob

    import base64

    # Lambda function for advanced data manipulation
    def lambda_handler(event, context):
        # Get the value of the "dataBlobStringEncodedInBase64" key from the given event input
        data_blob_string_encoded_in_base64 = event.get("dataBlobStringEncodedInBase64")
        # Decode the Base64 data blob into a UTF-8 string
        data_blob_string = base64.b64decode(data_blob_string_encoded_in_base64).decode("utf-8")

        # Get the value of the "metadata" key from the given event input
        metadata = event.get("metadata")
        # Get the document "attributes" from the metadata
        document_attributes = metadata.get("attributes")

        # Build the replacement body and return it Base64-encoded
        new_data_blob = "This should be the modified data in the document by pre processing lambda ".encode("utf-8")

        return {
            "version": "v0",
            "dataBlobStringEncodedInBase64": base64.b64encode(new_data_blob).decode("utf-8"),
            "metadataUpdates": [
                {"name": "_document_title", "value": {"stringValue": "title_from_pre_extraction_lambda"}},
                {"name": "_authors", "value": {"stringListValue": ["author1", "author2"]}}
            ]
        }