Step 3: Formatting the entities analysis output as Amazon Kendra metadata
To convert the entities extracted by Amazon Comprehend to the metadata format required by an Amazon Kendra index, you run a Python 3 script. The results of the conversion are stored in the metadata folder in your Amazon S3 bucket.
For more information on Amazon Kendra metadata format and structure, see S3 document metadata.
Downloading and extracting the Amazon Comprehend output
To format the Amazon Comprehend entities analysis output, you must first download the Amazon Comprehend entities analysis output.tar.gz archive and extract the entities analysis file.
To download and extract the output (Console)
- In the Amazon Comprehend console navigation pane, choose Analysis jobs.
- Choose your entities analysis job, data-entities-analysis.
- Under Output, choose the link displayed next to Output data location. This redirects you to the output.tar.gz archive in your S3 bucket.
- In the Overview tab, choose Download.
Tip
The output of every Amazon Comprehend analysis job has the same name. Renaming your archive helps you track it more easily.
- Decompress and extract the downloaded Amazon Comprehend file to your device.
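If you prefer to script the extraction step instead of using a desktop tool, the following is a minimal Python 3 sketch that uses only the standard library. The function name extract_archive and the destination folder are illustrative assumptions, not part of the tutorial.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        if hasattr(tarfile, "data_filter"):
            # Newer Python releases support extraction filters that reject
            # unsafe members such as absolute paths.
            archive.extractall(dest, filter="data")
        else:
            archive.extractall(dest)
    return names
```

For example, extract_archive("output.tar.gz", ".") should leave the extracted output file in the current folder, assuming the archive was downloaded there.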
To download and extract the output (AWS CLI)
- To find the name of the auto-generated Amazon Comprehend folder in your S3 bucket that contains the results of the entities analysis job, use the describe-entities-detection-job command.
- From the OutputDataConfig object in your entities job description, copy the S3Uri value and save it as comprehend-S3uri in a text editor.
Note
The S3Uri value has a format similar to s3://amzn-s3-demo-bucket/.../output/output.tar.gz.
- To download the entities output archive, use the copy command.
- In a terminal window, extract the entities output from the downloaded archive.
At the end of this step, you should have a file on your device called output with a list of Amazon Comprehend identified entities.
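The command-line steps above (describe the job, copy the S3Uri value, download the archive) can also be sketched in Python 3. This is a hedged illustration, not the tutorial's own method: the helper names split_s3_uri and download_entities_archive are hypothetical, and the comprehend and s3 arguments are assumed to be Boto3 clients whose describe_entities_detection_job and download_file calls behave as documented.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/key' into a (bucket, key) tuple."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def download_entities_archive(comprehend, s3, job_id, local_path="output.tar.gz"):
    """Look up the job's output S3Uri and download the archive to local_path.

    comprehend and s3 are assumed to be Boto3 clients (or compatible objects).
    """
    job = comprehend.describe_entities_detection_job(JobId=job_id)
    s3_uri = job["EntitiesDetectionJobProperties"]["OutputDataConfig"]["S3Uri"]
    bucket, key = split_s3_uri(s3_uri)
    s3.download_file(bucket, key, local_path)
    return s3_uri
```

With Boto3 installed and credentials configured, the clients could be created with boto3.client("comprehend") and boto3.client("s3") in the same Region as the analysis job. Passing the clients in as arguments keeps the lookup logic easy to check without network access.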
Uploading the output into the S3 bucket
After downloading and extracting the Amazon Comprehend entities analysis file, you upload the extracted output file to your Amazon S3 bucket.
- Open the Amazon S3 console at https://console.aws.amazon.com/s3/.
- In Buckets, choose the name of your bucket, and then choose Upload.
- In Files and folders, choose Add files.
- In the dialog box, navigate to your extracted output file on your device, select it, and choose Open.
- Keep the default settings for Destination, Permissions, and Properties.
- Choose Upload.
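The console upload above could also be done programmatically. The following Python 3 sketch assumes that s3 is a Boto3 S3 client and that the extracted file should keep its local name (output) as the object key at the top level of the bucket. The function name upload_output_file is hypothetical.

```python
from pathlib import Path


def upload_output_file(s3, bucket, local_path, key=None):
    """Upload the extracted Comprehend output file and return the object key.

    s3 is assumed to be a Boto3 S3 client (or a compatible object).
    """
    path = Path(local_path)
    if not path.is_file():
        raise FileNotFoundError(f"no such file: {path}")
    # Default to the local file name so the object lands at the bucket root.
    object_key = key or path.name
    s3.upload_file(str(path), bucket, object_key)
    return object_key
```

A call such as upload_output_file(boto3.client("s3"), "amzn-s3-demo-bucket", "output") would mirror the console steps, assuming the credentials in use are allowed to write to the bucket.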
Converting the output to Amazon Kendra metadata format
To convert the Amazon Comprehend output to Amazon Kendra metadata, you run a Python 3 script. If you are using the Console, you use AWS CloudShell for this step.
To convert the output (Console)
- Download the converter.py.zip file to your device.
- Extract the Python 3 file converter.py.
- Sign in to the AWS Management Console and make sure your AWS Region is set to the same Region as your S3 bucket and your Amazon Comprehend analysis job.
- Choose the AWS CloudShell icon, or type AWS CloudShell in the Search box on the top navigation bar, to launch an environment.
Note
When AWS CloudShell launches in a new browser window for the first time, a welcome panel displays and lists key features. The shell is ready for interaction after you close this panel and the command prompt displays.
- After the terminal is prepared, choose Actions from the navigation pane, and then choose Upload file from the menu.
- In the dialog box that opens, choose Select file, and then choose the downloaded Python 3 file converter.py from your device. Choose Upload.
- In the AWS CloudShell environment, enter the following command:
python3 converter.py
- When the shell interface prompts you to Enter the name of your S3 bucket, enter the name of your S3 bucket and press Enter.
- When the shell interface prompts you to Enter the full filepath to your Comprehend output file, enter output and press Enter.
- When the shell interface prompts you to Enter the full filepath to your metadata folder, enter metadata/ and press Enter.
Important
For the metadata to be formatted correctly, the input values in steps 8-10 must be exact.
To convert the output (AWS CLI)
- In a terminal window, download the Python 3 file converter.py (provided as the converter.py.zip archive).
- In the terminal window, extract the Python 3 file from the archive.
- Make sure that Boto3 is installed on your device.
Note
If you do not have Boto3 installed, run pip3 install boto3 to install it.
- To convert the output file, run the Python 3 script converter.py (for example, python3 converter.py).
- When the AWS CLI prompts you to Enter the name of your S3 bucket, enter the name of your S3 bucket and press Enter.
- When the AWS CLI prompts you to Enter the full filepath to your Comprehend output file, enter output and press Enter.
- When the AWS CLI prompts you to Enter the full filepath to your metadata folder, enter metadata/ and press Enter.
Important
For the metadata to be formatted correctly, the input values in steps 5-7 must be exact.
At the end of this step, the formatted metadata is deposited inside the metadata folder in your S3 bucket.
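To give a sense of what such a conversion involves, here is a rough Python 3 sketch. It is not the converter.py script, whose contents are not shown here. It assumes that the output file holds one JSON object per line with File and Entities fields (each entity having Text and Type), that each entity type becomes a string-list attribute, and that each metadata file is named after its document with a .metadata.json suffix. The function names and the max_values cap are illustrative assumptions. Other metadata fields such as DocumentId and Title are left out.

```python
import json
from collections import Counter, defaultdict


def entities_to_metadata(record, max_values=10):
    """Build a metadata dictionary from one Comprehend result record."""
    counts = defaultdict(Counter)
    for entity in record.get("Entities", []):
        # Count how often each entity text appears for each entity type.
        counts[entity["Type"]][entity["Text"]] += 1
    attributes = {
        entity_type: [text for text, _ in counter.most_common(max_values)]
        for entity_type, counter in counts.items()
    }
    return {"Attributes": attributes}


def convert_output(lines, max_values=10):
    """Map each metadata file name to the metadata for one analyzed document."""
    results = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if "File" not in record:
            continue  # skip records that do not name a source document
        name = record["File"] + ".metadata.json"
        results[name] = entities_to_metadata(record, max_values)
    return results
```

Each value returned by convert_output could then be serialized with json.dumps and written under the metadata folder. The exact key layout under that folder depends on how your documents are organized in the bucket.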
Cleaning up your Amazon S3 bucket
Because the Amazon Kendra index syncs all files stored in a bucket, we recommend that you clean up your Amazon S3 bucket to prevent redundant search results.
- Open the Amazon S3 console at https://console.aws.amazon.com/s3/.
- In Buckets, choose your bucket, and then select the Amazon Comprehend entity analysis output folder, the Amazon Comprehend entity analysis .temp file, and the extracted Amazon Comprehend output file.
- From the Overview tab, choose Delete.
- In Delete objects, choose Permanently delete objects? and enter permanently delete in the text input field.
- Choose Delete objects.
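The same cleanup could be scripted. The Python 3 sketch below assumes that s3 is a Boto3 S3 client, that the Amazon Comprehend output folder is identified by a key prefix you supply, that the temporary file has a key ending in .temp, and that the extracted file was uploaded under the key output. The function names are hypothetical. Review the selected keys before deleting anything, because the deletion is permanent.

```python
def select_cleanup_keys(keys, comprehend_prefix, output_key="output"):
    """Return the keys that belong to the Comprehend output, in listing order."""
    prefix = comprehend_prefix.rstrip("/") + "/"
    return [
        key
        for key in keys
        if key.startswith(prefix) or key.endswith(".temp") or key == output_key
    ]


def delete_keys(s3, bucket, keys, batch_size=1000):
    """Delete keys in batches; one delete_objects request accepts up to 1000 keys."""
    for start in range(0, len(keys), batch_size):
        batch = keys[start:start + batch_size]
        s3.delete_objects(
            Bucket=bucket,
            Delete={"Objects": [{"Key": key} for key in batch]},
        )
    return len(keys)
```

Keeping the selection separate from the deletion makes it possible to print and check the list of keys first, and then pass it to delete_keys once it looks right.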
At the end of this step, you have converted the Amazon Comprehend entities analysis output to Amazon Kendra metadata. You are now ready to create an Amazon Kendra index.