Step 1: Adding documents to Amazon S3
Before starting the Amazon Comprehend analysis jobs, you need to store a sample dataset of customer reviews in Amazon Simple Storage Service (Amazon S3). Amazon S3 hosts your data in containers called buckets. Amazon Comprehend can analyze documents stored in a bucket and it sends results of the analysis to a bucket. In this step, you create an S3 bucket, create input and output folders in the bucket, and upload a sample dataset to the bucket.
Prerequisites
Before you begin, review Tutorial: Analyzing insights from customer reviews with Amazon Comprehend and complete the prerequisites.
Download sample data
The following sample dataset contains Amazon reviews taken from the larger dataset "Amazon reviews - Full", which was published with the article "Character-level Convolutional Networks for Text Classification" (Xiang Zhang et al., 2015). Download the dataset to your computer.
To get the sample data
1. Download the zip file tutorial-reviews-data.zip to your computer.
2. Extract the zip file on your computer. There are two files. The file THIRD_PARTY_LICENSES.txt is the open source license for the dataset published by Xiang Zhang et al. The file amazon-reviews.csv is the dataset you analyze in the tutorial.
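As a quick sanity check after extraction, the following minimal sketch confirms that both files are present. The function name check_sample_files is a hypothetical helper introduced here for illustration; it is not part of the tutorial or the AWS CLI.

```shell
# Confirm that both files from the extracted archive exist in a directory.
# check_sample_files is an illustrative helper, not an AWS tool.
check_sample_files() {
  dir=${1:-.}          # directory to inspect; defaults to the current one
  missing=0
  for f in THIRD_PARTY_LICENSES.txt amazon-reviews.csv; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $f" >&2
      missing=1
    fi
  done
  return "$missing"    # 0 when both files are present, 1 otherwise
}
```

After extracting the archive (for example with `unzip tutorial-reviews-data.zip`, assuming the unzip utility is installed), run `check_sample_files .` in the extraction directory.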
Create an Amazon S3 bucket
After downloading the sample dataset, create an Amazon S3 bucket to store your input and output data. You can create an S3 bucket using the Amazon S3 console or the AWS Command Line Interface (AWS CLI).
In the Amazon S3 console, you create a bucket with a name that is unique in all of AWS.
To create an S3 bucket (console)
1. Sign in to the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/.
2. In Buckets, choose Create bucket.
3. For Bucket name, enter a globally unique name that describes the bucket's purpose.
4. For Region, choose the AWS Region where you want to create the bucket. The Region you choose must support Amazon Comprehend. To reduce latency, choose the AWS Region closest to your geographic location that is supported by Amazon Comprehend. For a list of Regions that support Amazon Comprehend, see the Region table in the Global Infrastructure Guide.
5. Leave the default settings for Object Ownership, Bucket settings for Block Public Access, Bucket Versioning, and Tags.
6. For Default encryption, choose Disable.
   Tip
   While this tutorial does not use encryption, you might want to use encryption when analyzing important data. For end-to-end encryption, you can encrypt your data at rest in the bucket and also when you run analysis jobs. For more information about encryption with AWS, see What is AWS Key Management Service? in the AWS Key Management Service Developer Guide.
7. Review your bucket configurations and then choose Create bucket.
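Besides being globally unique, a bucket name must follow the Amazon S3 naming rules. The sketch below checks a subset of those rules locally: 3 to 63 characters, only lowercase letters, digits, periods, and hyphens, a letter or digit at both ends, and no adjacent periods. The function name is_valid_bucket_name is a hypothetical helper, and the check is deliberately partial. It cannot tell whether a name is already taken; only the create request can.

```shell
# Partial local check of S3 bucket naming rules.
# is_valid_bucket_name is an illustrative helper, not an AWS tool.
is_valid_bucket_name() {
  name=$1
  len=${#name}
  # Length must be between 3 and 63 characters.
  [ "$len" -ge 3 ] && [ "$len" -le 63 ] || return 1
  case $name in
    # Only lowercase letters, digits, periods, and hyphens are allowed.
    *[!abcdefghijklmnopqrstuvwxyz0123456789.-]*) return 1 ;;
    # The name must begin and end with a letter or digit.
    [.-]*|*[.-]) return 1 ;;
    # Two adjacent periods are not allowed.
    *..*) return 1 ;;
  esac
  return 0
}

if is_valid_bucket_name amzn-s3-demo-bucket; then
  echo "name passes the local checks"
fi
```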
After opening the AWS CLI, you run the create-bucket command to create a bucket that will store the input and output data.
To create an Amazon S3 bucket (AWS CLI)
1. To create your bucket, run the following command in the AWS CLI. Replace amzn-s3-demo-bucket with a name for the bucket that is unique in all of AWS.
   aws s3api create-bucket --bucket amzn-s3-demo-bucket
   By default, the create-bucket command creates a bucket in the us-east-1 AWS Region. To create a bucket in an AWS Region other than us-east-1, add the LocationConstraint parameter to specify your Region. For example, the following command creates a bucket in the us-west-2 Region.
   aws s3api create-bucket --bucket amzn-s3-demo-bucket --region us-west-2 --create-bucket-configuration LocationConstraint=us-west-2
   Note that only certain Regions support Amazon Comprehend. For a list of Regions that support Amazon Comprehend, see the Region table in the Global Infrastructure Guide.
2. To ensure that your bucket was created successfully, run the following command. The command lists all of the S3 buckets associated with your account.
   aws s3 ls
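The Region rule in the procedure above can be captured in a small wrapper. This is a minimal sketch, not part of the tutorial: the function name create_bucket and the DRY_RUN variable are hypothetical names introduced here. The function adds LocationConstraint only when the Region is not us-east-1, and by default it prints the command instead of running it.

```shell
# Build the create-bucket command for a bucket name and Region.
# create_bucket and DRY_RUN are illustrative names, not AWS CLI features.
create_bucket() {
  bucket=$1
  region=${2:-us-east-1}
  set -- aws s3api create-bucket --bucket "$bucket" --region "$region"
  # Regions other than us-east-1 need an explicit LocationConstraint.
  if [ "$region" != "us-east-1" ]; then
    set -- "$@" --create-bucket-configuration "LocationConstraint=$region"
  fi
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"          # print the command only
  else
    "$@"               # run the command against your AWS account
  fi
}

# Dry run: prints the command for a bucket in us-west-2.
create_bucket amzn-s3-demo-bucket us-west-2
```

To actually create the bucket, set `DRY_RUN=0` before calling the function, and replace amzn-s3-demo-bucket with your own bucket name.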
(Console only) create folders
Next, create two folders in your S3 bucket. The first folder is for your input data. The second folder is where Amazon Comprehend sends the analysis results. If you use the Amazon S3 console, you have to manually create the folders. If you use the AWS CLI, you can create folders when you upload the sample dataset or run an analysis job. For that reason, we provide a procedure for creating folders only for console users. If you are using the AWS CLI, you will create folders in Upload the input data and in Step 3: Running analysis jobs on documents in Amazon S3.
To create folders in your S3 bucket (console)
1. Open the Amazon S3 console at https://console.aws.amazon.com/s3/.
2. In Buckets, choose your bucket from the list of buckets.
3. In the Overview tab, choose Create folder.
4. For the new folder name, enter input.
5. For the encryption settings, choose None (Use bucket settings).
6. Choose Save.
7. Repeat steps 3 through 6 to create another folder for the output of the analysis jobs, but in step 4, enter the new folder name output.
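For background, an S3 folder is a key prefix rather than a separate resource. When you create a folder in the console, Amazon S3 stores an empty object whose key ends with a slash. The optional sketch below shows roughly equivalent put-object calls. The tutorial does not require this step for AWS CLI users, and the function name create_folders and the DRY_RUN variable are hypothetical names introduced here. By default the commands are printed, not run.

```shell
# Print (DRY_RUN=1, the default) or run put-object calls that create
# empty placeholder objects. Each key ends with "/" so that the console
# displays it as a folder. create_folders is an illustrative helper.
create_folders() {
  bucket=$1
  shift
  for folder in "$@"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "aws s3api put-object --bucket $bucket --key $folder/"
    else
      aws s3api put-object --bucket "$bucket" --key "$folder/"
    fi
  done
}

# Dry run: prints one command for each folder.
create_folders amzn-s3-demo-bucket input output
```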
Upload the input data
Now that you have a bucket, upload the sample dataset amazon-reviews.csv.
You can upload data to S3 buckets with the Amazon S3 console or the AWS CLI.
In the Amazon S3 console, upload the sample dataset file to the input folder.
To upload the sample documents (console)
1. Open the Amazon S3 console at https://console.aws.amazon.com/s3/.
2. In Buckets, choose your bucket from the list of buckets.
3. Choose the input folder and then choose Upload.
4. Choose Add files and then choose the amazon-reviews.csv file on your computer.
5. Leave the other settings at their default values.
6. Choose Upload.
Create an input folder in your S3 bucket and upload the dataset file to the new folder with the cp command.
To upload the sample documents (AWS CLI)
1. To upload the amazon-reviews.csv file to a new folder in your bucket, run the following AWS CLI command. Replace amzn-s3-demo-bucket with the name of your bucket. By adding the path /input/ at the end, Amazon S3 automatically creates a new folder called input in your bucket and uploads the dataset file to that folder.
   aws s3 cp amazon-reviews.csv s3://amzn-s3-demo-bucket/input/
2. To ensure that your file was uploaded successfully, run the following command. The command lists the contents of your bucket's input folder.
   aws s3 ls s3://amzn-s3-demo-bucket/input/
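The upload and verification steps above can be combined into one helper. This is a minimal sketch under stated assumptions: the function name upload_dataset and the DRY_RUN variable are hypothetical names introduced here. The function checks that the local file exists before doing anything, and by default it prints the two commands instead of running them.

```shell
# Upload a local file to the input/ folder of a bucket, then list the folder.
# upload_dataset and DRY_RUN are illustrative names, not AWS CLI features.
upload_dataset() {
  file=$1
  bucket=$2
  dest="s3://$bucket/input/"
  # Stop early if the local dataset file is not where we expect it.
  if [ ! -f "$file" ]; then
    echo "error: $file not found" >&2
    return 1
  fi
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "aws s3 cp $file $dest"
    echo "aws s3 ls $dest"
  else
    # List the folder only if the copy succeeded.
    aws s3 cp "$file" "$dest" && aws s3 ls "$dest"
  fi
}
```

From the directory that contains the dataset, call `upload_dataset amazon-reviews.csv amzn-s3-demo-bucket` to preview the commands, or set `DRY_RUN=0` first to run them.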
Now, you have an S3 bucket with the amazon-reviews.csv file in a folder called input. If you used the console, you also have an output folder in the bucket. If you used the AWS CLI, you will create the output folder when running the Amazon Comprehend analysis jobs.