AWS Data Pipeline
Developer Guide (API Version 2012-10-29)

Before You Begin

Be sure to complete the following steps:

Be aware of the following:

  • Imports may overwrite data in your DynamoDB table. When you import data from Amazon S3, the import may overwrite items in your DynamoDB table. Make sure that you are importing the right data and into the right table. Be careful not to accidentally set up a recurring import pipeline that will import the same data multiple times.

  • Exports may overwrite data in your Amazon S3 bucket. When you export data to Amazon S3, you may overwrite previous exports if you write to the same bucket path. The default behavior of the Export DynamoDB to S3 template will append the job's scheduled time to the Amazon S3 bucket path, which will help you avoid this problem.

  • Import and export jobs consume some of your DynamoDB table's provisioned throughput capacity. This section explains how to schedule an import or export job using Amazon EMR. The Amazon EMR cluster consumes read capacity during exports and write capacity during imports. You can control the percentage of the provisioned capacity that the import/export jobs consume with the settings MyImportJob.myDynamoDBWriteThroughputRatio and MyExportJob.myDynamoDBReadThroughputRatio. These settings determine how much capacity to consume at the beginning of the import/export process; they do not adapt in real time if you change your table's provisioned capacity in the middle of the process.

  • Be aware of the costs. AWS Data Pipeline manages the import/export process for you, but you still pay for the underlying AWS services that are used. The import and export pipelines create Amazon EMR clusters to read and write data, and there are per-instance charges for each node in the cluster. For details, see Amazon EMR Pricing. The default cluster configuration is one m1.small instance master node and one m1.xlarge instance task node, though you can change this configuration in the pipeline definition. There are also charges for AWS Data Pipeline and Amazon S3. For more information, see AWS Data Pipeline Pricing and Amazon S3 Pricing.

  • If a table is configured for On-Demand Capacity, change the table back to provisioned capacity before running the export or import operations. On-Demand Capacity removes provisioned throughput, and the pipeline needs a provisioned value, together with the throughput ratio, to calculate how much of the DynamoDB table's capacity to use. You can use CloudWatch metrics to evaluate the aggregate throughput that the table has used and provision the throughput capacity accordingly.
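
The capacity and export-path points above can be made concrete with a small Python sketch. It is illustrative only and is not the pipeline's actual implementation. It assumes that a job targets roughly the provisioned capacity multiplied by the configured ratio, fixed when the job starts, and that the scheduled time is appended to the export path in a year-to-second layout chosen here for illustration. All function names and sample values are hypothetical. The one AWS call is boto3's `update_table`, which can switch an on-demand table back to provisioned capacity.

```python
from datetime import datetime


def target_capacity(provisioned_units: int, throughput_ratio: float) -> float:
    """Estimate the capacity units per second a job aims to consume.

    Assumption: target = provisioned capacity x ratio, computed once at
    job start and not adjusted if the table's capacity changes later.
    """
    if provisioned_units < 0 or throughput_ratio <= 0:
        raise ValueError("units must be >= 0 and ratio must be > 0")
    return provisioned_units * throughput_ratio


def export_path(base_path: str, scheduled_time: datetime) -> str:
    """Append the scheduled time to an S3 path so exports do not collide.

    Assumption: the timestamp layout is illustrative, not the template's.
    """
    return f"{base_path.rstrip('/')}/{scheduled_time:%Y-%m-%d-%H-%M-%S}"


def provisioned_mode_params(table_name: str, read_units: int,
                            write_units: int) -> dict:
    """Build UpdateTable parameters that set provisioned billing mode."""
    return {
        "TableName": table_name,
        "BillingMode": "PROVISIONED",
        "ProvisionedThroughput": {
            "ReadCapacityUnits": read_units,
            "WriteCapacityUnits": write_units,
        },
    }


def switch_to_provisioned(table_name: str, read_units: int,
                          write_units: int) -> dict:
    """Apply the parameters above; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the helpers above work without it

    client = boto3.client("dynamodb")
    return client.update_table(
        **provisioned_mode_params(table_name, read_units, write_units)
    )
```

Under these assumptions, `target_capacity(100, 0.25)` returns 25.0: a table with 100 write capacity units and a write throughput ratio of 0.25 would be targeted at about 25 units. Because the target is fixed at the start, raising the table's capacity partway through the job would not change it.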

Create a DynamoDB Table

This tutorial requires a DynamoDB table. If you already have a DynamoDB table, you can skip this procedure.

For more information, see Working with Tables in DynamoDB in the Amazon DynamoDB Developer Guide.

To create a DynamoDB table

  1. Open the DynamoDB console at

  2. Click Create Table.

  3. Enter a unique name for your table in Table Name.

  4. In the Primary Key: Partition Key field, enter Id and set its type to Number.

  5. Click Continue to skip the optional Add Indexes page.

  6. On the Provisioned Throughput Capacity page, do the following. Note that these values are small because the sample data is small. For information about calculating the required size for your own data, see Provisioned Throughput in Amazon DynamoDB in the Amazon DynamoDB Developer Guide.

    1. In Read Capacity Units, enter 5.

    2. In Write Capacity Units, enter 5.

    3. Click Continue.

  7. On the Throughput Alarms page, in Send notification to, enter your email address, and then click Continue.

  8. On the Review page, click Create.
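
As an alternative to the console steps above, the same table can be sketched programmatically. The following is a minimal example using boto3's `create_table`, assuming boto3 is installed and AWS credentials and a Region are configured. The function names and the table name passed in are hypothetical. The key name (Id, numeric) and the capacity values (5 read, 5 write) come from the procedure above. The throughput alarm step has no counterpart in this call and is omitted.

```python
def create_table_params(table_name: str) -> dict:
    """Build CreateTable parameters matching the console procedure."""
    return {
        "TableName": table_name,
        # Partition key named Id, stored as a number ("N").
        "AttributeDefinitions": [
            {"AttributeName": "Id", "AttributeType": "N"},
        ],
        "KeySchema": [
            {"AttributeName": "Id", "KeyType": "HASH"},
        ],
        # Small values, because the tutorial's sample data is small.
        "ProvisionedThroughput": {
            "ReadCapacityUnits": 5,
            "WriteCapacityUnits": 5,
        },
    }


def create_table(table_name: str) -> None:
    """Create the table and wait until it exists; requires boto3."""
    import boto3  # imported lazily so create_table_params works without it

    client = boto3.client("dynamodb")
    client.create_table(**create_table_params(table_name))
    # Block until DynamoDB reports that the table exists.
    client.get_waiter("table_exists").wait(TableName=table_name)
```

Calling `create_table("MyTutorialTable")` would create a table with a placeholder name; substitute a name that is unique in your account and Region.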
