Amazon EMR
Developer Guide

Prerequisites for Integrating Amazon EMR with DynamoDB

This documentation is for AMI versions 2.x and 3.x of Amazon EMR. For information about Amazon EMR releases 4.0.0 and above, see the Amazon EMR Release Guide. For information about managing the Amazon EMR service in 4.x releases, see the Amazon EMR Management Guide.

To use Amazon EMR and Hive to manipulate data in DynamoDB, you need the following:

  • An AWS account. If you do not have one, you can create an account by clicking Create an AWS Account on the AWS website.

  • A DynamoDB table that contains data, in the same AWS account that you use with Amazon EMR.

  • A customized version of Hive that includes connectivity to DynamoDB. The latest version of Hive provided by Amazon EMR is available by default when you launch an Amazon EMR cluster from the AWS Management Console. For more information about Amazon EMR AMIs and Hive versioning, see Specifying the Amazon EMR AMI Version and Configuring Hive in the Amazon EMR Developer Guide.

  • Support for DynamoDB connectivity. This is included in Amazon EMR AMI version 2.0.2 and later.

  • (Optional) An Amazon S3 bucket. For instructions about how to create a bucket, see the Amazon Simple Storage Service Getting Started Guide. This bucket is used as a destination when exporting DynamoDB data to Amazon S3 or as a location to store a Hive script.

  • (Optional) A Secure Shell (SSH) client application to connect to the master node of the Amazon EMR cluster and run HiveQL queries against the DynamoDB data. SSH is used to run Hive interactively. You can also save Hive commands in a text file and have Amazon EMR run the commands from that script. In this case, an SSH client is not necessary, although the ability to SSH into the master node is useful for debugging even in non-interactive clusters.

    An SSH client is available by default on most Linux, Unix, and Mac OS X installations. Windows users can install and use the PuTTY client, which has SSH support.

  • (Optional) An Amazon EC2 key pair. This is only required for interactive clusters. The key pair provides the credentials the SSH client uses to connect to the master node. If you are running the Hive commands from a script in an Amazon S3 bucket, an EC2 key pair is optional.
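
Two of the prerequisites above can be checked or prepared with a few lines of code. The following Python sketch is an illustration only, not part of Amazon EMR. It compares an AMI version string against the 2.0.2 minimum stated above, and it builds the ssh command line for an interactive session on the master node. The function names, the key file name, and the host name are hypothetical, and the login name hadoop is an assumption that is not stated on this page.

```python
# Minimum AMI version that includes DynamoDB connectivity (from the list above).
MIN_AMI_VERSION = (2, 0, 2)


def parse_version(text):
    """Turn a dotted version string such as '2.4.11' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def supports_dynamodb(ami_version):
    """Return True if the given AMI version is 2.0.2 or later.

    Versions are compared as integer tuples, so '2.10.0' sorts after
    '2.9.0', which a plain string comparison would get wrong.
    """
    return parse_version(ami_version) >= MIN_AMI_VERSION


def ssh_command(key_file, master_dns, user="hadoop"):
    """Build the ssh argument list for connecting to the master node.

    key_file is the private key of the EC2 key pair, and master_dns is the
    public DNS name of the master node. The default user name 'hadoop' is
    an assumption; adjust it if your cluster uses a different login.
    """
    return ["ssh", "-i", key_file, f"{user}@{master_dns}"]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(supports_dynamodb("3.11.0"))
    print(" ".join(ssh_command("mykey.pem", "master.example.com")))
```

The argument list returned by ssh_command can be passed to subprocess.run or joined into a single line for a terminal. Windows users who connect with PuTTY would enter the same user name and host name in the PuTTY session settings instead.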