To use Amazon EMR and Hive to manipulate data in DynamoDB, you need the following:
An Amazon Web Services account. If you do not have one, you can create an account by going to http://aws.amazon.com and clicking Create an AWS Account.
A DynamoDB table that contains data, in the same account used with Amazon EMR.
A customized version of Hive that includes connectivity to DynamoDB (Hive 0.7.1.3 or later or, if you are using the binary data type, Hive 0.8.1.5 or later).
These versions of Hive require the Amazon EMR AMI version 2.0 or later and Hadoop 0.20.205. The latest version of Hive provided by Amazon EMR is available by default when you launch an Amazon EMR cluster from the AWS Management Console or from a version of the Command Line Interface for Amazon EMR downloaded after 11 December 2011. If you launch a cluster using the AWS SDK or the API, you must explicitly set the AMI version to latest and the Hive version to 0.7.1.3 or later. For more information about Amazon EMR AMIs and Hive versioning, see Specifying the Amazon EMR AMI Version and Configuring Hive in the Amazon EMR Developer Guide.
Support for DynamoDB connectivity. This support is loaded on Amazon EMR AMI version 2.0.2 or later.
(Optional) An Amazon S3 bucket. For instructions about how to create a bucket, see Get Started With Amazon Simple Storage Service. This bucket is used as a destination when exporting DynamoDB data to Amazon S3 or as a location to store a Hive script.
(Optional) A Secure Shell (SSH) client application to connect to the master node of the Amazon EMR cluster and run HiveQL queries against the DynamoDB data. SSH is used to run Hive interactively. You can also save Hive commands in a text file and have Amazon EMR run the Hive commands from the script. In this case an SSH client is not necessary, though the ability to SSH into the master node is useful even in non-interactive clusters, for debugging purposes.
An SSH client is available by default on most Linux, Unix, and Mac OS X installations. Windows users can install and use an SSH client called PuTTY.
(Optional) An Amazon EC2 key pair. This is only required for interactive clusters. The key pair provides the credentials the SSH client uses to connect to the master node. If you are running the Hive commands from a script in an Amazon S3 bucket, an EC2 key pair is optional.
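The version and key pair requirements above can be checked before you launch a cluster. The following Python sketch is only an illustration, not part of Amazon EMR or any AWS SDK: the function names (parse_version, at_least, check_prerequisites) and their parameters are hypothetical, and the sketch assumes that versions are plain dotted numbers such as 0.7.1.3 and that an AMI version of latest always satisfies the minimum.

```python
def parse_version(text):
    """Turn a dotted version such as '0.7.1.3' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def at_least(version, minimum):
    """Compare dotted versions numerically, padding the shorter one with zeros."""
    a, b = parse_version(version), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_prerequisites(hive_version, ami_version, uses_binary=False,
                        interactive=False, has_key_pair=False):
    """Return a list of unmet prerequisites (empty when all are satisfied).

    Hypothetical helper: the thresholds come from the list above.
    """
    problems = []
    # The binary data type needs a newer Hive build than basic connectivity.
    min_hive = "0.8.1.5" if uses_binary else "0.7.1.3"
    if not at_least(hive_version, min_hive):
        problems.append(f"Hive {min_hive} or later is required")
    # DynamoDB connectivity is loaded on AMI 2.0.2 or later.
    # Assumption: the special value 'latest' always meets the minimum.
    if ami_version != "latest" and not at_least(ami_version, "2.0.2"):
        problems.append("Amazon EMR AMI 2.0.2 or later is required")
    # Only interactive (SSH) clusters need an EC2 key pair.
    if interactive and not has_key_pair:
        problems.append("an Amazon EC2 key pair is required for interactive clusters")
    return problems
```

For example, check_prerequisites("0.7.1.3", "2.0.2", uses_binary=True) reports that Hive 0.8.1.5 or later is required, while the same call without uses_binary returns an empty list.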