Export, Import, Query, and Join Tables in DynamoDB Using Amazon EMR
- Prerequisites for Integrating Amazon EMR with DynamoDB
- Step 1: Create a Key Pair
- Step 2: Create a Cluster
- Step 3: SSH into the Master Node
- Step 4: Set Up a Hive Table to Run Hive Commands
- Hive Command Examples for Exporting, Importing, and Querying Data in DynamoDB
- Optimizing Performance for Amazon EMR Operations in DynamoDB
In the following sections, you will learn how to use Amazon Elastic MapReduce (Amazon EMR) with a customized version of Hive that includes connectivity to DynamoDB to perform operations on data stored in DynamoDB, such as:
- Loading DynamoDB data into the Hadoop Distributed File System (HDFS) and using it as input into an Amazon EMR cluster.
- Querying live DynamoDB data using SQL-like statements (HiveQL).
- Joining data stored in DynamoDB and exporting it or querying against the joined data.
- Exporting data stored in DynamoDB to Amazon S3.
- Importing data stored in Amazon S3 to DynamoDB.
To perform each of the tasks above, you'll launch an Amazon EMR cluster, specify the location of the data in DynamoDB, and issue Hive commands to manipulate the data in DynamoDB.
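To make "specify the location of the data in DynamoDB, and issue Hive commands" more concrete, the following Python sketch renders the kind of HiveQL statements involved. It assumes the `DynamoDBStorageHandler` class and the `dynamodb.table.name` and `dynamodb.column.mapping` table properties provided by the Amazon EMR build of Hive. The helper functions and the table, column, and bucket names are hypothetical and exist only for illustration.

```python
def dynamodb_table_ddl(hive_table, dynamodb_table, columns):
    """Render a CREATE EXTERNAL TABLE statement that maps a Hive table to DynamoDB.

    columns is a list of (hive_column, hive_type, dynamodb_attribute) tuples.
    All names passed in are illustrative; substitute your own.
    """
    column_defs = ", ".join(f"{name} {hive_type}" for name, hive_type, _ in columns)
    mapping = ",".join(f"{name}:{attribute}" for name, _, attribute in columns)
    return (
        f"CREATE EXTERNAL TABLE {hive_table} ({column_defs})\n"
        "STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'\n"
        f'TBLPROPERTIES ("dynamodb.table.name" = "{dynamodb_table}",\n'
        f'               "dynamodb.column.mapping" = "{mapping}");'
    )


def export_to_s3_statement(hive_table, s3_path):
    """Render a statement that writes the mapped table's rows to an Amazon S3 path."""
    return f"INSERT OVERWRITE DIRECTORY '{s3_path}' SELECT * FROM {hive_table};"


if __name__ == "__main__":
    # Hypothetical table, attribute, and bucket names.
    print(dynamodb_table_ddl(
        "hive_orders",
        "Orders",
        [("order_id", "string", "OrderId"), ("total", "bigint", "Total")],
    ))
    print(export_to_s3_statement("hive_orders", "s3://example-bucket/orders/"))
```

The printed statements can be pasted into an interactive Hive session or saved to a Hive script. Generating them from Python is only a convenience of this sketch, not something Amazon EMR requires.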
DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. Developers can create a database table and grow its request traffic or storage without limit. DynamoDB automatically spreads the data and traffic for the table over a sufficient number of servers to handle the request capacity specified by the customer and the amount of data stored, while maintaining consistent, fast performance. Using Amazon EMR and Hive, you can quickly and efficiently process large amounts of data, such as data stored in DynamoDB. For more information about DynamoDB, go to the DynamoDB Developer Guide.
Apache Hive is a software layer that you can use to query MapReduce clusters using a simplified, SQL-like query language called HiveQL. It runs on top of the Hadoop architecture. For more information about Hive and HiveQL, go to the HiveQL Language Manual.
There are several ways to launch an Amazon EMR cluster: you can use the Amazon EMR console or the command line interface (CLI), or you can launch the cluster programmatically using the AWS SDK or the API. You can also choose whether to run a Hive cluster interactively or from a script. In this section, we will show you how to launch an interactive Hive cluster from the Amazon EMR console and the CLI.
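For the AWS SDK route, which this section does not walk through, a minimal Python sketch using the boto3 `run_job_flow` call might look like the following. The release label, instance type, instance count, region, cluster name, and key pair name are assumptions chosen for illustration, and the two role names are the default EMR roles, which may not exist in your account.

```python
def interactive_hive_cluster_request(name, key_pair,
                                     release_label="emr-5.36.0",
                                     instance_type="m5.xlarge",
                                     instance_count=3):
    """Build run_job_flow arguments for a long-running cluster with Hive installed.

    The default values are illustrative assumptions, not recommendations.
    """
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Hive"}],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            "Ec2KeyName": key_pair,  # key pair used later to SSH into the master node
            # Keep the cluster running so Hive can be used interactively.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(request, region="us-east-1"):
    """Submit the request with boto3 and return the new cluster ID.

    Requires boto3 and valid AWS credentials; starting a cluster incurs charges.
    """
    import boto3  # imported here so the request builder works without boto3 installed

    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]


if __name__ == "__main__":
    # Hypothetical cluster and key pair names; nothing is launched here.
    print(interactive_hive_cluster_request("hive-dynamodb-demo", "my-key-pair"))
```

Separating the request builder from the call that submits it lets you review the configuration before anything is launched.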
Using Hive interactively is a great way to test query performance and tune your application. Once you have established a set of Hive commands that will run on a regular basis, consider creating a Hive script that Amazon EMR can run for you. For more information about how to run Hive from a script, go to Submit Hive Work.
Amazon EMR read or write operations on a DynamoDB table count against your established provisioned throughput, potentially increasing the frequency of provisioned throughput exceptions. For large requests, Amazon EMR implements retries with exponential backoff to manage the request load on the DynamoDB table. Running Amazon EMR jobs concurrently with other traffic may cause you to exceed the allocated provisioned throughput level. You can monitor this by checking the ThrottledRequests metric in Amazon CloudWatch. If the request load is too high, you can relaunch the cluster and set the Read Percent Setting or Write Percent Setting to a lower value to throttle the Amazon EMR operations. For information about DynamoDB throughput settings, see Provisioned Throughput.
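Retries with exponential backoff can be illustrated with a short Python sketch. This is not the Amazon EMR implementation; it is a generic version of the technique with random jitter added, and the exception class, function names, and delay values are all hypothetical.

```python
import random
import time


class ThroughputExceeded(Exception):
    """Hypothetical stand-in for a provisioned throughput exception."""


def backoff_delays(base=0.05, cap=5.0, attempts=6):
    """Return delays that double on each retry, limited to at most cap seconds."""
    return [min(cap, base * 2 ** attempt) for attempt in range(attempts)]


def call_with_backoff(operation, base=0.05, cap=5.0, attempts=6,
                      sleep=time.sleep, rng=random.random):
    """Call operation, waiting longer after each throttled attempt.

    Each wait is the exponential delay scaled by a random factor in [0, 1)
    so that concurrent callers do not all retry at the same moment.
    """
    for delay in backoff_delays(base, cap, attempts):
        try:
            return operation()
        except ThroughputExceeded:
            sleep(delay * rng())
    # Final attempt: let the exception propagate if the table is still throttled.
    return operation()


if __name__ == "__main__":
    outcomes = iter([ThroughputExceeded(), ThroughputExceeded(), "ok"])

    def flaky_read():
        # Simulated request that is throttled twice before succeeding.
        outcome = next(outcomes)
        if isinstance(outcome, Exception):
            raise outcome
        return outcome

    print(call_with_backoff(flaky_read))
```

Backoff only spaces out retries. If throttling persists, lowering the read or write percent setting reduces the load that Amazon EMR places on the table in the first place.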