Querying and Joining Tables Using Amazon EMR
- Prerequisites for Integrating Amazon EMR with DynamoDB
- Step 1: Create a Key Pair
- Step 2: Create a Cluster
- Step 3: SSH into the Master Node
- Step 4: Set Up a Hive Table to Run Hive Commands
- Hive Command Examples for Exporting, Importing, and Querying Data in DynamoDB
- Optimizing Performance for Amazon EMR Operations in DynamoDB
- Walkthrough: Using DynamoDB and Amazon EMR
In the following sections, you will learn how to use Amazon EMR with a customized version of Hive that includes connectivity to Amazon DynamoDB to perform operations on data stored in DynamoDB, such as:
Exporting data stored in DynamoDB to Amazon S3.
Importing data in Amazon S3 to DynamoDB.
Querying live DynamoDB data using SQL-like statements (HiveQL).
Joining data stored in DynamoDB and exporting it or querying against the joined data.
Loading DynamoDB data into the Hadoop Distributed File System (HDFS) and using it as input into an Amazon EMR job flow.
To perform each of the tasks above, you'll launch an Amazon EMR job flow, specify the location of the data in DynamoDB, and issue Hive commands to manipulate the data in DynamoDB.
Amazon EMR runs Apache Hadoop on Amazon EC2 instances. Hadoop is an application that implements the map-reduce algorithm, in which a computational task is mapped to multiple computers that work in parallel to process a task. The output of these computers is reduced together onto a single computer to produce the final result. Using Amazon EMR you can quickly and efficiently process large amounts of data, such as data stored in DynamoDB. For more information about Amazon EMR, go to the Amazon EMR Developer Guide.
Apache Hive is a software layer that you can use to query map reduce job flows using a simplified, SQL-like query language called HiveQL. It runs on top of the Hadoop architecture. For more information about Hive and HiveQL, go to the HiveQL Language Manual.
There are several ways to launch an Amazon EMR job flow: you can use the AWS Management Console Amazon EMR tab, the Amazon EMR command-line interface (CLI), or you can program your job flow using the AWS SDK or the low-level DynamoDB API. You can also choose whether to run a Hive job flow interactively or from a script. In this document, we will show you how to launch an interactive Hive job flow from the console and the CLI.
Using Hive interactively is a great way to test query performance and tune your application. Once you have established a set of Hive commands that will run on a regular basis, consider creating a Hive script that Amazon EMR can run for you. For more information about how to run Hive from a script, go to How to Create a Job Flow Using Hive.
Amazon EMR read and write operations on a DynamoDB table count against your established provisioned throughput, potentially increasing the frequency of provisioned throughput exceptions. For large requests, Amazon EMR implements retries with exponential backoff to manage the request load on the DynamoDB table. Running Amazon EMR jobs concurrently with other traffic may cause you to exceed the allocated provisioned throughput level. You can monitor this by checking the ThrottleRequests metric in CloudWatch. If the request load is too high, you can relaunch the job flow and set Read Percent Setting and Write Percent Setting to lower values to throttle the Amazon EMR read and write operations. For information about DynamoDB throughput settings, see Specifying Read and Write Requirements for Tables.
The integration of DynamoDB with Amazon EMR does not currently support Binary and Binary Set type attributes.