For Hive to run on Amazon EMR, you must create a cluster with Hive enabled. This sets up the necessary applications and infrastructure for Hive to connect to Amazon DynamoDB. The following procedures explain how to create an interactive Hive cluster from the AWS Management Console and the CLI.
Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.
This opens the Amazon EMR console, which you can use to launch and manage clusters.
Select a region from the Region drop-down box. This is the region in which you'll create the Amazon EMR cluster. To avoid cross-region data transfer charges, this should be the same region that hosts your Amazon DynamoDB data. Similarly, if you are exporting data to Amazon S3, the Amazon S3 bucket should be in the same region as both the Amazon DynamoDB data and the Amazon EMR cluster.
Click Create New Job Flow.
On the DEFINE NEW JOB FLOW page, do the following:
Give your cluster a name, such as "My Cluster".
Select which version of Hadoop to run on your cluster in Hadoop Version. You can choose to run the Amazon distribution of Hadoop or one of two MapR distributions. For more information about MapR distributions for Hadoop, see Using the MapR Distribution for Hadoop.
Select the Run your own application radio button.
In the Choose a Job Type drop-down menu, choose Hive Program.
On the SPECIFY PARAMETERS page, select the Start an Interactive Hive Session radio button.
Hive is an open-source tool that runs on top of Hadoop to provide a way to query clusters using a simplified SQL syntax. Select an interactive session to issue commands from a terminal window.
Later, once you’ve established a set of queries that you’d like to run on a regular basis, you can save your queries as a script in an Amazon S3 bucket and have Amazon EMR run them for you without an interactive session.
On the CONFIGURE EC2 INSTANCES page, set the number and type of instances to process the data in parallel.
In the Master Instance Group, for Instance Type, select m1.small for the master node. In the Core Instance Group, for Instance Count use the default value 2, and for Instance Type use the default value.
If you need more processing power, select larger options.
On the ADVANCED OPTIONS page, select the key pair you created earlier in the Amazon EC2 Key Pair drop-down menu.
Leave the rest of the settings on this page at the default values. For example, leave Amazon VPC Subnet Id at its default setting.
In the Bootstrap Actions dialog:
Select the Proceed with no Bootstrap Actions radio button.
In the Review dialog:
Review the settings for your cluster.
Click Create Job Flow.
When the confirmation window closes, your new cluster appears in the list of clusters in the Amazon EMR console with the status STARTING. If you do not see your cluster with the STARTING status, click Refresh to see the cluster.
It takes a few minutes for Amazon EMR to provision the Amazon EC2 instances for your cluster. Your cluster is ready for use when the status is WAITING.
Download the Amazon EMR Ruby command line client (CLI). If you downloaded the Amazon EMR CLI before 11 December 2011, you will need to download and install the latest version to get support for AMI versioning, Amazon EMR AMI version 2.0, and Hadoop 0.20.205.
Install the command line client and set up your credentials. For information about how to do this, go to Sign Up and Install the Command Line Interface in the Amazon EMR Developer Guide.
Use the following syntax to start a new cluster, specifying your own values for the instance size and your own cluster name for "myJobFlowName".
In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information about the Amazon EMR CLI, see the Command Line Interface Reference for Amazon EMR.
Linux, UNIX, and Mac OS X users:
./elastic-mapreduce --create --alive --num-instances 3 \
--instance-type
"myJobFlowName" \
--hive-interactive --hive-versions 0.7.1.1 \
--ami-version latest \
--hadoop-version 0.20.205
Windows users:
ruby elastic-mapreduce --create --alive --num-instances 3 --instance-type
"myJobFlowName" --hive-interactive --hive-versions 0.7.1.1 --ami-version latest --hadoop-version 0.20.205
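The syntax above can be sketched as a small shell function that assembles the full command line and prints it instead of running it, so you can check the values before launching anything. This is an illustration under stated assumptions, not the documented syntax: the `--name` flag and the `m1.small` instance type are assumptions (m1.small is borrowed from the console procedure), and `build_create_command` is a hypothetical helper name.

```shell
# Sketch only: prints the cluster-creation command; it does not launch a cluster.
# Assumptions: the --name flag and the m1.small instance type are not shown in
# the syntax above, and build_create_command is a hypothetical helper.

build_create_command() {
  # $1 = cluster name, $2 = instance type, $3 = number of instances
  printf '%s --create --alive --num-instances %s' './elastic-mapreduce' "$3"
  printf ' %s %s' '--instance-type' "$2"
  printf ' %s "%s"' '--name' "$1"
  printf ' %s' '--hive-interactive' '--hive-versions' '0.7.1.1'
  printf ' %s' '--ami-version' 'latest' '--hadoop-version' '0.20.205'
  printf '\n'
}

# Print the command for a three-instance cluster so it can be reviewed first.
build_create_command "myJobFlowName" "m1.small" 3
```

Printing the command first makes it easier to confirm the cluster name and instance size before you run it from the directory where the Amazon EMR CLI is installed.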
You must use the same account to create the Amazon EMR cluster that you used to store data in Amazon DynamoDB. This ensures that the credentials passed in by the CLI will match those required by Amazon DynamoDB.
After you create the cluster, you should wait until its status is WAITING before continuing to the next step.
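If you script this step, the wait can be expressed as a simple polling loop. The following is a minimal sketch, not part of the Amazon EMR CLI: `get_cluster_state` is a hypothetical placeholder that you would replace with whatever command you use to look up the cluster status, and `wait_for_waiting` is an illustrative helper name.

```shell
# Sketch only: poll until the cluster reports WAITING.
# get_cluster_state is a hypothetical placeholder; replace its body with the
# command you use to look up your cluster's current status.
get_cluster_state() {
  echo "WAITING"
}

wait_for_waiting() {
  # $1 = maximum number of checks, $2 = seconds to sleep between checks
  attempts=0
  while [ "$attempts" -lt "$1" ]; do
    state="$(get_cluster_state)"
    echo "Cluster state: $state"
    if [ "$state" = "WAITING" ]; then
      return 0
    fi
    attempts=$((attempts + 1))
    sleep "$2"
  done
  # The cluster did not reach WAITING within the allowed number of checks.
  return 1
}

# Check up to 60 times, 30 seconds apart, then report readiness.
wait_for_waiting 60 30 && echo "Cluster is ready."
```

Bounding the number of checks keeps the loop from running indefinitely if the cluster never reaches the WAITING status.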