

# Amazon EMR 2.x and 3.x AMI versions

**Note**  
AWS is updating the TLS configuration for all AWS API endpoints to a minimum version of TLS 1.2. Amazon EMR releases 3.10 and lower only support TLS 1.0/1.1 connections. After December 4, 2023, you won't be able to create clusters with Amazon EMR 3.10 and lower.  
If you use Amazon EMR 3.10 or lower, we recommend that you immediately test and migrate your workloads to the latest Amazon EMR release. For more information, see the [AWS Security Blog](https://aws.amazon.com/blogs/security/tls-1-2-required-for-aws-endpoints/).

Amazon EMR 2.x and 3.x releases, called *AMI versions*, are made available for pre-existing solutions that require them for compatibility reasons. We do not recommend creating new clusters or new solutions with these release versions. They lack features of newer releases and include outdated application packages.

We recommend that you build solutions using the most recent Amazon EMR release version.

The scope of differences between the 2.x and 3.x series release versions and recent Amazon EMR release versions is significant. Those differences range from how you create and configure a cluster to the ports and directory structure of applications on the cluster.

This section attempts to cover the most significant differences for Amazon EMR, as well as specific application configuration and management differences. It is not comprehensive. If you create and use clusters in the 2.x or 3.x series, you may encounter differences not covered in this section.

**Topics**
+ [Creating a cluster with earlier AMI versions of Amazon EMR](emr-3x-create.md)
+ [Installing applications with earlier AMI versions of Amazon EMR](emr-3x-install-apps.md)
+ [Customizing cluster and application configuration with earlier AMI versions of Amazon EMR](emr-3x-customizeappconfig.md)
+ [Hive application specifics for earlier AMI versions of Amazon EMR](emr-3x-hive.md)
+ [HBase application specifics for earlier AMI versions of Amazon EMR](emr-3x-hbase.md)
+ [Pig application specifics for earlier AMI versions of Amazon EMR](emr-3x-pig.md)
+ [Spark application specifics with earlier AMI versions of Amazon EMR](emr-3x-spark.md)
+ [S3DistCp utility differences with earlier AMI versions of Amazon EMR](emr-3x-s3distcp.md)

# Creating a cluster with earlier AMI versions of Amazon EMR

Amazon EMR 2.x and 3.x releases are referenced by AMI version. With Amazon EMR release 4.0.0 and later, releases are referenced by release version, using a release label such as `emr-5.11.0`. This change is most apparent when you create a cluster using the AWS CLI or programmatically.

When you use the AWS CLI to create a cluster using an AMI release version, use the `--ami-version` option, for example, `--ami-version 3.11.0`. Many options, features, and applications introduced in Amazon EMR 4.0.0 and later are not available when you specify an `--ami-version`. For more information, see [create-cluster](https://docs.aws.amazon.com/cli/latest/reference/emr/create-cluster.html) in the *AWS CLI Command Reference*. 

The following example AWS CLI command launches a cluster using an AMI version.

**Note**  
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).

```
aws emr create-cluster --name "Test cluster" --ami-version 3.11.0 \
--applications Name=Hue Name=Hive Name=Pig \
--use-default-roles --ec2-attributes KeyName=myKey \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,\
InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,\
InstanceType=m3.xlarge --bootstrap-actions Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop,\
Name="Configuring infinite JVM reuse",Args=["-m","mapred.job.reuse.jvm.num.tasks=-1"]
```

Programmatically, all Amazon EMR release versions use the `RunJobFlowRequest` action in the EMR API to create clusters. The following example Java code creates a cluster using AMI release version 3.11.0.

```
RunJobFlowRequest request = new RunJobFlowRequest()
			.withName("AmiVersion Cluster")
			.withAmiVersion("3.11.0")
			.withInstances(new JobFlowInstancesConfig()
				.withEc2KeyName("myKeyPair")
				.withInstanceCount(1)
				.withKeepJobFlowAliveWhenNoSteps(true)
				.withMasterInstanceType("m3.xlarge")
				.withSlaveInstanceType("m3.xlarge"));
```

The following `RunJobFlowRequest` call uses a release label instead:

```
RunJobFlowRequest request = new RunJobFlowRequest()
			.withName("ReleaseLabel Cluster")
			.withReleaseLabel("emr-7.12.0")
			.withInstances(new JobFlowInstancesConfig()
				.withEc2KeyName("myKeyPair")
				.withInstanceCount(1)
				.withKeepJobFlowAliveWhenNoSteps(true)
				.withMasterInstanceType("m3.xlarge")
				.withSlaveInstanceType("m3.xlarge"));
```

## Configuring cluster size


When your cluster runs, Hadoop determines the number of mapper and reducer tasks needed to process the data. Larger clusters should have more tasks for better resource use and shorter processing time. Typically, an EMR cluster keeps the same size for its entire lifetime, and you set the number of tasks when you create the cluster. When you resize a running cluster, however, the processing capacity varies during cluster execution. Therefore, instead of using a fixed number of tasks, you can vary the number of tasks over the life of the cluster. There are two configuration options to help set the ideal number of tasks:
+ `mapred.map.tasksperslot`
+ `mapred.reduce.tasksperslot`

You can set both options in the `mapred-conf.xml` file. When you submit a job to the cluster, the job client checks the current total number of map and reduce slots available clusterwide. The job client then uses the following equations to set the number of tasks: 
+ `mapred.map.tasks` = `mapred.map.tasksperslot` × map slots in the cluster
+ `mapred.reduce.tasks` = `mapred.reduce.tasksperslot` × reduce slots in the cluster

The job client only reads the `tasksperslot` parameter if the number of tasks is not configured. You can override the number of tasks at any time, either for all clusters via a bootstrap action or individually per job by adding a step to change the configuration. 
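The two equations above can be sketched as follows; the slot counts and `tasksperslot` values shown are hypothetical inputs for illustration, not defaults.

```python
def task_counts(map_tasksperslot, reduce_tasksperslot, map_slots, reduce_slots):
    """Apply the equations the job client uses when mapred.map.tasks and
    mapred.reduce.tasks are not configured explicitly."""
    mapred_map_tasks = map_tasksperslot * map_slots
    mapred_reduce_tasks = reduce_tasksperslot * reduce_slots
    return mapred_map_tasks, mapred_reduce_tasks

# Example: 2 map tasks per slot on 8 map slots, 1 reduce task per slot on 4 reduce slots.
print(task_counts(2, 1, 8, 4))  # (16, 4)
```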

Amazon EMR withstands task node failures and continues cluster execution even if a task node becomes unavailable. Amazon EMR automatically provisions additional task nodes to replace those that fail. 

You can have a different number of task nodes for each cluster step. You can also add a step to a running cluster to modify the number of task nodes. Because all steps are guaranteed to run sequentially by default, you can specify the number of running task nodes for any step. 

# Installing applications with earlier AMI versions of Amazon EMR

When using an AMI version, applications are installed in any number of ways, including using the `NewSupportedProducts` parameter for the [RunJobFlow](https://docs.aws.amazon.com/ElasticMapReduce/latest/API/API_RunJobFlow.html) action, using bootstrap actions, and using the [Step](https://docs.aws.amazon.com/ElasticMapReduce/latest/API/API_Step.html) action.

# Customizing cluster and application configuration with earlier AMI versions of Amazon EMR

Amazon EMR release version 4.0.0 introduced a simplified method of configuring applications using configuration classifications. For more information, see [Configure applications](emr-configure-apps.md). When using an AMI version, you configure applications using bootstrap actions along with arguments that you pass. For example, the `configure-hadoop` and `configure-daemons` bootstrap actions set Hadoop and YARN–specific environment properties like `--namenode-heap-size`. In more recent versions, these are configured using the `hadoop-env` and `yarn-env` configuration classifications. For bootstrap actions that perform common configurations, see the [emr-bootstrap-actions repository on Github](https://github.com/awslabs/emr-bootstrap-actions).

The following tables map bootstrap actions to configuration classifications in more recent Amazon EMR release versions.


**Hadoop**  

| Affected application file name | AMI version bootstrap action | Configuration classification | 
| --- | --- | --- | 
| core-site.xml  | configure-hadoop -c  | core-site | 
| log4j.properties  | configure-hadoop -l  | hadoop-log4j | 
| hdfs-site.xml  | configure-hadoop -s  | hdfs-site  | 
| n/a | n/a | hdfs-encryption-zones | 
| mapred-site.xml  | configure-hadoop -m  | mapred-site | 
| yarn-site.xml  | configure-hadoop -y  | yarn-site | 
| httpfs-site.xml  | configure-hadoop -t  | httpfs-site | 
| capacity-scheduler.xml  | configure-hadoop -z  | capacity-scheduler | 
| yarn-env.sh  | configure-daemons --resourcemanager-opts | yarn-env | 


**Hive**  

| Affected application file name | AMI version bootstrap action | Configuration classification | 
| --- | --- | --- | 
| hive-env.sh | n/a | hive-env | 
| hive-site.xml | hive-script --install-hive-site ${MY_HIVE_SITE_FILE} | hive-site | 
| hive-exec-log4j.properties | n/a | hive-exec-log4j | 
| hive-log4j.properties | n/a | hive-log4j | 


**EMRFS**  

| Affected application file name | AMI version bootstrap action | Configuration classification | 
| --- | --- | --- | 
| emrfs-site.xml | configure-hadoop -e | emrfs-site | 
| n/a | s3get -s s3://custom-provider.jar -d /usr/share/aws/emr/auxlib/ | emrfs-site (with new setting fs.s3.cse.encryptionMaterialsProvider.uri) | 

For a list of all classifications, see [Configure applications](emr-configure-apps.md).
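For example, a property that an AMI-version cluster would pass through the `configure-hadoop -c` bootstrap action maps to the `core-site` configuration classification in release 4.0.0 and later. The property value shown here is illustrative:

```
[
  {
    "Classification": "core-site",
    "Properties": {
      "io.file.buffer.size": "65536"
    }
  }
]
```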

## Application environment variables


When using an AMI version, a `hadoop-user-env.sh` script is used along with the `configure-daemons` bootstrap action to configure the Hadoop environment. The script includes the following actions:

```
#!/bin/bash 
export HADOOP_USER_CLASSPATH_FIRST=true; 
echo "HADOOP_CLASSPATH=/path/to/my.jar" >> /home/hadoop/conf/hadoop-user-env.sh
```

In Amazon EMR release 4.x, you do the same using the `hadoop-env` configuration classification, as shown in the following example:

```
[ 
      { 
         "Classification":"hadoop-env",
         "Properties":{ 

         },
         "Configurations":[ 
            { 
               "Classification":"export",
               "Properties":{ 
                  "HADOOP_USER_CLASSPATH_FIRST":"true",
                  "HADOOP_CLASSPATH":"/path/to/my.jar"
               }
            }
         ]
      }
   ]
```

As another example, using `configure-daemons` and passing `--namenode-heap-size=2048` and `--namenode-opts=-XX:GCTimeRatio=19` is equivalent to the following configuration classifications.

```
[ 
      { 
         "Classification":"hadoop-env",
         "Properties":{ 

         },
         "Configurations":[ 
            { 
               "Classification":"export",
               "Properties":{ 
                  "HADOOP_DATANODE_HEAPSIZE":  "2048",
           	"HADOOP_NAMENODE_OPTS":  "-XX:GCTimeRatio=19"
               }
            }
         ]
      }
   ]
```

Other application environment variables are no longer defined in `/home/hadoop/.bashrc`. Instead, they are primarily set in `/etc/default` files per component or application, such as `/etc/default/hadoop`. Wrapper scripts in `/usr/bin/` installed by application RPMs may also set additional environment variables before invoking the actual bin script.

## Service ports


When using an AMI version, some services use custom ports.


**Changes in port settings**  

| Setting | AMI version 3.x | Open-source default | 
| --- | --- | --- | 
| fs.default.name | hdfs://emrDeterminedIP:9000 | default (hdfs://emrDeterminedIP:8020)  | 
| dfs.datanode.address | 0.0.0.0:9200 | default (0.0.0.0:50010)  | 
| dfs.datanode.http.address | 0.0.0.0:9102 | default (0.0.0.0:50075)  | 
| dfs.datanode.https.address | 0.0.0.0:9402 | default (0.0.0.0:50475) | 
| dfs.datanode.ipc.address | 0.0.0.0:9201 | default (0.0.0.0:50020) | 
| dfs.http.address | 0.0.0.0:9101 | default (0.0.0.0:50070)  | 
| dfs.https.address | 0.0.0.0:9202 | default (0.0.0.0:50470)  | 
| dfs.secondary.http.address | 0.0.0.0:9104 | default (0.0.0.0:50090) | 
| yarn.nodemanager.address | 0.0.0.0:9103 | default (${yarn.nodemanager.hostname}:0)  | 
| yarn.nodemanager.localizer.address  | 0.0.0.0:9033 | default (${yarn.nodemanager.hostname}:8040) | 
| yarn.nodemanager.webapp.address | 0.0.0.0:9035 | default (${yarn.nodemanager.hostname}:8042) | 
| yarn.resourcemanager.address | emrDeterminedIP:9022 | default (${yarn.resourcemanager.hostname}:8032) | 
| yarn.resourcemanager.admin.address | emrDeterminedIP:9025 | default (${yarn.resourcemanager.hostname}:8033) | 
| yarn.resourcemanager.resource-tracker.address | emrDeterminedIP:9023 | default (${yarn.resourcemanager.hostname}:8031) | 
| yarn.resourcemanager.scheduler.address | emrDeterminedIP:9024 | default (${yarn.resourcemanager.hostname}:8030) | 
| yarn.resourcemanager.webapp.address | 0.0.0.0:9026  | default (${yarn.resourcemanager.hostname}:8088) | 
| yarn.web-proxy.address | emrDeterminedIP:9046  | default (no-value)  | 
| yarn.resourcemanager.hostname | 0.0.0.0 (default)  | emrDeterminedIP | 

**Note**  
The *emrDeterminedIP* is an IP address that is generated by Amazon EMR.

## Users


When using an AMI version, the user `hadoop` runs all processes and owns all files. In Amazon EMR release version 4.0.0 and later, users exist at the application and component level.

## Installation sequence, installed artifacts, and log file locations


When using an AMI version, application artifacts and their configuration directories are installed in the `/home/hadoop/application` directory. For example, if you installed Hive, the directory would be `/home/hadoop/hive`. In Amazon EMR release 4.0.0 and later, application artifacts are installed in the `/usr/lib/application` directory. When using an AMI version, log files are found in various places. The table below lists locations.


**Changes in log locations on Amazon S3**  

| Daemon or application | Directory location | 
| --- | --- | 
| instance-state | node/instance-id/instance-state/ | 
| hadoop-hdfs-namenode | daemons/instance-id/hadoop-hadoop-namenode.log | 
| hadoop-hdfs-datanode | daemons/instance-id/hadoop-hadoop-datanode.log | 
| hadoop-yarn (ResourceManager) | daemons/instance-id/yarn-hadoop-resourcemanager | 
| hadoop-yarn (Proxy Server) | daemons/instance-id/yarn-hadoop-proxyserver | 
| mapred-historyserver | daemons/instance-id/ | 
| httpfs | daemons/instance-id/httpfs.log | 
| hive-server | node/instance-id/hive-server/hive-server.log | 
| hive-metastore | node/instance-id/apps/hive.log | 
| Hive CLI | node/instance-id/apps/hive.log | 
| YARN applications user logs and container logs | task-attempts/ | 
| Mahout | N/A | 
| Pig | N/A | 
| spark-historyserver | N/A | 
| mapreduce job history files | jobs/ | 

## Command runner


When using an AMI version, many scripts or programs, like `/home/hadoop/contrib/streaming/hadoop-streaming.jar`, are not placed on the shell login path, so you need to specify the full path when you use a jar file such as `command-runner.jar` or `script-runner.jar` to execute the scripts. The `command-runner.jar` is located on the AMI, so there is no need to know its full URI, as was the case with `script-runner.jar`. 
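As a sketch of this difference, the following hypothetical helper builds a step definition that runs a script with `script-runner.jar` (suitable for an AWS SDK call such as `add_job_flow_steps`); note that the jar is referenced by its full Amazon S3 URI. The region, bucket, and script path are placeholders:

```python
def build_script_runner_step(region, script_uri, name="Run script"):
    """Build an EMR step that runs a script via script-runner.jar.
    On AMI versions, script-runner.jar must be referenced by its full S3 URI."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": f"s3://{region}.elasticmapreduce/libs/script-runner/script-runner.jar",
            "Args": [script_uri],
        },
    }

# Hypothetical bucket and script path for illustration.
step = build_script_runner_step("us-east-1", "s3://amzn-s3-demo-bucket/my-script.sh")
print(step["HadoopJarStep"]["Jar"])
```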

## Replication factor


The replication factor lets you configure when to start a Hadoop JVM. You can start a new Hadoop JVM for every task, which provides better task isolation, or you can share JVMs between tasks, providing lower framework overhead. If you are processing many small files, it makes sense to reuse the JVM many times to amortize the cost of start-up. However, if each task takes a long time or processes a large amount of data, then you might choose to not reuse the JVM to ensure that all memory is freed for subsequent tasks. When using an AMI version, you can customize the replication factor using the `configure-hadoop` bootstrap action to set the `mapred.job.reuse.jvm.num.tasks` property. 

The following example demonstrates setting the JVM reuse factor for infinite JVM reuse.

**Note**  
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).

```
aws emr create-cluster --name "Test cluster" --ami-version 3.11.0 \
--applications Name=Hue Name=Hive Name=Pig \
--use-default-roles --ec2-attributes KeyName=myKey \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--bootstrap-actions Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop,\
Name="Configuring infinite JVM reuse",Args=["-m","mapred.job.reuse.jvm.num.tasks=-1"]
```

# Hive application specifics for earlier AMI versions of Amazon EMR

## Log files


Using Amazon EMR AMI versions 2.x and 3.x, Hive logs are saved to `/mnt/var/log/apps/`. In order to support concurrent versions of Hive, the version of Hive that you run determines the log file name, as shown in the following table. 


| Hive version | Log file name | 
| --- | --- | 
| 0.13.1 | hive.log  Beginning with this version, Amazon EMR uses an unversioned file name, `hive.log`. Minor versions share the same log location as the major version.   | 
| 0.11.0 | `hive_0110.log`   Minor versions of Hive 0.11.0, such as 0.11.0.1, share the same log file location as Hive 0.11.0.   | 
| 0.8.1 | `hive_081.log`   Minor versions of Hive 0.8.1, such as Hive 0.8.1.1, share the same log file location as Hive 0.8.1.   | 
| 0.7.1 | `hive_07_1.log`   Minor versions of Hive 0.7.1, such as Hive 0.7.1.3 and Hive 0.7.1.4, share the same log file location as Hive 0.7.1.    | 
| 0.7 | `hive_07.log` | 
| 0.5 | `hive_05.log` | 
| 0.4 | hive.log | 

## Split input functionality


To implement split input functionality using Hive versions earlier than 0.13.1 (Amazon EMR AMI versions earlier than 3.11.0), use the following:

```
hive> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveCombineSplitsInputFormat;
hive> set mapred.min.split.size=100000000;
```

This functionality was deprecated with Hive 0.13.1. To get the same split input format functionality in Amazon EMR AMI Version 3.11.0, use the following:

```
set hive.hadoop.supports.splittable.combineinputformat=true;
```

## Thrift service ports


 Thrift is an RPC framework that defines a compact binary serialization format used to persist data structures for later analysis. Normally, Hive configures the server to operate on the following ports. 


| Hive version | Port number | 
| --- | --- | 
| Hive 0.13.1 | 10000 | 
| Hive 0.11.0 | 10004 | 
| Hive 0.8.1 | 10003 | 
| Hive 0.7.1 | 10002 | 
| Hive 0.7 | 10001 | 
| Hive 0.5 | 10000 | 

 For more information about Thrift services, see [http://wiki.apache.org/thrift/](http://wiki.apache.org/thrift/). 

## Use Hive to recover partitions


Amazon EMR includes a statement in the Hive query language that recovers the partitions of a table from table data located in Amazon S3. The following example shows this. 

```
CREATE EXTERNAL TABLE raw_impression (json string) 
PARTITIONED BY (dt string) 
LOCATION 's3://elastic-mapreduce/samples/hive-ads/tables/impressions';
ALTER TABLE raw_impression RECOVER PARTITIONS;
```

The partition directories and data must be at the location specified in the table definition and must be named according to the Hive convention: for example, `dt=2009-01-01`. 

**Note**  
After Hive 0.13.1, this capability is supported natively using `msck repair table`, and therefore `recover partitions` is not supported. For more information, see [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL).
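For example, the native statement, shown here against the `raw_impression` table from the preceding example, takes the following form:

```
hive> MSCK REPAIR TABLE raw_impression;
```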

## Pass a Hive variable to a script


To pass a variable into a Hive step using the AWS CLI, type the following command, replace *myKey* with the name of your EC2 key pair, and replace *amzn-s3-demo-bucket* with your bucket name. In this example, `SAMPLE` is a variable value preceded by the `-d` switch. This variable is defined in the Hive script as: `${SAMPLE}`.

**Note**  
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).

```
aws emr create-cluster --name "Test cluster" --ami-version 3.9 \
--applications Name=Hue Name=Hive Name=Pig \
--use-default-roles --ec2-attributes KeyName=myKey \
--instance-type m3.xlarge --instance-count 3 \
--steps Type=Hive,Name="Hive Program",ActionOnFailure=CONTINUE,\
Args=[-f,s3://elasticmapreduce/samples/hive-ads/libs/response-time-stats.q,-d,\
INPUT=s3://elasticmapreduce/samples/hive-ads/tables,-d,OUTPUT=s3://amzn-s3-demo-bucket/hive-ads/output/,\
-d,SAMPLE=s3://elasticmapreduce/samples/hive-ads/]
```
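Inside the script, Hive substitutes the value wherever `${SAMPLE}` appears. For example, a statement of the following form (illustrative, not the actual contents of `response-time-stats.q`) would resolve against the path passed with `-d`:

```
ADD JAR ${SAMPLE}libs/jsonserde.jar ;
```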

## Specify an external metastore location


The following procedure shows you how to override the default configuration values for the Hive metastore location and start a cluster using the reconfigured metastore location.

**To create a metastore located outside of the EMR cluster**

1. Create a MySQL or Aurora database using Amazon RDS.

   For information about how to create an Amazon RDS database, see [Getting started with Amazon RDS](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_GettingStarted.html).

1. Modify your security groups to allow JDBC connections between your database and the **ElasticMapReduce-Master** security group.

   For information about how to modify your security groups for access, see [Amazon RDS security groups](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Overview.RDSSecurityGroups.html) in the *Amazon RDS User Guide*.

1. Set the JDBC configuration values in `hive-site.xml`:

   1. Create a `hive-site.xml` configuration file containing the following:

      ```
      <configuration>
        <property>
          <name>javax.jdo.option.ConnectionURL</name>
          <value>jdbc:mariadb://hostname:3306/hive?createDatabaseIfNotExist=true</value>
          <description>JDBC connect string for a JDBC metastore</description>
        </property>
        <property>
          <name>javax.jdo.option.ConnectionUserName</name>
          <value>hive</value>
          <description>Username to use against metastore database</description>
        </property>
        <property>
          <name>javax.jdo.option.ConnectionPassword</name>
          <value>password</value>
          <description>Password to use against metastore database</description>
        </property>
      </configuration>
      ```

      *hostname* is the DNS address of the Amazon RDS instance running the database. *username* and *password* are the credentials for your database. For more information about connecting to MySQL and Aurora database instances, see [Connecting to a DB instance running the MySQL database engine](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_ConnectToInstance.html) and [Connecting to an Aurora DB cluster](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Aurora.Connecting.html) in the *Amazon RDS User Guide*.

      The JDBC drivers are installed by Amazon EMR. 
**Note**  
The value property should not contain any spaces or carriage returns. It should appear all on one line.

   1. Save your `hive-site.xml` file to a location on Amazon S3, such as `s3://amzn-s3-demo-bucket/hive-site.xml`.

1. Create a cluster, specifying the Amazon S3 location of the customized `hive-site.xml` file.

   The following example command demonstrates an AWS CLI command that does this.
**Note**  
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).

   ```
   aws emr create-cluster --name "Test cluster" --ami-version 3.10 \
   --applications Name=Hue Name=Hive Name=Pig \
   --use-default-roles --ec2-attributes KeyName=myKey \
   --instance-type m3.xlarge --instance-count 3 \
   --bootstrap-actions Name="Install Hive Site Configuration",\
   Path="s3://region.elasticmapreduce/libs/hive/hive-script",\
   Args=["--base-path","s3://elasticmapreduce/libs/hive","--install-hive-site",\
   "--hive-site=s3://amzn-s3-demo-bucket/hive-site.xml","--hive-versions","latest"]
   ```

## Connect to Hive using JDBC


Connecting to Hive via JDBC requires you to download the JDBC driver and install a SQL client. The following example demonstrates using SQL Workbench/J to connect to Hive using JDBC.

**To download JDBC drivers**

1. Download and extract the drivers appropriate to the versions of Hive that you want to access. The Hive version differs depending on the AMI that you choose when you create an Amazon EMR cluster.
   + Hive 0.13.1 JDBC drivers: [https://amazon-odbc-jdbc-drivers.s3.amazonaws.com/public/AmazonHiveJDBC_1.0.4.1004.zip](https://amazon-odbc-jdbc-drivers.s3.amazonaws.com/public/AmazonHiveJDBC_1.0.4.1004.zip)
   + Hive 0.11.0 JDBC drivers: [https://mvnrepository.com/artifact/org.apache.hive/hive-jdbc/0.11.0](https://mvnrepository.com/artifact/org.apache.hive/hive-jdbc/0.11.0)
   + Hive 0.8.1 JDBC drivers: [https://mvnrepository.com/artifact/org.apache.hive/hive-jdbc/0.8.1](https://mvnrepository.com/artifact/org.apache.hive/hive-jdbc/0.8.1)

1. Install SQL Workbench/J. For more information, see [Installing and starting SQL Workbench/J](http://www.sql-workbench.net/manual/install.html) in the SQL Workbench/J User's Manual.

1. Create an SSH tunnel to the cluster master node. The port for the connection differs depending on the version of Hive. Example commands are provided in the tables below for Linux `ssh` users and PuTTY commands for Windows users.  
**Linux SSH commands**    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-3x-hive.html)  
**Windows PuTTY tunnel settings**    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-3x-hive.html)

1. Add the JDBC driver to SQL Workbench.

   1. In the **Select Connection Profile** dialog box, choose **Manage Drivers**. 

   1. Choose the **Create a new entry** (blank page) icon.

   1. In the **Name** field, type **Hive JDBC**.

   1. For **Library**, click the **Select the JAR file(s)** icon.

   1. Select JAR files as shown in the following table.  
****    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-3x-hive.html)

   1. In the **Please select one driver** dialog box, select a driver according to the following table and click **OK**.    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-3x-hive.html)

1. When you return to the **Select Connection Profile** dialog box, verify that the **Driver** field is set to **Hive JDBC** and provide the JDBC connection string in the **URL** field according to the following table.    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-3x-hive.html)

   If your cluster uses AMI version 3.3.1 or later, in the **Select Connection Profile** dialog box, type **hadoop** in the **Username** field.
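As an illustration of step 6, for Hive 0.13.1 tunneled to local port 10000, the JDBC connection string typically takes the following form (shown with a hypothetical local port and default database; confirm the exact string against the table referenced above):

```
jdbc:hive2://localhost:10000/default
```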

# HBase application specifics for earlier AMI versions of Amazon EMR

## Supported HBase versions



| HBase version | AMI version | AWS CLI configuration parameters | HBase version details | 
| --- | --- | --- | --- | 
| [0.94.18](https://svn.apache.org/repos/asf/hbase/branches/0.94/CHANGES.txt) | 3.1.0 and later |  `--ami-version 3.1` `--ami-version 3.2` `--ami-version 3.3` `--applications Name=HBase`  |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-3x-hbase.html)  | 
| [0.94.7](https://svn.apache.org/repos/asf/hbase/branches/0.94/CHANGES.txt) | 3.0-3.0.4 |  `--ami-version 3.0` `--applications Name=HBase`  |  | 
| [0.92](https://svn.apache.org/repos/asf/hbase/branches/0.92/CHANGES.txt) | 2.2 and later |  `--ami-version 2.2 or later` `--applications Name=HBase`  |  | 

## HBase cluster prerequisites


A cluster created using Amazon EMR AMI versions 2.x and 3.x should meet the following requirements for HBase.
+ The AWS CLI (optional)—To interact with HBase using the command line, download and install the latest version of the AWS CLI. For more information, see [Installing the AWS Command Line Interface](https://docs.aws.amazon.com/cli/latest/userguide/installing.html) in the *AWS Command Line Interface User Guide*.
+ At least two instances (optional)—The cluster's master node runs the HBase master server and Zookeeper, and task nodes run the HBase region servers. For best performance, HBase clusters should run on at least two EC2 instances, but you can run HBase on a single node for evaluation purposes. 
+ Long-running cluster—HBase only runs on long-running clusters. By default, the CLI and Amazon EMR console create long-running clusters. 
+ An Amazon EC2 key pair set (recommended)—To use the Secure Shell (SSH) network protocol to connect with the master node and run HBase shell commands, you must use an Amazon EC2 key pair when you create the cluster. 
+ The correct AMI and Hadoop versions—HBase clusters are currently supported only on Hadoop 0.20.205 or later. 
+ Ganglia (optional)—To monitor HBase performance metrics, install Ganglia when you create the cluster. 
+ An Amazon S3 bucket for logs (optional)—The logs for HBase are available on the master node. If you'd like these logs copied to Amazon S3, specify an S3 bucket to receive log files when you create the cluster. 

## Creating a cluster with HBase


The following table lists options that are available when using the console to create a cluster with HBase using an Amazon EMR AMI release version.


| Field | Action | 
| --- | --- | 
| Restore from backup | Specify whether to pre-load the HBase cluster with data stored in Amazon S3. | 
| Backup location | Specify the URI where the backup from which to restore resides in Amazon S3.  | 
| Backup version | Optionally, specify the version name of the backup at Backup Location to use. If you leave this field blank, Amazon EMR uses the latest backup at Backup Location to populate the new HBase cluster.  | 
| Schedule Regular Backups | Specify whether to schedule automatic incremental backups. The first backup is a full backup to create a baseline for future incremental backups. | 
| Consistent backup | Specify whether the backups should be consistent. A consistent backup is one that pauses write operations during the initial backup stage to allow synchronization across nodes. Any write operations thus paused are placed in a queue and resume when synchronization completes. | 
| Backup frequency | The number of days/hours/minutes between scheduled backups. | 
| Backup location | The Amazon S3 URI where backups are stored. The backup location for each HBase cluster should be different to ensure that differential backups stay correct.  | 
| Backup start time | Specify when the first backup should occur. You can set this to now, which causes the first backup to start as soon as the cluster is running, or enter a date and time in [ISO format](http://www.w3.org/TR/NOTE-datetime). For example, 2012-06-15T20:00Z would set the start time to June 15, 2012 at 8PM UTC.  | 

The following example AWS CLI command launches a cluster with HBase and other applications:

**Note**  
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).

```
aws emr create-cluster --name "Test cluster" --ami-version 3.3 \
--applications Name=Hue Name=Hive Name=Pig Name=HBase \
--use-default-roles --ec2-attributes KeyName=myKey \
--instance-type c1.xlarge --instance-count 3 --termination-protected
```

After the connection between the Hive and HBase clusters has been made (as shown in the previous procedure), you can access the data stored on the HBase cluster by creating an external table in Hive. 

The following example, when run from the Hive prompt, creates an external table that references data stored in an HBase table called `inputTable`. You can then reference `inputTable` in Hive statements to query and modify data stored in the HBase cluster. 

**Note**  
The following example uses **protobuf-java-2.4.0a.jar** in AMI 2.3.3, but you should modify the example to match your version. To check which version of the Protocol Buffers JAR you have, run the command at the Hive command prompt: `! ls /home/hadoop/lib;`. 

```
add jar lib/emr-metrics-1.0.jar ;
add jar lib/protobuf-java-2.4.0a.jar ;

set hbase.zookeeper.quorum=ec2-107-21-163-157.compute-1.amazonaws.com ;

create external table inputTable (key string, value string)
    stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    with serdeproperties ("hbase.columns.mapping" = ":key,f1:col1")
    tblproperties ("hbase.table.name" = "t1");

select count(*) from inputTable ;
```

## Customizing HBase configuration


Although the default settings should work for most applications, you have the flexibility to modify your HBase configuration settings. To do this, run one of two bootstrap action scripts: 
+ **configure-hbase-daemons**—Configures properties of the master, regionserver, and zookeeper daemons. These properties include heap size and options to pass to the Java Virtual Machine (JVM) when the HBase daemon starts. You set these properties as arguments in the bootstrap action. This bootstrap action modifies the /home/hadoop/conf/hbase-user-env.sh configuration file on the HBase cluster. 
+ **configure-hbase**—Configures HBase site-specific settings such as the port the HBase master should bind to and the maximum number of times the client CLI client should retry an action. You can set these one-by-one, as arguments in the bootstrap action, or you can specify the location of an XML configuration file in Amazon S3. This bootstrap action modifies the /home/hadoop/conf/hbase-site.xml configuration file on the HBase cluster. 

**Note**  
These scripts, like other bootstrap actions, can only be run when the cluster is created; you cannot use them to change the configuration of an HBase cluster that is currently running. 

When you run the **configure-hbase** or **configure-hbase-daemons** bootstrap actions, the values you specify override the default values. Any values that you don't explicitly set receive the default values. 

Configuring HBase with these bootstrap actions is analogous to using bootstrap actions in Amazon EMR to configure Hadoop settings and Hadoop daemon properties. The difference is that HBase does not have per-process memory options. Instead, memory options are set using the `--daemon-opts` argument, where *daemon* is replaced by the name of the daemon to configure. 

### Configure HBase daemons


 Amazon EMR provides a bootstrap action, `s3://region.elasticmapreduce/bootstrap-actions/configure-hbase-daemons`, that you can use to change the configuration of HBase daemons, where *region* is the region into which you're launching your HBase cluster. 

To configure HBase daemons using the AWS CLI, add the `configure-hbase-daemons` bootstrap action when you launch the cluster to configure one or more HBase daemons. You can set the following properties. 


| Property | Description | 
| --- | --- | 
| hbase-master-opts | Options that control how the JVM runs the master daemon. If set, these override the default HBASE\_MASTER\_OPTS variables.  | 
| regionserver-opts | Options that control how the JVM runs the region server daemon. If set, these override the default HBASE\_REGIONSERVER\_OPTS variables. | 
| zookeeper-opts | Options that control how the JVM runs the zookeeper daemon. If set, these override the default HBASE\_ZOOKEEPER\_OPTS variables.  | 

For more information about these options, see [hbase-env.sh](https://hbase.apache.org/book.html#hbase.env.sh) in the HBase documentation. 

A bootstrap action to configure values for `zookeeper-opts` and `hbase-master-opts` is shown in the following example.

**Note**  
Linux line continuation characters (\\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).

```
aws emr create-cluster --name "Test cluster" --ami-version 3.3 \
--applications Name=Hue Name=Hive Name=Pig Name=HBase \
--use-default-roles --ec2-attributes KeyName=myKey \
--instance-type c1.xlarge --instance-count 3 --termination-protected \
--bootstrap-actions Path=s3://elasticmapreduce/bootstrap-actions/configure-hbase-daemons,\
Args=["--hbase-zookeeper-opts=-Xmx1024m -XX:GCTimeRatio=19","--hbase-master-opts=-Xmx2048m","--hbase-regionserver-opts=-Xmx4096m"]
```

### Configure HBase site settings


Amazon EMR provides a bootstrap action, `s3://elasticmapreduce/bootstrap-actions/configure-hbase`, that you can use to change the configuration of HBase. You can set configuration values one-by-one, as arguments in the bootstrap action, or you can specify the location of an XML configuration file in Amazon S3. Setting configuration values one-by-one is useful if you only need to set a few configuration settings. Setting them using an XML file is useful if you have many changes to make, or if you want to save your configuration settings for reuse. 

**Note**  
You can prefix the Amazon S3 bucket name with a region prefix, such as `s3://region.elasticmapreduce/bootstrap-actions/configure-hbase`, where *region* is the region into which you're launching your HBase cluster. 

This bootstrap action modifies the `/home/hadoop/conf/hbase-site.xml` configuration file on the HBase cluster. The bootstrap action can only be run when the HBase cluster is launched.

For more information about the HBase site settings that you can configure, see [Default configuration](http://hbase.apache.org/book.html#config.files) in the HBase documentation. 

Set the `configure-hbase` bootstrap action when you launch the HBase cluster and specify the values in `hbase-site.xml` to change.

**To specify individual HBase site settings using the AWS CLI**
+ To change the `hbase.hregion.max.filesize` setting, type the following command and replace *myKey* with the name of your Amazon EC2 key pair.
**Note**  
Linux line continuation characters (\\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).

  ```
  aws emr create-cluster --name "Test cluster" --ami-version 3.3 \
  --applications Name=Hue Name=Hive Name=Pig Name=HBase \
  --use-default-roles --ec2-attributes KeyName=myKey \
  --instance-type c1.xlarge --instance-count 3 --termination-protected \
  --bootstrap-actions Path=s3://elasticmapreduce/bootstrap-actions/configure-hbase,Args=["-s","hbase.hregion.max.filesize=52428800"]
  ```

**To specify HBase site settings with an XML file using the AWS CLI**

1. Create a custom version of `hbase-site.xml`. Your custom file must be valid XML. To reduce the chance of introducing errors, start with the default copy of `hbase-site.xml`, located on the Amazon EMR HBase master node at `/home/hadoop/conf/hbase-site.xml`, and edit a copy of that file instead of creating a file from scratch. You can give your new file a new name, or leave it as `hbase-site.xml`. 

1. Upload your custom `hbase-site.xml` file to an Amazon S3 bucket. It should have permissions set so the AWS account that launches the cluster can access the file. If the AWS account launching the cluster also owns the Amazon S3 bucket, it has access. 

1. Set the **configure-hbase** bootstrap action when you launch the HBase cluster, and include the location of your custom `hbase-site.xml` file. The following example sets the HBase site configuration values to those specified in the file `s3://amzn-s3-demo-bucket/my-hbase-site.xml`. Type the following command, replace *myKey* with the name of your EC2 key pair, and replace *amzn-s3-demo-bucket* with the name of your Amazon S3 bucket.
**Note**  
Linux line continuation characters (\\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).

   ```
   aws emr create-cluster --name "Test cluster" --ami-version 3.3 \
           --applications Name=Hue Name=Hive Name=Pig Name=HBase \
           --use-default-roles --ec2-attributes KeyName=myKey \
           --instance-type c1.xlarge --instance-count 3 --termination-protected \
           --bootstrap-actions Path=s3://elasticmapreduce/bootstrap-actions/configure-hbase,Args=["--site-config-file","s3://amzn-s3-demo-bucket/my-hbase-site.xml"]
   ```

   If you specify more than one option to customize HBase operation, you must prepend each key-value pair with a `-s` option switch, as shown in the following example:

   ```
       --bootstrap-actions Path=s3://elasticmapreduce/bootstrap-actions/configure-hbase,Args=["-s","zookeeper.session.timeout=60000","-s","hbase.hregion.max.filesize=52428800"]
   ```

With the proxy set and the SSH connection open, you can view the HBase UI by opening a browser window with http://*master-public-dns-name*:60010/master-status, where *master-public-dns-name* is the public DNS address of the master node in the HBase cluster. 

You can view the current HBase logs by using SSH to connect to the master node, and navigating to the `/mnt/var/log/hbase` directory. These logs are not available after the cluster is terminated unless you enable logging to Amazon S3 when the cluster is launched.
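
As a sketch, the connection and log listing can be combined into one command. The key pair file and DNS name below are placeholders drawn from this section's examples, and the command is only printed here rather than executed:

```shell
# Placeholders: substitute your key pair file and your master node's public DNS name.
KEY_FILE="~/myKey.pem"
MASTER_DNS="ec2-107-21-163-157.compute-1.amazonaws.com"
LOG_DIR="/mnt/var/log/hbase"

# Build the command; it is printed for illustration. Copy it to a terminal to run it.
CMD="ssh -i ${KEY_FILE} hadoop@${MASTER_DNS} ls ${LOG_DIR}"
echo "${CMD}"
```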

## Back up and restore HBase


Amazon EMR provides the ability to back up your HBase data to Amazon S3, either manually or on an automated schedule. You can perform both full and incremental backups. After you have a backed-up version of HBase data, you can restore that version to an HBase cluster. You can restore to an HBase cluster that is currently running, or launch a new cluster pre-populated with backed-up data. 

During the backup process, HBase continues to execute write commands. Although this ensures that your cluster remains available throughout the backup, there is the risk of inconsistency between the data being backed up and any write operations being executed in parallel. To understand the inconsistencies that might arise, you have to consider that HBase distributes write operations across the nodes in its cluster. If a write operation happens after a particular node is polled, that data is not included in the backup archive. You may even find that earlier writes to the HBase cluster (sent to a node that has already been polled) might not be in the backup archive, whereas later writes (sent to a node before it was polled) are included. 

If a consistent backup is required, you must pause writes to HBase during the initial portion of the backup process, while synchronization across nodes takes place. You can do this by specifying the `--consistent` parameter when requesting a backup. With this parameter, writes during this period are queued and executed as soon as the synchronization completes. You can also schedule recurring backups, which resolves any inconsistencies over time, as data that is missed on one backup pass is backed up on the following pass. 

When you back up HBase data, you should specify a different backup directory for each cluster. An easy way to do this is to use the cluster identifier as part of the path specified for the backup directory. For example, `s3://amzn-s3-demo-bucket/backups/j-3AEXXXXXX16F2`. This ensures that any future incremental backups reference the correct HBase cluster. 
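
For example, a per-cluster backup location can be derived from the cluster identifier like this (the bucket name and cluster ID are placeholder values from this section):

```shell
# Build a backup location that embeds the cluster identifier (placeholder values).
CLUSTER_ID="j-3AEXXXXXX16F2"
BACKUP_DIR="s3://amzn-s3-demo-bucket/backups/${CLUSTER_ID}"
echo "${BACKUP_DIR}"
```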

When you are ready to delete old backup files that are no longer needed, we recommend that you first do a full backup of your HBase data. This ensures that all data is preserved and provides a baseline for future incremental backups. After the full backup is done, you can navigate to the backup location and manually delete the old backup files. 

The HBase backup process uses S3DistCp for the copy operation, which has certain limitations regarding temporary file storage space. 

### Back up and restore HBase using the console


The console provides the ability to launch a new cluster and populate it with data from a previous HBase backup. It also gives you the ability to schedule periodic incremental backups of HBase data. Additional backup and restore functionality, such as the ability to restore data to an already running cluster, do manual backups, and schedule automated full backups, is available using the CLI.

**To populate a new cluster with archived HBase data using the console**

1. Navigate to the new Amazon EMR console and select **Switch to the old console** from the side navigation. For more information on what to expect when you switch to the old console, see [Using the old console](https://docs.aws.amazon.com/emr/latest/ManagementGuide/whats-new-in-console.html#console-opt-in).

1. Choose **Create cluster**.

1. In the **Software Configuration** section, for **Additional Applications**, choose **HBase** and **Configure and add**.

1. On the **Add Application** dialog box, check **Restore From Backup**. 

1. For **Backup Location**, specify the location of the backup to load into the new HBase cluster. This should be an Amazon S3 URL of the form `s3://amzn-s3-demo-bucket/backups/`. 

1. For **Backup Version**, you have the option to specify the name of a backup version to load by setting a value. If you do not set a value for **Backup Version**, Amazon EMR loads the latest backup in the specified location. 

1. Choose **Add** and proceed to create the cluster with other options as desired.

**To schedule automated backups of HBase data using the console**

1. In the **Software Configuration** section, for **Additional Applications**, choose **HBase** and **Configure and add**.

1. Choose **Schedule Regular Backups**.

1. Specify whether the backups should be consistent. A consistent backup is one that pauses write operations during the initial backup stage, while synchronization across nodes takes place. Any write operations thus paused are placed in a queue and resume when the synchronization completes. 

1. Set how often backups should occur by entering a number for **Backup Frequency** and choosing **Days**, **Hours**, or **Minutes**. The first automated backup that runs is a full backup; after that, Amazon EMR saves incremental backups based on the schedule that you specify. 

1. Specify the location in Amazon S3 where the backups should be stored. Each HBase cluster should be backed up to a separate location in Amazon S3 to ensure that incremental backups are calculated correctly. 

1. Specify when the first backup should occur by setting a value for **Backup Start Time**. You can set this to `now`, which causes the first backup to start as soon as the cluster is running, or enter a date and time in [ISO format](http://www.w3.org/TR/NOTE-datetime). For example, 2013-09-26T20:00Z sets the start time to September 26, 2013 at 8 PM UTC. 

1. Choose **Add**.

1. Proceed with creating the cluster with other options as desired.
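
The ISO-format value for **Backup Start Time** in the procedure above can be generated with GNU `date` (this sketch assumes GNU coreutils is available); here it formats the example timestamp from the procedure:

```shell
# Format a UTC timestamp in the ISO form the console expects (requires GNU date).
START_TIME=$(date -u -d '2013-09-26 20:00' +%Y-%m-%dT%H:%MZ)
echo "${START_TIME}"
```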

## Monitor HBase with CloudWatch


Amazon EMR reports three metrics to CloudWatch that you can use to monitor your HBase backups. These metrics are pushed to CloudWatch at five-minute intervals, and are provided without charge.


| Metric | Description | 
| --- | --- | 
| HBaseBackupFailed |  Whether the last backup failed. This is set to 0 by default and updated to 1 if the previous backup attempt failed. This metric is only reported for HBase clusters. Use case: Monitor HBase backups Units: *Count*  | 
| HBaseMostRecentBackupDuration |  The amount of time it took the previous backup to complete. This metric is set regardless of whether the last completed backup succeeded or failed. While the backup is ongoing, this metric returns the number of minutes after the backup started. This metric is only reported for HBase clusters. Use case: Monitor HBase Backups Units: *Minutes*  | 
| HBaseTimeSinceLastSuccessfulBackup |  The number of elapsed minutes after the last successful HBase backup started on your cluster. This metric is only reported for HBase clusters. Use case: Monitor HBase backups Units: *Minutes*  | 
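
As a sketch, the following AWS CLI command retrieves one of these metrics. The `AWS/ElasticMapReduce` namespace and `JobFlowId` dimension are assumptions based on how Amazon EMR publishes cluster metrics to CloudWatch, and the cluster ID is a placeholder; the command is printed here rather than executed:

```shell
# Placeholder cluster ID; the namespace and dimension are assumed EMR conventions.
CLUSTER_ID="j-3AEXXXXXX16F2"
CMD="aws cloudwatch get-metric-statistics --namespace AWS/ElasticMapReduce --metric-name HBaseBackupFailed --dimensions Name=JobFlowId,Value=${CLUSTER_ID} --start-time 2013-09-26T00:00:00Z --end-time 2013-09-27T00:00:00Z --period 300 --statistics Maximum"
echo "${CMD}"   # printed for illustration; run it in a shell with AWS credentials
```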

## Configure Ganglia for HBase


You configure Ganglia for HBase using the **configure-hbase-for-ganglia** bootstrap action. This bootstrap action configures HBase to publish metrics to Ganglia. 

You must configure HBase and Ganglia when you launch the cluster; Ganglia reporting cannot be added to a running cluster. 

Ganglia also stores log files on the server at `/mnt/var/log/ganglia/rrds`. If you configured your cluster to persist log files to an Amazon S3 bucket, the Ganglia log files are persisted there as well. 

To launch a cluster with Ganglia for HBase, use the **configure-hbase-for-ganglia** bootstrap action as shown in the following example.

**Note**  
Linux line continuation characters (\\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).

```
aws emr create-cluster --name "Test cluster" --ami-version 3.3 \
--applications Name=Hue Name=Hive Name=Pig Name=HBase Name=Ganglia \
--use-default-roles --ec2-attributes KeyName=myKey \
--instance-type c1.xlarge --instance-count 3 --termination-protected \
--bootstrap-actions Path=s3://elasticmapreduce/bootstrap-actions/configure-hbase-for-ganglia
```

After the cluster is launched with Ganglia configured, you can access the Ganglia graphs and reports using the graphical interface running on the master node. 

# Pig application specifics for earlier AMI versions of Amazon EMR
Pig

## Supported Pig versions


The Pig version you can add to your cluster depends on the version of the Amazon EMR AMI and the version of Hadoop you are using. The table below shows which AMI versions and versions of Hadoop are compatible with the different versions of Pig. We recommend using the latest available version of Pig to take advantage of performance enhancements and new functionality. 

When you use the API to install Pig, the default version is used unless you specify `--pig-versions` as an argument to the step that loads Pig onto the cluster during the call to [RunJobFlow](https://docs.aws.amazon.com/ElasticMapReduce/latest/API/API_RunJobFlow.html). 


| Pig version | AMI version | Configuration parameters | Pig version details | 
| --- | --- | --- | --- | 
| <a name="pig12"></a>0.12.0[Release notes](http://pig.apache.org/releases.html#14+October%2C+2013%3A+release+0.12.0+available)[Documentation](http://pig.apache.org/docs/r0.12.0/) | 3.1.0 and later |  `--ami-version 3.1` `--ami-version 3.2` `--ami-version 3.3`  |  Adds support for the following: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-3x-pig.html)  | 
| <a name="pig1111"></a>0.11.1.1[Release notes](http://pig.apache.org/releases.html#1+April%2C+2013%3A+release+0.11.1+available)[Documentation](http://pig.apache.org/docs/r0.11.1/) | 2.2 and later |  `--pig-versions 0.11.1.1` `--ami-version 2.2`  |  Improves performance of LOAD command with PigStorage if input resides in Amazon S3.  | 
| <a name="pig0111"></a>0.11.1[Release notes](http://pig.apache.org/releases.html#1+April%2C+2013%3A+release+0.11.1+available)[Documentation](http://pig.apache.org/docs/r0.11.1/) | 2.2 and later |  `--pig-versions 0.11.1` `--ami-version 2.2`  |  Adds support for JDK 7, Hadoop 2, Groovy user-defined functions, SchemaTuple optimization, new operators, and more. For more information, see [Pig 0.11.1 change log](http://svn.apache.org/repos/asf/pig/tags/release-0.11.1/CHANGES.txt).  | 
| <a name="pig0922"></a>0.9.2.2[Release notes](http://pig.apache.org/releases.html#22+January%2C+2012%3A+release+0.9.2+available)[Documentation](http://pig.apache.org/docs/r0.9.2/index.html) | 2.2 and later |  `--pig-versions 0.9.2.2` `--ami-version 2.2`  |  Adds support for Hadoop 1.0.3.  | 
| <a name="pig0921"></a>0.9.2.1[Release notes](http://pig.apache.org/releases.html#22+January%2C+2012%3A+release+0.9.2+available)[Documentation](http://pig.apache.org/docs/r0.9.2/index.html) | 2.2 and later |  `--pig-versions 0.9.2.1` `--ami-version 2.2`  |  Adds support for MapR.  | 
| <a name="pig092"></a>0.9.2[Release notes](http://pig.apache.org/releases.html#22+January%2C+2012%3A+release+0.9.2+available)[Documentation](http://pig.apache.org/docs/r0.9.2/index.html) | 2.2 and later |  `--pig-versions 0.9.2` `--ami-version 2.2`  |  Includes several performance improvements and bug fixes. For complete information about the changes for Pig 0.9.2, go to the [Pig 0.9.2 change log](http://svn.apache.org/repos/asf/pig/tags/release-0.9.2/CHANGES.txt).  | 
| <a name="pig091"></a>0.9.1[Release notes](http://pig.apache.org/releases.html#5+October%2C+2011%3A+release+0.9.1+available)[Documentation](http://pig.apache.org/docs/r0.9.1/) | 2.0 |  `--pig-versions 0.9.1` `--ami-version 2.0`  |  | 
| <a name="pig06"></a>0.6[Release notes](http://pig.apache.org/releases.html#1+March%2C+2010%3A+release+0.6.0+available) | 1.0 |  `--pig-versions 0.6` `--ami-version 1.0`  |  | 
| <a name="pig03"></a>0.3[Release notes](http://pig.apache.org/releases.html#25+June%2C+2009%3A+release+0.3.0+available) | 1.0 |  `--pig-versions 0.3` `--ami-version 1.0`  |  | 

## Pig version details


Amazon EMR supports certain Pig releases that might have additional Amazon EMR patches applied. You can configure which version of Pig to run on Amazon EMR clusters. For more information about how to do this, see [Apache Pig](emr-pig.md). The following sections describe different Pig versions and the patches applied to the versions loaded on Amazon EMR. 

### Pig patches


This section describes the custom patches applied to Pig versions available with Amazon EMR.

#### Pig 0.11.1.1 patches


The Amazon EMR version of Pig 0.11.1.1 is a maintenance release that improves performance of LOAD command with PigStorage if the input resides in Amazon S3.

#### Pig 0.11.1 patches


The Amazon EMR version of Pig 0.11.1 contains all the updates provided by the Apache Software Foundation and the cumulative Amazon EMR patches from Pig version 0.9.2.2. However, there are no new Amazon EMR-specific patches in Pig 0.11.1.

#### Pig 0.9.2 patches


Apache Pig 0.9.2 is a maintenance release of Pig. The Amazon EMR team has applied the following patches to the Amazon EMR version of Pig 0.9.2. 


| Patch | Description | 
| --- | --- | 
|  PIG-1429  |   Add the Boolean data type to Pig as a first class data type. For more information, go to [https://issues.apache.org/jira/browse/PIG-1429](https://issues.apache.org/jira/browse/PIG-1429).   **Status:** Committed   **Fixed in Apache Pig Version:** 0.10   | 
|  PIG-1824  |   Support import modules in Jython UDF. For more information, go to [https://issues.apache.org/jira/browse/PIG-1824](https://issues.apache.org/jira/browse/PIG-1824).   **Status:** Committed   **Fixed in Apache Pig Version:** 0.10   | 
|  PIG-2010  |   Bundle registered JARs on the distributed cache. For more information, go to [https://issues.apache.org/jira/browse/PIG-2010](https://issues.apache.org/jira/browse/PIG-2010).   **Status:** Committed   **Fixed in Apache Pig Version:** 0.11   | 
|  PIG-2456  |   Add a ~/.pigbootup file where the user can specify default Pig statements. For more information, go to [https://issues.apache.org/jira/browse/PIG-2456](https://issues.apache.org/jira/browse/PIG-2456).   **Status:** Committed   **Fixed in Apache Pig Version:** 0.11   | 
|  PIG-2623  |   Support using Amazon S3 paths to register UDFs. For more information, go to [https://issues.apache.org/jira/browse/PIG-2623](https://issues.apache.org/jira/browse/PIG-2623).   **Status:** Committed   **Fixed in Apache Pig Version:** 0.10, 0.11   | 

#### Pig 0.9.1 patches


The Amazon EMR team has applied the following patches to the Amazon EMR version of Pig 0.9.1. 


| Patch | Description | 
| --- | --- | 
|  Support JAR files and Pig scripts in dfs  |   Add support for running scripts and registering JAR files stored in HDFS, Amazon S3, or other distributed file systems. For more information, go to [https://issues.apache.org/jira/browse/PIG-1505](https://issues.apache.org/jira/browse/PIG-1505).   **Status:** Committed   **Fixed in Apache Pig Version:** 0.8.0   | 
|  Support multiple file systems in Pig  |   Add support for Pig scripts to read data from one file system and write it to another. For more information, go to [https://issues.apache.org/jira/browse/PIG-1564](https://issues.apache.org/jira/browse/PIG-1564).   **Status:** Not Committed   **Fixed in Apache Pig Version:** n/a   | 
|  Add Piggybank datetime and string UDFs  |   Add datetime and string UDFs to support custom Pig scripts. For more information, go to [https://issues.apache.org/jira/browse/PIG-1565](https://issues.apache.org/jira/browse/PIG-1565).   **Status:** Not Committed   **Fixed in Apache Pig Version:** n/a   | 

## Interactive and batch Pig clusters


Amazon EMR enables you to run Pig scripts in two modes:
+ Interactive
+ Batch

When you launch a long-running cluster using the console or the AWS CLI, you can connect to the master node using **ssh** as the Hadoop user and use the Grunt shell to develop and run your Pig scripts interactively. Using Pig interactively enables you to revise the Pig script more easily than batch mode. After you successfully revise the Pig script in interactive mode, you can upload the script to Amazon S3 and use batch mode to run the script in production. You can also submit Pig commands interactively on a running cluster to analyze and transform data as needed.

In batch mode, you upload your Pig script to Amazon S3, and then submit the work to the cluster as a step. Pig steps can be submitted to a long-running cluster or a transient cluster.
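
As a sketch, a Pig script in Amazon S3 could be submitted as a step with the AWS CLI `add-steps` command; the cluster ID, step name, and script location below are placeholders, and the command is printed here rather than executed:

```shell
# Placeholder cluster ID and script location.
CLUSTER_ID="j-3GYXXXXXX9IOK"
SCRIPT="s3://amzn-s3-demo-bucket/scripts/myscript.pig"

CMD="aws emr add-steps --cluster-id ${CLUSTER_ID} --steps Type=PIG,Name=PigBatchStep,Args=[-f,${SCRIPT}]"
echo "${CMD}"   # printed for illustration; run it to submit the step
```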

# Spark application specifics with earlier AMI versions of Amazon EMR
Spark

## Use Spark interactively or in batch mode


Amazon EMR enables you to run Spark applications in two modes: 
+ Interactive
+ Batch

When you launch a long-running cluster using the console or the AWS CLI, you can connect to the master node using SSH as the Hadoop user and use the Spark shell to develop and run your Spark applications interactively. Using Spark interactively enables you to prototype or test Spark applications more easily than in a batch environment. After you successfully revise the Spark application in interactive mode, you can put that application JAR or Python program on the file system local to the master node of the cluster, or in Amazon S3. You can then submit the application as a batch workflow.

In batch mode, upload your Spark script to Amazon S3 or the local master node file system, and then submit the work to the cluster as a step. Spark steps can be submitted to a long-running cluster or a transient cluster.

## Creating a cluster with Spark installed


**To launch a cluster with Spark installed using the console**

1. Navigate to the new Amazon EMR console and select **Switch to the old console** from the side navigation. For more information on what to expect when you switch to the old console, see [Using the old console](https://docs.aws.amazon.com/emr/latest/ManagementGuide/whats-new-in-console.html#console-opt-in).

1. Choose **Create cluster**.

1. For **Software Configuration**, choose the AMI release version that you require.

1.  For **Applications to be installed**, choose **Spark** from the list, then choose **Configure and add**.

1. Add arguments to change the Spark configuration as desired. For more information, see [Configure Spark](#emr-3x-spark-configure). Choose **Add**.

1.  Select other options as necessary and then choose **Create cluster**.

The following example shows how to create a cluster with Spark using Java:

```
AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(credentials);
SupportedProductConfig sparkConfig = new SupportedProductConfig()
			.withName("Spark");

RunJobFlowRequest request = new RunJobFlowRequest()
			.withName("Spark Cluster")
			.withAmiVersion("3.11.0")
			.withNewSupportedProducts(sparkConfig)
			.withInstances(new JobFlowInstancesConfig()
				.withEc2KeyName("myKeyName")
				.withInstanceCount(1)
				.withKeepJobFlowAliveWhenNoSteps(true)
				.withMasterInstanceType("m3.xlarge")
				.withSlaveInstanceType("m3.xlarge")
			);			
RunJobFlowResult result = emr.runJobFlow(request);
```

## Configure Spark


You configure Spark when you create a cluster by running the bootstrap action located at the [awslabs/emr-bootstrap-actions/spark repository on Github](https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark). For arguments that the bootstrap action accepts, see the [README](https://github.com/aws-samples/emr-bootstrap-actions/blob/master/spark/examples/README.md) in that repository. The bootstrap action configures properties in the `$SPARK_CONF_DIR/spark-defaults.conf` file. For more information about settings, see the [Spark Configuration](http://spark.apache.org/docs/latest/configuration.html) topic in the Spark documentation; you can replace `latest` in the URL with the version number of Spark that you are installing, for example, `2.2.0`.

You can also configure Spark dynamically at the time of each application submission. A setting to automatically maximize the resource allocation for an executor is available using the Spark configuration file. For more information, see [Overriding Spark default configuration settings](#emr-3x-spark-dynamic-configuration).

### Changing Spark default settings


The following example shows how to create a cluster with `spark.executor.memory` set to 2G using the AWS CLI.

**Note**  
Linux line continuation characters (\\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).

```
aws emr create-cluster --name "Spark cluster" --ami-version 3.11.0 \
--applications Name=Spark,Args=[-d,spark.executor.memory=2G] --ec2-attributes KeyName=myKey \
--instance-type m3.xlarge --instance-count 3 --use-default-roles
```

### Submit work to Spark


To submit work to a cluster, use a step to run the `spark-submit` script on your EMR cluster. Add the step using the `addJobFlowSteps` method in [AmazonElasticMapReduceClient](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/elasticmapreduce/AmazonElasticMapReduceClient.html):

```
AWSCredentials credentials = new BasicAWSCredentials(accessKey, secretKey);
AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(credentials);
StepFactory stepFactory = new StepFactory();
AddJobFlowStepsRequest req = new AddJobFlowStepsRequest();
req.withJobFlowId("j-1K48XXXXXXHCB");

List<StepConfig> stepConfigs = new ArrayList<StepConfig>();
		
StepConfig sparkStep = new StepConfig()
	.withName("Spark Step")
	.withActionOnFailure("CONTINUE")
	.withHadoopJarStep(stepFactory.newScriptRunnerStep("/home/hadoop/spark/bin/spark-submit","--class","org.apache.spark.examples.SparkPi","/home/hadoop/spark/lib/spark-examples-1.3.1-hadoop2.4.0.jar","10"));

stepConfigs.add(sparkStep);
req.withSteps(stepConfigs);
AddJobFlowStepsResult result = emr.addJobFlowSteps(req);
```

### Overriding Spark default configuration settings


You may want to override Spark default configuration values on a per-application basis. You can do this when you submit applications using a step, which essentially passes options to `spark-submit`. For example, you may wish to change the memory allocated to an executor process by changing `spark.executor.memory`. You can supply the `--executor-memory` switch with an argument like the following:

```
/home/hadoop/spark/bin/spark-submit --executor-memory 1g --class org.apache.spark.examples.SparkPi /home/hadoop/spark/lib/spark-examples*.jar 10
```

Similarly, you can tune `--executor-cores` and `--driver-memory`. In a step, you would provide the following arguments:

```
--executor-memory 1g --class org.apache.spark.examples.SparkPi /home/hadoop/spark/lib/spark-examples*.jar 10
```

You can also tune settings that may not have a built-in switch using the `--conf` option. For more information about other settings that are tunable, see the [Dynamically loading Spark properties](https://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties) topic in the Apache Spark documentation.
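For example, to set a property that has no dedicated switch, you can combine `--conf` with the built-in switches in the step arguments. The `spark.eventLog.enabled` property shown here is purely illustrative; substitute the settings your application needs:

```
--executor-memory 1g --conf spark.eventLog.enabled=false --class org.apache.spark.examples.SparkPi /home/hadoop/spark/lib/spark-examples*.jar 10
```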

# S3DistCp utility differences with earlier AMI versions of Amazon EMR
S3DistCp

## S3DistCp versions supported in Amazon EMR


The following S3DistCp versions are supported in Amazon EMR AMI releases. S3DistCp versions after 1.0.7 are found directly on the clusters. Use the JAR in `/home/hadoop/lib` for the latest features.


| Version | Description | Release date | 
| --- | --- | --- | 
| 1.0.8 | Adds the --appendToLastFile, --requirePreviousManifest, and --storageClass options. | 3 January 2014 | 
| 1.0.7 | Adds the --s3ServerSideEncryption option. | 2 May 2013 | 
| 1.0.6 | Adds the --s3Endpoint option. | 6 August 2012 | 
| 1.0.5 | Improves the ability to specify which version of S3DistCp to run. | 27 June 2012 | 
| 1.0.4 | Improves the --deleteOnSuccess option. | 19 June 2012 | 
| 1.0.3 | Adds support for the --numberFiles and --startingIndex options. | 12 June 2012 | 
| 1.0.2 | Improves file naming when using groups. | 6 June 2012 | 
| 1.0.1 | Initial release of S3DistCp. | 19 January 2012 | 

## Add an S3DistCp copy step to a cluster


To add an S3DistCp copy step to a running cluster, type the following command, replace *j-3GYXXXXXX9IOK* with your cluster ID, and replace *amzn-s3-demo-bucket* with your Amazon S3 bucket name.

**Note**  
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace them with a caret (^).

```
aws emr add-steps --cluster-id j-3GYXXXXXX9IOK \
--steps Type=CUSTOM_JAR,Name="S3DistCp step",Jar=/home/hadoop/lib/emr-s3distcp-1.0.jar,\
Args=["--s3Endpoint,s3-eu-west-1.amazonaws.com",\
"--src,s3://amzn-s3-demo-bucket/logs/j-3GYXXXXXX9IOJ/node/",\
"--dest,hdfs:///output",\
"--srcPattern,.*[a-zA-Z,]+"]
```
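The `--srcPattern` value is a Java regular expression that S3DistCp compares against the full Amazon S3 key. The following is a minimal sketch of that selection behavior, using hypothetical keys and assuming full-key matching:

```java
import java.util.regex.Pattern;

public class SrcPatternDemo {
    public static void main(String[] args) {
        // The --srcPattern expression from the step above. S3DistCp matches
        // it against the full Amazon S3 key, so Pattern.matches is used here.
        Pattern srcPattern = Pattern.compile(".*[a-zA-Z,]+");

        // Hypothetical keys, used only to illustrate the selection behavior.
        String logKey = "s3://amzn-s3-demo-bucket/logs/j-3GYXXXXXX9IOJ/node/daemons.log";
        String numericKey = "s3://amzn-s3-demo-bucket/logs/j-3GYXXXXXX9IOJ/node/state.2012-02-23";

        System.out.println(srcPattern.matcher(logKey).matches());     // true
        System.out.println(srcPattern.matcher(numericKey).matches()); // false: key ends in digits
    }
}
```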

**Example Load Amazon CloudFront logs into HDFS**  
This example loads Amazon CloudFront logs into HDFS by adding a step to a running cluster. In the process, it changes the compression format from Gzip (the CloudFront default) to LZO. This is useful because data compressed using LZO can be split into multiple maps as it is decompressed, so you don't have to wait until the compression is complete, as you do with Gzip. This provides better performance when you analyze the data using Amazon EMR. This example also improves performance by using the regular expression specified in the `--groupBy` option to combine all of the logs for a given hour into a single file. Amazon EMR clusters are more efficient when processing a few, large, LZO-compressed files than when processing many, small, Gzip-compressed files. To split LZO files, you must index them and use the hadoop-lzo third-party library.   
To load Amazon CloudFront logs into HDFS, type the following command, replace *j-3GYXXXXXX9IOK* with your cluster ID, and replace *amzn-s3-demo-bucket* with your Amazon S3 bucket name.   
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace them with a caret (^).

```
aws emr add-steps --cluster-id j-3GYXXXXXX9IOK \
--steps Type=CUSTOM_JAR,Name="S3DistCp step",Jar=/home/hadoop/lib/emr-s3distcp-1.0.jar,\
Args=["--src,s3://amzn-s3-demo-bucket/cf","--dest,hdfs:///local",\
"--groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*",\
"--targetSize,128",\
"--outputCodec,lzo","--deleteOnSuccess"]
```
Consider the case in which the preceding example is run over the following CloudFront log files.   

```
s3://amzn-s3-demo-bucket/cf/XABCD12345678.2012-02-23-01.HLUS3JKx.gz
s3://amzn-s3-demo-bucket/cf/XABCD12345678.2012-02-23-01.I9CNAZrg.gz
s3://amzn-s3-demo-bucket/cf/XABCD12345678.2012-02-23-02.YRRwERSA.gz
s3://amzn-s3-demo-bucket/cf/XABCD12345678.2012-02-23-02.dshVLXFE.gz
s3://amzn-s3-demo-bucket/cf/XABCD12345678.2012-02-23-02.LpLfuShd.gz
```
S3DistCp copies, concatenates, and compresses the files into the following two files, where the file name is determined by the match made by the regular expression.   

```
hdfs:///local/2012-02-23-01.lzo
hdfs:///local/2012-02-23-02.lzo
```