Amazon Elastic MapReduce
Amazon EMR Release Guide

Differences Introduced in 4.x

AWS has made a series of changes to Amazon EMR releases that introduce differences between previous versions and the 4.x releases. The scope of these changes ranges from how you create and configure your cluster to the ports and directory structure of applications on your cluster. The following sections detail these changes.

AMI Version vs. Release Label

Before Amazon EMR release 4.0.0, Amazon EMR software was referenced by its AMI versions. With Amazon EMR release 4.0.0 and later, releases are now referenced by their release label.

You can specify the release in the following ways:

Console

For 2.x and 3.x releases, you still choose the AMI version under Version.

Choose EMR release for 4.x or later releases.

CLI

For AMI version releases 2.x and 3.x, specify --ami-version 3.x.x.

For EMR releases emr-4.0.0 or later, use --release-label emr-4.x.x.

API and SDK

In the API, you provide either AmiVersion or ReleaseLabel, depending on the release.

In the Java SDK, the following RunJobFlowRequest call specifies an AMI version:

// Clusters on AMI versions 2.x and 3.x are identified by AmiVersion.
RunJobFlowRequest request = new RunJobFlowRequest()
	.withName("AmiVersion Cluster")
	.withAmiVersion("3.11.0")
	.withInstances(new JobFlowInstancesConfig()
		.withEc2KeyName("myKeyPair")
		.withInstanceCount(1)
		.withKeepJobFlowAliveWhenNoSteps(true)
		.withMasterInstanceType("m3.xlarge")
		.withSlaveInstanceType("m3.xlarge"));

The following RunJobFlowRequest call uses a release label instead:

// Clusters on release 4.0.0 and later are identified by ReleaseLabel.
RunJobFlowRequest request = new RunJobFlowRequest()
	.withName("ReleaseLabel Cluster")
	.withReleaseLabel("emr-4.7.2")
	.withInstances(new JobFlowInstancesConfig()
		.withEc2KeyName("myKeyPair")
		.withInstanceCount(1)
		.withKeepJobFlowAliveWhenNoSteps(true)
		.withMasterInstanceType("m3.xlarge")
		.withSlaveInstanceType("m3.xlarge"));
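The CLI rule described above (--ami-version for 2.x and 3.x, --release-label for emr-4.0.0 and later) can be sketched as a small helper. This is an illustration only: the function name release_args is hypothetical, and the check assumes that release labels always start with the prefix emr-, as in the examples in this section.

```python
def release_args(version):
    """Return the create-cluster option pair for an AMI version or release label.

    Hypothetical helper: assumes release labels start with "emr-" and
    AMI versions are plain dotted numbers such as "3.11.0".
    """
    if version.startswith("emr-"):
        return ["--release-label", version]
    major = int(version.split(".")[0])
    if major in (2, 3):
        return ["--ami-version", version]
    raise ValueError(
        f"{version!r} is neither a 2.x/3.x AMI version nor an emr-4.x.x release label"
    )


print(release_args("3.11.0"))
print(release_args("emr-4.7.2"))
```

The returned pair could be appended to an aws emr create-cluster argument list; a bare "4.7.2" is rejected because 4.x releases are referenced by label rather than by AMI version.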

Installing Applications on the Cluster

In AMI versions 2.x and 3.x, applications were installed in any number of ways, including the NewSupportedProducts parameter for the RunJobFlow action, bootstrap actions, and the Step action. With Amazon EMR release 4.x, there is a new, simpler way to install applications on your cluster:

Console

On the Quick Create page, applications are grouped in bundles. In releases 4.x or greater, you can choose from All Applications (Hadoop, Spark, Pig, Hive, and Mahout), Core Hadoop (Hadoop, Hive, and Pig), and Spark (Spark and Hadoop-YARN). On the Advanced cluster configuration page, you can select the exact applications to install, and optionally edit the default configuration for each application. For more information about editing application configurations, see Configuring Applications.

CLI

Installing applications with the CLI is unchanged, except that you no longer provide application configuration with the Args parameter. Instead, use the Configurations parameter to provide the path to a JSON-formatted file containing a set of configuration objects. You can store the file locally or in Amazon S3. For more information, see Configuring Applications.

Java SDK

The preferred way to install applications using Java is to supply a list of applications to RunJobFlowRequest. Using the AWS SDK for Java, this looks like the following:


List<Application> myApps = new ArrayList<Application>();

Application hive = new Application();
hive.withName("Hive");
myApps.add(hive);
	    
Application spark = new Application();
spark.withName("Spark");
myApps.add(spark);

Application mahout = new Application();
mahout.withName("Mahout");
myApps.add(mahout);

// Pass the application list together with the release label.
RunJobFlowRequest request = new RunJobFlowRequest()
	.withName("My EMR Cluster")
	.withReleaseLabel("emr-4.7.2")
	.withApplications(myApps)
	.withInstances(new JobFlowInstancesConfig()
		.withEc2KeyName("myKeyName")
		.withInstanceCount(1)
		.withKeepJobFlowAliveWhenNoSteps(true)
		.withMasterInstanceType("m3.xlarge")
		.withSlaveInstanceType("m3.xlarge"));

Configurations Replace Predefined Bootstrap Actions

Application configuration is simplified in emr-4.x. Every application that you can specify in the Applications parameter of the RunJobFlow action can be configured using a Configuration object. Native applications are no longer configured by bootstrap actions but by configurations. For example, this method replaces the configure-hadoop and configure-daemons bootstrap actions, which were used to configure certain applications. Hadoop- and YARN-specific environment properties such as --namenode-heap-size are now configured using the hadoop-env and yarn-env classifications.

A Configuration object consists of a classification, properties, and optional nested configurations. A classification refers to an application-specific configuration file. Properties are the settings you want to change in that file. You typically supply configurations in a list, which lets you edit multiple configuration files in one JSON document.
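As a rough sketch of the structure described above, the following Python builds a list of configuration objects and serializes it to JSON. The helper name configuration is hypothetical, and the two property settings are illustrative values only, not recommendations from this guide.

```python
import json


def configuration(classification, properties=None, nested=None):
    """Build one configuration object (hypothetical helper).

    classification: name of the application-specific configuration file
    properties:     settings to change in that file
    nested:         optional list of nested configuration objects
    """
    config = {
        "Classification": classification,
        "Properties": dict(properties or {}),
    }
    if nested:
        config["Configurations"] = list(nested)
    return config


# Illustrative settings only; choose properties that fit your workload.
configs = [
    configuration("core-site", {"io.file.buffer.size": "65536"}),
    configuration("yarn-site", {"yarn.nodemanager.vmem-check-enabled": "false"}),
]

configurations_json = json.dumps(configs, indent=2)
print(configurations_json)
```

The resulting JSON could be saved to a local file or to Amazon S3 and supplied through the Configurations parameter.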

Important

If you try to use one of the previous bootstrap actions supported by Amazon EMR, the launch fails with a web service error on release emr-4.0.0 and later. Custom bootstrap actions that do not attempt to configure native applications continue to work. The following bootstrap actions are no longer supported: configure-daemons, configure-hadoop, and s3get.

Instead of using the s3get bootstrap action to copy objects to each node, use a custom bootstrap action that runs the AWS CLI on each node. The syntax looks like the following:

aws s3 cp s3://mybucket/myfolder/myobject myFolder/

In the AWS SDK for Java, use ScriptBootstrapActionConfig:

ScriptBootstrapActionConfig s3Config = new ScriptBootstrapActionConfig()
	.withPath("file:///usr/bin/aws")
	.withArgs("s3", "cp","s3://mybucket/myfolder/myobject","myFolder/");

With the AWS CLI, you can launch the cluster with the bootstrap action using the following command:

aws emr create-cluster --release-label emr-4.7.2 \
--instance-type m3.xlarge --instance-count 1 --bootstrap-actions Path=file:///usr/bin/aws,Name="copyToAll",Args="s3","cp","s3://mybucket/myfolder/myobject","myFolder/" --use-default-roles

Note

For Windows, replace the above Linux line continuation character (\) with the caret (^).
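The same bootstrap action can also be described as data. The sketch below assumes that the JSON form of --bootstrap-actions uses the same Path, Name, and Args keys as the shorthand shown above; the helper name s3_copy_bootstrap_action is hypothetical.

```python
import json


def s3_copy_bootstrap_action(source_uri, destination, name="copyToAll"):
    """Describe a bootstrap action that runs `aws s3 cp` on each node.

    Hypothetical helper; the keys mirror the Path/Name/Args shorthand.
    """
    return {
        "Path": "file:///usr/bin/aws",
        "Name": name,
        "Args": ["s3", "cp", source_uri, destination],
    }


action = s3_copy_bootstrap_action("s3://mybucket/myfolder/myobject", "myFolder/")
bootstrap_actions_json = json.dumps([action], indent=2)
print(bootstrap_actions_json)
```

Keeping the arguments in a list avoids the quoting and comma handling that the shorthand syntax requires.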

Bootstrap actions that were previously used to configure Hadoop and other applications are replaced by configurations. The following tables give the classifications for components and applications and the corresponding bootstrap action for each, where applicable. If the classification matches a file documented in the application project, see that project's documentation for more details.

Hadoop

Filename | AMI version bootstrap action | Release label classification
core-site.xml | configure-hadoop -c | core-site
log4j.properties | configure-hadoop -l | hadoop-log4j
hdfs-site.xml | configure-hadoop -s | hdfs-site
n/a | n/a | hdfs-encryption-zones
mapred-site.xml | configure-hadoop -m | mapred-site
yarn-site.xml | configure-hadoop -y | yarn-site
httpfs-site.xml | configure-hadoop -t | httpfs-site
capacity-scheduler.xml | configure-hadoop -z | capacity-scheduler
yarn-env.sh | configure-daemons --resourcemanager-opts | yarn-env

Hive

Filename | AMI version bootstrap action | Release label classification
hive-env.sh | n/a | hive-env
hive-site.xml | hive-script --install-hive-site ${MY_HIVE_SITE_FILE} | hive-site
hive-exec-log4j.properties | n/a | hive-exec-log4j
hive-log4j.properties | n/a | hive-log4j

EMRFS

Filename | AMI version bootstrap action | Release label classification
emrfs-site.xml | configure-hadoop -e | emrfs-site
n/a | s3get -s s3://custom-provider.jar -d /usr/share/aws/emr/auxlib/ | emrfs-site (with new setting fs.s3.cse.encryptionMaterialsProvider.uri)

For a list of all classifications, see Configuring Applications.
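When migrating a cluster definition, the tables above can be applied mechanically. The following sketch maps the configure-hadoop flags listed in the tables to their classifications and turns a flag plus a list of settings into a configuration object. It assumes that the old settings are available as key=value strings; the names CONFIGURE_HADOOP_FLAGS and to_configuration are hypothetical.

```python
# Flag-to-classification pairs taken from the Hadoop and EMRFS tables above.
CONFIGURE_HADOOP_FLAGS = {
    "-c": "core-site",
    "-l": "hadoop-log4j",
    "-s": "hdfs-site",
    "-m": "mapred-site",
    "-y": "yarn-site",
    "-t": "httpfs-site",
    "-z": "capacity-scheduler",
    "-e": "emrfs-site",
}


def to_configuration(flag, settings):
    """Convert a configure-hadoop flag and key=value strings into a
    configuration object (assumes settings are written as key=value)."""
    if flag not in CONFIGURE_HADOOP_FLAGS:
        raise ValueError(f"no classification known for flag {flag!r}")
    properties = {}
    for setting in settings:
        key, separator, value = setting.partition("=")
        if not separator:
            raise ValueError(f"expected key=value, got {setting!r}")
        properties[key] = value
    return {
        "Classification": CONFIGURE_HADOOP_FLAGS[flag],
        "Properties": properties,
    }


print(to_configuration("-y", ["yarn.nodemanager.vmem-check-enabled=false"]))
```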

Install Steps Are Deprecated

Certain predefined steps, such as those used to install Hive and Pig, are deprecated. Use the configuration interface instead.

Application Environment

In Amazon EMR AMI versions 2.x and 3.x, there was a hadoop-user-env.sh script which was not part of standard Hadoop and was used along with the configure-daemons bootstrap action to configure the Hadoop environment. The script included the following actions:

#!/bin/bash 
export HADOOP_USER_CLASSPATH_FIRST=true; 
echo "HADOOP_CLASSPATH=/path/to/my.jar" >> /home/hadoop/conf/hadoop-user-env.sh

In Amazon EMR release 4.x, you can achieve the same result with the hadoop-env classification:

[ 
      { 
         "Classification":"hadoop-env",
         "Properties":{ 

         },
         "Configurations":[ 
            { 
               "Classification":"export",
               "Properties":{ 
                  "HADOOP_USER_CLASSPATH_FIRST":"true",
                  "HADOOP_CLASSPATH":"/path/to/my.jar"
               }
            }
         ]
      }
   ]

You may have previously used a bootstrap action configure-daemons to pass the environment. For example, if you set --namenode-heap-size=2048 and --namenode-opts=-XX:GCTimeRatio=19 with configure-daemons, the equivalent JSON would look like the following:

[ 
      { 
         "Classification":"hadoop-env",
         "Properties":{ 

         },
         "Configurations":[ 
            { 
               "Classification":"export",
               "Properties":{ 
                  "HADOOP_NAMENODE_HEAPSIZE":"2048",
                  "HADOOP_NAMENODE_OPTS":"-XX:GCTimeRatio=19"
               }
            }
         ]
      }
   ]
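Both JSON documents above share the same shape: an outer classification with empty properties and a nested export classification that carries the environment variables. A minimal sketch that generates this shape follows; the helper name env_exports is hypothetical, and the variables are the ones from the hadoop-user-env.sh example.

```python
import json


def env_exports(classification, variables):
    """Wrap environment variables in a nested "export" configuration
    (hypothetical helper that reproduces the JSON shape shown above)."""
    return [
        {
            "Classification": classification,
            "Properties": {},
            "Configurations": [
                {
                    "Classification": "export",
                    "Properties": dict(variables),
                }
            ],
        }
    ]


hadoop_env = env_exports(
    "hadoop-env",
    {
        "HADOOP_USER_CLASSPATH_FIRST": "true",
        "HADOOP_CLASSPATH": "/path/to/my.jar",
    },
)
print(json.dumps(hadoop_env, indent=2))
```

The same helper could be called with "yarn-env" as the classification for YARN daemon settings.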

Other application environment variables are no longer defined in /home/hadoop/.bashrc. Instead, they are primarily set in /etc/default files per component or application, such as /etc/default/hadoop. Wrapper scripts in /usr/bin/ installed by application RPMs may also set additional environment variables before invoking the actual bin script.

Service Ports

In Amazon EMR AMI versions 2.x and 3.x, some services used custom ports. The emr-4.x releases host these services on open source community defaults in most cases.

Changes in Port Settings

Setting | AMI Version 3.x | Release Label emr-4.x
fs.default.name | hdfs://emrDeterminedIP:9000 | default (hdfs://emrDeterminedIP:8020)
dfs.datanode.address | 0.0.0.0:9200 | default (0.0.0.0:50010)
dfs.datanode.http.address | 0.0.0.0:9102 | default (0.0.0.0:50075)
dfs.datanode.https.address | 0.0.0.0:9402 | default (0.0.0.0:50475)
dfs.datanode.ipc.address | 0.0.0.0:9201 | default (0.0.0.0:50020)
dfs.http.address | 0.0.0.0:9101 | default (0.0.0.0:50070)
dfs.https.address | 0.0.0.0:9202 | default (0.0.0.0:50470)
dfs.secondary.http.address | 0.0.0.0:9104 | default (0.0.0.0:50090)
yarn.nodemanager.address | 0.0.0.0:9103 | default (${yarn.nodemanager.hostname}:0)
yarn.nodemanager.localizer.address | 0.0.0.0:9033 | default (${yarn.nodemanager.hostname}:8040)
yarn.nodemanager.webapp.address | 0.0.0.0:9035 | default (${yarn.nodemanager.hostname}:8042)
yarn.resourcemanager.address | emrDeterminedIP:9022 | default (${yarn.resourcemanager.hostname}:8032)
yarn.resourcemanager.admin.address | emrDeterminedIP:9025 | default (${yarn.resourcemanager.hostname}:8033)
yarn.resourcemanager.resource-tracker.address | emrDeterminedIP:9023 | default (${yarn.resourcemanager.hostname}:8031)
yarn.resourcemanager.scheduler.address | emrDeterminedIP:9024 | default (${yarn.resourcemanager.hostname}:8030)
yarn.resourcemanager.webapp.address | 0.0.0.0:9026 | default (${yarn.resourcemanager.hostname}:8088)
yarn.web-proxy.address | emrDeterminedIP:9046 | default (no-value)
yarn.resourcemanager.hostname | 0.0.0.0 (default) | emrDeterminedIP

Note

The term emrDeterminedIP refers to an IP address that is generated by the Amazon EMR control plane. In emr-4.x releases, this convention has been eliminated except for the yarn.resourcemanager.hostname and fs.default.name settings.
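Scripts, bookmarks, and client settings that hard-code the AMI 3.x ports need updating. The sketch below rewrites the port in a URL using the fixed port pairs from the table above. It leaves out yarn.nodemanager.address and yarn.web-proxy.address, which have no fixed port in emr-4.x. The names OLD_TO_NEW_PORT and migrate_url and the host names are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

# AMI 3.x port -> emr-4.x default port, taken from the table above.
OLD_TO_NEW_PORT = {
    9000: 8020,   # fs.default.name
    9200: 50010,  # dfs.datanode.address
    9102: 50075,  # dfs.datanode.http.address
    9402: 50475,  # dfs.datanode.https.address
    9201: 50020,  # dfs.datanode.ipc.address
    9101: 50070,  # dfs.http.address
    9202: 50470,  # dfs.https.address
    9104: 50090,  # dfs.secondary.http.address
    9033: 8040,   # yarn.nodemanager.localizer.address
    9035: 8042,   # yarn.nodemanager.webapp.address
    9022: 8032,   # yarn.resourcemanager.address
    9025: 8033,   # yarn.resourcemanager.admin.address
    9023: 8031,   # yarn.resourcemanager.resource-tracker.address
    9024: 8030,   # yarn.resourcemanager.scheduler.address
    9026: 8088,   # yarn.resourcemanager.webapp.address
}


def migrate_url(url):
    """Replace a known AMI 3.x port in a URL with its emr-4.x default.

    URLs without a port, or with a port not in the table, are returned
    unchanged. Any user info in the URL is not preserved.
    """
    parts = urlsplit(url)
    new_port = OLD_TO_NEW_PORT.get(parts.port)
    if new_port is None:
        return url
    netloc = f"{parts.hostname}:{new_port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


print(migrate_url("http://master:9026/cluster"))
print(migrate_url("hdfs://10.0.0.1:9000/user/hadoop"))
```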

Users

In AMI versions 2.x and 3.x, the user hadoop ran all processes and owned all files. In Amazon EMR release 4.x, users exist at the application and component level. For example, the following is a process status that demonstrates this user ownership model:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
hive      6452  0.2  0.7 853684 218520 ?       Sl   16:32   0:13 /usr/lib/jvm/java-openjdk/bin/java -Xmx256m -Dhive.log.dir=/var/log/hive -Dhive.log.file=hive-metastore.log -Dhive.log.threshold=INFO -Dhadoop.log.dir=/usr/lib/hadoop
hive      6557  0.2  0.6 849508 202396 ?       Sl   16:32   0:09 /usr/lib/jvm/java-openjdk/bin/java -Xmx256m -Dhive.log.dir=/var/log/hive -Dhive.log.file=hive-server2.log -Dhive.log.threshold=INFO -Dhadoop.log.dir=/usr/lib/hadoop/l
hbase     6716  0.1  1.0 1755516 336600 ?      Sl   Jun21   2:20 /usr/lib/jvm/java-openjdk/bin/java -Dproc_master -XX:OnOutOfMemoryError=kill -9 %p -Xmx1024m -ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -Dhbase.log.dir=/var/
hbase     6871  0.0  0.7 1672196 237648 ?      Sl   Jun21   0:46 /usr/lib/jvm/java-openjdk/bin/java -Dproc_thrift -XX:OnOutOfMemoryError=kill -9 %p -Xmx1024m -ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -Dhbase.log.dir=/var/
hdfs      7491  0.4  1.0 1719476 309820 ?      Sl   16:32   0:22 /usr/lib/jvm/java-openjdk/bin/java -Dproc_namenode -Xmx1000m -Dhadoop.log.dir=/var/log/hadoop-hdfs -Dhadoop.log.file=hadoop-hdfs-namenode-ip-10-71-203-213.log -Dhadoo
yarn      8524  0.1  0.6 1626164 211300 ?      Sl   16:33   0:05 /usr/lib/jvm/java-openjdk/bin/java -Dproc_proxyserver -Xmx1000m -Dhadoop.log.dir=/var/log/hadoop-yarn -Dyarn.log.dir=/var/log/hadoop-yarn -Dhadoop.log.file=yarn-yarn-
yarn      8646  1.0  1.2 1876916 385308 ?      Sl   16:33   0:46 /usr/lib/jvm/java-openjdk/bin/java -Dproc_resourcemanager -Xmx1000m -Dhadoop.log.dir=/var/log/hadoop-yarn -Dyarn.log.dir=/var/log/hadoop-yarn -Dhadoop.log.file=yarn-y
mapred    9265  0.2  0.8 1666628 260484 ?      Sl   16:33   0:12 /usr/lib/jvm/java-openjdk/bin/java -Dproc_historyserver -Xmx1000m -Dhadoop.log.dir=/usr/lib/hadoop/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop

Installation Sequence, Installed Artifact, and Log File Locations

In AMI versions 2.x and 3.x, application artifacts and their configuration directories were installed to /home/hadoop/application. For example, if you installed Hive, the directory would be /home/hadoop/hive. In Amazon EMR release 4.0.0 and later, application artifacts are installed in /usr/lib/application, so Hive would be located in /usr/lib/hive. Most configuration files are stored in /etc/application/conf, so hive-site.xml, Hive's configuration file, would be located in /etc/hive/conf.

Previously, log files were found in various places. In Amazon EMR 4.x and later, they are now all located under /var/log/component.
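The directory conventions above can be captured in a small lookup. This is a sketch only: the function name emr4_locations is hypothetical, and it assumes the log directory uses the same name as the application, which holds for Hive but may differ for components such as hadoop-hdfs.

```python
def emr4_locations(application):
    """Return the conventional emr-4.x directories for an application.

    Assumes the component name under /var/log matches the application
    name; check the cluster for components where it does not.
    """
    return {
        "artifacts": f"/usr/lib/{application}",
        "configuration": f"/etc/{application}/conf",
        "logs": f"/var/log/{application}",
    }


hive = emr4_locations("hive")
print(hive["artifacts"])
print(hive["configuration"])
print(hive["logs"])
```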

Locations for log files pushed to Amazon S3 have changed as follows:

Changes in Log Locations on Amazon S3

Daemon or Application | AMI 3.x | emr-4.0.0
instance-state | node/instance-id/instance-state/ | node/instance-id/instance-state/
hadoop-hdfs-namenode | daemons/instance-id/hadoop-hadoop-namenode.log | node/instance-id/applications/hadoop-hdfs/hadoop-hdfs-namenode-ip-ipAddress.log
hadoop-hdfs-datanode | daemons/instance-id/hadoop-hadoop-datanode.log | node/instance-id/applications/hadoop-hdfs/hadoop-hdfs-datanode-ip-ipAddress.log
hadoop-yarn (ResourceManager) | daemons/instance-id/yarn-hadoop-resourcemanager | node/instance-id/applications/hadoop-yarn/yarn-yarn-resourcemanager-ip-ipAddress.log
hadoop-yarn (Proxy Server) | daemons/instance-id/yarn-hadoop-proxyserver | node/instance-id/applications/hadoop-yarn/yarn-yarn-proxyserver-ip-ipAddress.log
mapred-historyserver | daemons/instance-id/ | node/instance-id/applications/hadoop-mapreduce/mapred-mapred-historyserver-ip-ipAddress.log
httpfs | daemons/instance-id/httpfs.log | node/instance-id/applications/hadoop-httpfs/httpfs.log
hive-server | node/instance-id/hive-server/hive-server.log | node/instance-id/applications/hive/hive-server.log
hive-metastore | node/instance-id/apps/hive.log | node/instance-id/applications/hive/hive-metastore.log
Hive CLI | node/instance-id/apps/hive.log | node/instance-id/applications/hive/tmp/$username/hive.log
YARN applications user logs and container logs | task-attempts/ | containers/
Mahout | N/A | node/instance-id/applications/mahout
Pig | N/A | node/instance-id/applications/pig/pig.log
spark-historyserver | N/A | node/instance-id/applications/spark/spark-historyserver.log
mapreduce job history files | jobs/ | hadoop-mapred/history/
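To locate a log object programmatically, the emr-4.0.0 column can be treated as a set of key templates. The sketch below covers a few rows and assumes that ipAddress in the table is the instance's private IP address with dots replaced by dashes, as the log file name hadoop-hdfs-namenode-ip-10-71-203-213.log in the process listing earlier suggests. The names EMR4_LOG_TEMPLATES and emr4_log_key and the instance ID are hypothetical, and the keys are relative to wherever the cluster's logs are stored in Amazon S3.

```python
# Key templates for a few rows of the emr-4.0.0 column above.
EMR4_LOG_TEMPLATES = {
    "hadoop-hdfs-namenode": "node/{instance_id}/applications/hadoop-hdfs/hadoop-hdfs-namenode-ip-{ip}.log",
    "hadoop-hdfs-datanode": "node/{instance_id}/applications/hadoop-hdfs/hadoop-hdfs-datanode-ip-{ip}.log",
    "httpfs": "node/{instance_id}/applications/hadoop-httpfs/httpfs.log",
    "hive-metastore": "node/{instance_id}/applications/hive/hive-metastore.log",
}


def emr4_log_key(daemon, instance_id, private_ip=""):
    """Build the relative log key for a daemon on one instance.

    Assumes the IP part of the file name is the private IP address
    with dots replaced by dashes.
    """
    template = EMR4_LOG_TEMPLATES[daemon]
    return template.format(instance_id=instance_id, ip=private_ip.replace(".", "-"))


print(emr4_log_key("hadoop-hdfs-namenode", "i-0123abcd", "10.71.203.213"))
print(emr4_log_key("httpfs", "i-0123abcd"))
```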

Command Runner

Many scripts and programs, such as /home/hadoop/contrib/streaming/hadoop-streaming.jar, are now on the shell login path, so you do not need to specify their full path when you run them with command-runner.jar. You also do not need to know the full path to command-runner.jar itself: it is located on the AMI, so no full URI is required, as was the case with script-runner.jar.

The following is a list of scripts that can be executed with command-runner.jar:

hadoop-streaming

Submit a Hadoop streaming program. In the console and some SDKs, this is a streaming step.

hive-script

Run a Hive script. In the console and SDKs, this is a Hive step.

pig-script

Run a Pig script. In the console and SDKs, this is a Pig step.

spark-submit

Run a Spark application. In the console, this is a Spark step.

s3-dist-cp

Copy large amounts of data from Amazon S3 into HDFS in a distributed fashion.

hadoop-lzo

Run the Hadoop LZO indexer on a directory.

The following is an example usage of command-runner.jar using the AWS CLI:

aws emr add-steps --cluster-id j-2AXXXXXXGAPLF --steps Name="Command Runner",Jar="command-runner.jar",Args=["spark-submit","Args..."]
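A step definition like the one in the command above can also be assembled as data before it is submitted. The sketch below checks the script name against the list in this section and builds a dictionary with the same Name, Jar, and Args keys. The helper name command_runner_step, the main class, and the JAR location are hypothetical.

```python
import json

# Scripts listed above as runnable through command-runner.jar.
COMMAND_RUNNER_SCRIPTS = {
    "hadoop-streaming",
    "hive-script",
    "pig-script",
    "spark-submit",
    "s3-dist-cp",
    "hadoop-lzo",
}


def command_runner_step(name, script, *script_args):
    """Build a step definition that runs a script via command-runner.jar
    (hypothetical helper; keys mirror the Name/Jar/Args shorthand)."""
    if script not in COMMAND_RUNNER_SCRIPTS:
        raise ValueError(f"{script!r} is not in the list of command-runner.jar scripts")
    return {
        "Name": name,
        "Jar": "command-runner.jar",
        "Args": [script, *script_args],
    }


# The class name and JAR location below are placeholders.
step = command_runner_step(
    "Command Runner",
    "spark-submit",
    "--class", "org.example.Main",
    "s3://mybucket/app.jar",
)
print(json.dumps([step], indent=2))
```

Because no path is needed for either the script or command-runner.jar, the step definition stays the same across clusters.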