
Customizing Cluster and Application Configuration With Earlier AMI Versions of Amazon EMR

Amazon EMR release version 4.0.0 introduced a simplified method of configuring applications using configuration classifications. For more information, see Configuring Applications. When using an AMI version, you configure applications using bootstrap actions along with arguments that you pass to them. For example, the configure-hadoop and configure-daemons bootstrap actions set Hadoop- and YARN-specific environment properties such as --namenode-heap-size. In more recent release versions, these are configured using the hadoop-env and yarn-env configuration classifications. For bootstrap actions that perform common configurations, see the emr-bootstrap-actions repository on GitHub.

The following tables map bootstrap actions to configuration classifications in more recent Amazon EMR release versions.


Hadoop

Affected application file name | AMI version bootstrap action | Configuration classification
core-site.xml | configure-hadoop -c | core-site
log4j.properties | configure-hadoop -l | hadoop-log4j
hdfs-site.xml | configure-hadoop -s | hdfs-site
n/a | n/a | hdfs-encryption-zones
mapred-site.xml | configure-hadoop -m | mapred-site
yarn-site.xml | configure-hadoop -y | yarn-site
httpfs-site.xml | configure-hadoop -t | httpfs-site
capacity-scheduler.xml | configure-hadoop -z | capacity-scheduler
yarn-env.sh | configure-daemons --resourcemanager-opts | yarn-env
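For example, where an AMI-version cluster passes -y key=value arguments to configure-hadoop to edit yarn-site.xml, a cluster using release 4.0.0 or later expresses the same change declaratively with the yarn-site classification. A minimal sketch; the yarn.nodemanager.vmem-check-enabled property is only an illustrative choice:

[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.nodemanager.vmem-check-enabled": "false"
    }
  }
]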


Hive

Affected application file name | AMI version bootstrap action | Configuration classification
hive-env.sh | n/a | hive-env
hive-site.xml | hive-script --install-hive-site ${MY_HIVE_SITE_FILE} | hive-site
hive-exec-log4j.properties | n/a | hive-exec-log4j
hive-log4j.properties | n/a | hive-log4j
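Where an AMI-version cluster installs an entire custom hive-site.xml with hive-script --install-hive-site, a 4.x cluster can instead override individual properties with the hive-site classification. A minimal sketch; hive.mapred.mode is only an example property:

[
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.mapred.mode": "strict"
    }
  }
]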


EMRFS

Affected application file name | AMI version bootstrap action | Configuration classification
emrfs-site.xml | configure-hadoop -e | emrfs-site
n/a | s3get -s s3://custom-provider.jar -d /usr/share/aws/emr/auxlib/ | emrfs-site (with new setting fs.s3.cse.encryptionMaterialsProvider.uri)
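For the second row, a sketch of the equivalent 4.x classification, assuming client-side encryption is being enabled with the custom provider JAR shown in the table (fs.s3.cse.enabled is the EMRFS setting that turns client-side encryption on):

[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.cse.enabled": "true",
      "fs.s3.cse.encryptionMaterialsProvider.uri": "s3://custom-provider.jar"
    }
  }
]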

For a list of all classifications, see Configuring Applications.

Application Environment Variables

When using an AMI version, a script is used along with the configure-daemons bootstrap action to configure the Hadoop environment. The script includes the following actions:

#!/bin/bash
export HADOOP_USER_CLASSPATH_FIRST=true;
echo "HADOOP_CLASSPATH=/path/to/my.jar" >> /home/hadoop/conf/

In Amazon EMR release 4.x, you do the same using the hadoop-env configuration classification, as shown in the following example:

[
  {
    "Classification": "hadoop-env",
    "Properties": {
    },
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "HADOOP_USER_CLASSPATH_FIRST": "true",
          "HADOOP_CLASSPATH": "/path/to/my.jar"
        }
      }
    ]
  }
]
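You can supply a classification like this at cluster creation with the --configurations option, for example by referencing a JSON file. A sketch; the release label, instance settings, and file name are placeholders:

aws emr create-cluster --release-label emr-4.6.0 --applications Name=Hadoop \
--instance-type m3.xlarge --instance-count 2 --use-default-roles \
--ec2-attributes KeyName=myKey \
--configurations file://./hadoop-env.json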

As another example, using configure-daemons and passing --namenode-heap-size=2048 and --namenode-opts=-XX:GCTimeRatio=19 is equivalent to the following configuration classifications.

[
  {
    "Classification": "hadoop-env",
    "Properties": {
    },
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "HADOOP_NAMENODE_HEAPSIZE": "2048",
          "HADOOP_NAMENODE_OPTS": "-XX:GCTimeRatio=19"
        }
      }
    ]
  }
]

Other application environment variables are no longer defined in /home/hadoop/.bashrc. Instead, they are primarily set in /etc/default files per component or application, such as /etc/default/hadoop. Wrapper scripts in /usr/bin/ installed by application RPMs may also set additional environment variables before invoking the actual bin script.
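For example, on a cluster node you can inspect these locations directly to see where a given variable is set. A sketch, assuming the Hadoop component is installed:

# Per-component environment defaults
cat /etc/default/hadoop
# Wrapper script installed in /usr/bin by the application RPM
head -n 20 /usr/bin/hadoop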

Service Ports

When using an AMI version, some services use custom ports.

Changes in Port Settings

Setting | AMI Version 3.x | Open-source default
fs.default.name | hdfs://emrDeterminedIP:9000 | default (hdfs://emrDeterminedIP:8020)
dfs.datanode.address | 0.0.0.0:9200 | default (0.0.0.0:50010)
dfs.datanode.http.address | 0.0.0.0:9102 | default (0.0.0.0:50075)
dfs.datanode.https.address | 0.0.0.0:9402 | default (0.0.0.0:50475)
dfs.datanode.ipc.address | 0.0.0.0:9201 | default (0.0.0.0:50020)
dfs.http.address | 0.0.0.0:9101 | default (0.0.0.0:50070)
dfs.https.address | 0.0.0.0:9202 | default (0.0.0.0:50470)
dfs.secondary.http.address | 0.0.0.0:9104 | default (0.0.0.0:50090)
yarn.nodemanager.address | 0.0.0.0:9103 | default (${yarn.nodemanager.hostname}:0)
yarn.nodemanager.localizer.address | 0.0.0.0:9033 | default (${yarn.nodemanager.hostname}:8040)
yarn.nodemanager.webapp.address | 0.0.0.0:9035 | default (${yarn.nodemanager.hostname}:8042)
yarn.resourcemanager.address | emrDeterminedIP:9022 | default (${yarn.resourcemanager.hostname}:8032)
yarn.resourcemanager.admin.address | emrDeterminedIP:9025 | default (${yarn.resourcemanager.hostname}:8033)
yarn.resourcemanager.resource-tracker.address | emrDeterminedIP:9023 | default (${yarn.resourcemanager.hostname}:8031)
yarn.resourcemanager.scheduler.address | emrDeterminedIP:9024 | default (${yarn.resourcemanager.hostname}:8030)
yarn.resourcemanager.webapp.address | 0.0.0.0:9026 | default (${yarn.resourcemanager.hostname}:8088)
yarn.web-proxy.address | emrDeterminedIP:9046 | default (no-value)
yarn.resourcemanager.hostname | emrDeterminedIP | default (0.0.0.0)


In this table, emrDeterminedIP is an IP address generated by Amazon EMR.
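To confirm the value in effect on a particular cluster, you can query the setting on a node. hdfs getconf is part of standard Hadoop; the key shown is one row from the table above:

# Print the effective NameNode HTTP address
hdfs getconf -confKey dfs.http.address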


Users

When using an AMI version, the user hadoop runs all processes and owns all files. In Amazon EMR release version 4.0.0 and later, users exist at the application and component level.
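A quick way to observe this difference on a node; the sketch assumes typical per-application users such as hdfs, yarn, and hive on 4.x clusters, and output varies by installed applications:

# On an AMI version, most daemons run as the hadoop user;
# on 4.x and later, they run as per-application users.
ps -eo user,args | grep -E 'hadoop|hdfs|yarn|hive' | grep -v grep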

Installation Sequence, Installed Artifacts, and Log File Locations

When using an AMI version, application artifacts and their configuration directories are installed in the /home/hadoop/application directory. For example, if you installed Hive, the directory would be /home/hadoop/hive. In Amazon EMR release 4.0.0 and later, application artifacts are installed in the /usr/lib/application directory. When using an AMI version, log files are written to the locations listed in the following table.

Changes in Log Locations on Amazon S3

Daemon or Application | Directory location
instance-state | node/instance-id/instance-state/
hadoop-hdfs-namenode | daemons/instance-id/hadoop-hadoop-namenode.log
hadoop-hdfs-datanode | daemons/instance-id/hadoop-hadoop-datanode.log
hadoop-yarn (ResourceManager) | daemons/instance-id/yarn-hadoop-resourcemanager
hadoop-yarn (Proxy Server) | daemons/instance-id/yarn-hadoop-proxyserver
mapred-historyserver | daemons/instance-id/
httpfs | daemons/instance-id/httpfs.log
hive-server | node/instance-id/hive-server/hive-server.log
hive-metastore | node/instance-id/apps/hive.log
Hive CLI | node/instance-id/apps/hive.log
YARN applications user logs and container logs | task-attempts/
Mahout | N/A
Pig | N/A
spark-historyserver | N/A
mapreduce job history files | jobs/
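For example, to compare artifact locations on a node and to browse the log directories that Amazon EMR pushes to Amazon S3; the bucket name and cluster ID are placeholders:

# AMI version: application artifacts under /home/hadoop
ls /home/hadoop/hive
# Release 4.0.0 and later: artifacts under /usr/lib
ls /usr/lib/hive
# Browse pushed logs under the cluster's configured S3 log URI
aws s3 ls s3://mybucket/logs/j-XXXXXXXXXXXXX/node/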

Command Runner

When using an AMI version, many scripts or programs, like /home/hadoop/contrib/streaming/hadoop-streaming.jar, are not placed on the shell login path environment, so you need to specify the full path when you use a jar file such as command-runner.jar or script-runner.jar to execute the scripts. command-runner.jar is located on the AMI, so there is no need to specify a full URI, as was the case with script-runner.jar.
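For example, a step that uses command-runner.jar to run a script by its full path might look like the following; the cluster ID and script path are placeholders:

aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
--steps Type=CUSTOM_JAR,Name="Run script",Jar="command-runner.jar",Args=["bash","/home/hadoop/myscript.sh"]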

Replication Factor

The replication factor lets you configure when to start a Hadoop JVM. You can start a new Hadoop JVM for every task, which provides better task isolation, or you can share JVMs between tasks, providing lower framework overhead. If you are processing many small files, it makes sense to reuse the JVM many times to amortize the cost of start-up. However, if each task takes a long time or processes a large amount of data, then you might choose to not reuse the JVM to ensure that all memory is freed for subsequent tasks. When using an AMI version, you can customize the replication factor using the configure-hadoop bootstrap action to set the mapred.job.reuse.jvm.num.tasks property.

The following example demonstrates setting the JVM reuse factor for infinite JVM reuse.


Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).

aws emr create-cluster --name "Test cluster" --ami-version 3.11.0 \
--applications Name=Hue Name=Hive Name=Pig \
--use-default-roles --ec2-attributes KeyName=myKey \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--bootstrap-actions Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop,\
Name="Configuring infinite JVM reuse",Args=["-m","mapred.job.reuse.jvm.num.tasks=-1"]