Amazon EMR Release Guide

Configuring Applications

You can override the default configurations for the applications you install by supplying a configuration object when you create a cluster. A configuration object consists of a classification, properties, and optional nested configurations. A classification refers to an application-specific configuration file, and properties are the settings you want to change in that file. You typically supply configurations in a list, which lets you edit multiple configuration files in one JSON document.

Example JSON for a list of configurations is provided below:

[
  {
    "Classification": "core-site",
    "Properties": {
      "hadoop.security.groups.cache.secs": "250"
    }
  },
  {
    "Classification": "mapred-site",
    "Properties": {
      "mapred.tasktracker.map.tasks.maximum": "2",
      "mapreduce.map.sort.spill.percent": "90",
      "mapreduce.tasktracker.reduce.tasks.maximum": "5"
    }
  }
]

The classification usually specifies the file name that you want modified. An exception is the deprecated bootstrap action configure-daemons, which was used to set environment parameters such as --namenode-heap-size. Options like this are now subsumed into the hadoop-env and yarn-env classifications, which have their own nested export classifications. If any classification ends in "env", use the export sub-classification. Another exception is s3get, which was used to place a custom EncryptionMaterialsProvider object on each node in a cluster for use in client-side encryption. An option was added to the emrfs-site classification for this purpose (see the sketch following the yarn-env example below).

An example of the hadoop-env classification is provided below:

[
  {
    "Classification": "hadoop-env",
    "Properties": {},
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "HADOOP_DATANODE_HEAPSIZE": "2048",
          "HADOOP_NAMENODE_OPTS": "-XX:GCTimeRatio=19"
        },
        "Configurations": []
      }
    ]
  }
]

An example of the yarn-env classification is provided below:

[
  {
    "Classification": "yarn-env",
    "Properties": {},
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "YARN_RESOURCEMANAGER_OPTS": "-Xdebug -Xrunjdwp:transport=dt_socket"
        },
        "Configurations": []
      }
    ]
  }
]
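
For the client-side encryption exception noted above, the provider is now supplied through the emrfs-site classification rather than s3get. The following is a minimal sketch, assuming a hypothetical provider class com.mycompany.MyEncryptionMaterialsProvider that you have made available on each node (property names follow the EMRFS client-side encryption documentation):

[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.cse.enabled": "true",
      "fs.s3.cse.encryptionMaterialsProvider": "com.mycompany.MyEncryptionMaterialsProvider"
    }
  }
]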

The following settings do not belong to a configuration file but are used by Amazon EMR to set multiple settings on your behalf.

Amazon EMR-curated Settings

Application: Spark
Release label classification: spark
Valid properties: maximizeResourceAllocation
When to use: Configure executors to utilize the maximum resources of each node.
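
For example, to have Amazon EMR configure Spark executors to use the maximum resources of each node, you could supply the spark classification with maximizeResourceAllocation set to true (a minimal sketch):

[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  }
]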

The following are all configuration classifications for this release:

Configuration Classifications

capacity-scheduler: Change values in Hadoop's capacity-scheduler.xml file.
core-site: Change values in Hadoop's core-site.xml file.
emrfs-site: Change EMRFS settings.
flink-conf: Change flink-conf.yaml settings.
flink-log4j: Change Flink log4j.properties settings.
flink-log4j-yarn-session: Change Flink log4j-yarn-session.properties settings.
flink-log4j-cli: Change Flink log4j-cli.properties settings.
hadoop-env: Change values in the Hadoop environment for all Hadoop components.
hadoop-log4j: Change values in Hadoop's log4j.properties file.
hadoop-ssl-server: Change Hadoop SSL server configuration.
hadoop-ssl-client: Change Hadoop SSL client configuration.
hbase: Amazon EMR-curated settings for Apache HBase.
hbase-env: Change values in HBase's environment.
hbase-log4j: Change values in HBase's hbase-log4j.properties file.
hbase-metrics: Change values in HBase's hadoop-metrics2-hbase.properties file.
hbase-policy: Change values in HBase's hbase-policy.xml file.
hbase-site: Change values in HBase's hbase-site.xml file.
hdfs-encryption-zones: Configure HDFS encryption zones.
hdfs-site: Change values in HDFS's hdfs-site.xml file.
hcatalog-env: Change values in HCatalog's environment.
hcatalog-server-jndi: Change values in HCatalog's jndi.properties.
hcatalog-server-proto-hive-site: Change values in HCatalog's proto-hive-site.xml.
hcatalog-webhcat-env: Change values in HCatalog WebHCat's environment.
hcatalog-webhcat-log4j2: Change values in HCatalog WebHCat's log4j2.properties.
hcatalog-webhcat-site: Change values in HCatalog WebHCat's webhcat-site.xml file.
hive-beeline-log4j2: Change values in Hive's beeline-log4j2.properties file.
hive-env: Change values in the Hive environment.
hive-exec-log4j2: Change values in Hive's hive-exec-log4j2.properties file.
hive-llap-daemon-log4j2: Change values in Hive's llap-daemon-log4j2.properties file.
hive-log4j2: Change values in Hive's hive-log4j2.properties file.
hive-site: Change values in Hive's hive-site.xml file.
hiveserver2-site: Change values in Hive Server2's hiveserver2-site.xml file.
hue-ini: Change values in Hue's ini file.
httpfs-env: Change values in the HTTPFS environment.
httpfs-site: Change values in Hadoop's httpfs-site.xml file.
hadoop-kms-acls: Change values in Hadoop's kms-acls.xml file.
hadoop-kms-env: Change values in the Hadoop KMS environment.
hadoop-kms-log4j: Change values in Hadoop's kms-log4j.properties file.
hadoop-kms-site: Change values in Hadoop's kms-site.xml file.
mapred-env: Change values in the MapReduce application's environment.
mapred-site: Change values in the MapReduce application's mapred-site.xml file.
oozie-env: Change values in Oozie's environment.
oozie-log4j: Change values in Oozie's oozie-log4j.properties file.
oozie-site: Change values in Oozie's oozie-site.xml file.
phoenix-hbase-metrics: Change values in Phoenix's hadoop-metrics2-hbase.properties file.
phoenix-hbase-site: Change values in Phoenix's hbase-site.xml file.
phoenix-log4j: Change values in Phoenix's log4j.properties file.
phoenix-metrics: Change values in Phoenix's hadoop-metrics2-phoenix.properties file.
pig-properties: Change values in Pig's pig.properties file.
pig-log4j: Change values in Pig's log4j.properties file.
presto-log: Change values in Presto's log.properties file.
presto-config: Change values in Presto's config.properties file.
presto-connector-blackhole: Change values in Presto's blackhole.properties file.
presto-connector-cassandra: Change values in Presto's cassandra.properties file.
presto-connector-hive: Change values in Presto's hive.properties file.
presto-connector-jmx: Change values in Presto's jmx.properties file.
presto-connector-kafka: Change values in Presto's kafka.properties file.
presto-connector-localfile: Change values in Presto's localfile.properties file.
presto-connector-mongodb: Change values in Presto's mongodb.properties file.
presto-connector-mysql: Change values in Presto's mysql.properties file.
presto-connector-postgresql: Change values in Presto's postgresql.properties file.
presto-connector-raptor: Change values in Presto's raptor.properties file.
presto-connector-redis: Change values in Presto's redis.properties file.
presto-connector-tpch: Change values in Presto's tpch.properties file.
spark: Amazon EMR-curated settings for Apache Spark.
spark-defaults: Change values in Spark's spark-defaults.conf file.
spark-env: Change values in the Spark environment.
spark-hive-site: Change values in Spark's hive-site.xml file.
spark-log4j: Change values in Spark's log4j.properties file.
spark-metrics: Change values in Spark's metrics.properties file.
sqoop-env: Change values in Sqoop's environment.
sqoop-oraoop-site: Change values in Sqoop OraOop's oraoop-site.xml file.
sqoop-site: Change values in Sqoop's sqoop-site.xml file.
tez-site: Change values in Tez's tez-site.xml file.
yarn-env: Change values in the YARN environment.
yarn-site: Change values in YARN's yarn-site.xml file.
zeppelin-env: Change values in the Zeppelin environment.
zookeeper-config: Change values in ZooKeeper's zoo.cfg file.
zookeeper-log4j: Change values in ZooKeeper's log4j.properties file.


Example Supplying a Configuration in the Console

To supply a configuration, navigate to the Create cluster page and choose Edit software settings. You can then enter the configuration directly in the console (as JSON or using the shorthand syntax demonstrated in shadow text) or provide an Amazon S3 URI for a file containing a JSON Configurations object.


Example Supplying a Configuration Using the CLI

You can provide a configuration to create-cluster by supplying a path to a JSON file stored locally or in Amazon S3:

aws emr create-cluster --release-label emr-5.2.0 --instance-type m3.xlarge --instance-count 2 --applications Name=Hive --configurations https://s3.amazonaws.com/mybucket/myfolder/myConfig.json

If your configuration is in your local directory, you can use the following:

aws emr create-cluster --release-label emr-5.2.0 --applications Name=Hive \
--instance-type m3.xlarge --instance-count 3 --configurations file://./configurations.json
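
The file you reference (configurations.json in this example) contains a JSON list of configuration objects like the examples earlier in this section. For instance, a minimal sketch that sets a single Hive property:

[
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.join.emit.interval": "1000"
    }
  }
]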

Example Supplying a Configuration Using the Java SDK

The following program excerpt shows how to supply a configuration using the AWS SDK for Java:


import java.util.HashMap;
import java.util.Map;

import com.amazonaws.services.elasticmapreduce.model.Application;
import com.amazonaws.services.elasticmapreduce.model.Configuration;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;

Application hive = new Application().withName("Hive");

Map<String,String> hiveProperties = new HashMap<String,String>();
hiveProperties.put("hive.join.emit.interval","1000");
hiveProperties.put("hive.merge.mapfiles","true");

Configuration myHiveConfig = new Configuration()
    .withClassification("hive-site")
    .withProperties(hiveProperties);

RunJobFlowRequest request = new RunJobFlowRequest()
    .withName("Create cluster with ReleaseLabel")
    .withReleaseLabel("emr-5.2.0")
    .withApplications(hive)
    .withConfigurations(myHiveConfig)
    .withServiceRole("EMR_DefaultRole")
    .withJobFlowRole("EMR_EC2_DefaultRole")
    .withInstances(new JobFlowInstancesConfig()
        .withEc2KeyName("myKey")
        .withInstanceCount(1)
        .withKeepJobFlowAliveWhenNoSteps(true)
        .withMasterInstanceType("m3.xlarge")
        .withSlaveInstanceType("m3.xlarge")
    );

Configuring Applications to Use Java 8

You set JAVA_HOME for an application by supplying the setting to its environment classification, application-env (for example, hadoop-env or spark-env). For Hadoop and Hive, this looks like the following:

[
    {
        "Classification": "hadoop-env", 
        "Configurations": [
            {
                "Classification": "export", 
                "Configurations": [], 
                "Properties": {
                    "JAVA_HOME": "/usr/lib/jvm/java-1.8.0"
                }
            }
        ], 
        "Properties": {}
    }
]

For Spark, if you are writing a driver for submission in cluster mode, the driver uses Java 7, but setting the environment can ensure that the executors use Java 8. To do this, we recommend setting both the Hadoop and Spark classifications:

[
    {
        "Classification": "hadoop-env", 
        "Configurations": [
            {
                "Classification": "export", 
                "Configurations": [], 
                "Properties": {
                    "JAVA_HOME": "/usr/lib/jvm/java-1.8.0"
                }
            }
        ], 
        "Properties": {}
    }, 
    {
        "Classification": "spark-env", 
        "Configurations": [
            {
                "Classification": "export", 
                "Configurations": [], 
                "Properties": {
                    "JAVA_HOME": "/usr/lib/jvm/java-1.8.0"
                }
            }
        ], 
        "Properties": {}
    }
]

Service ports

The following are the YARN and HDFS service ports. These settings reflect Hadoop defaults. Other application services are hosted at their default ports unless otherwise documented. See the application's project documentation for more information.

Port Settings for YARN and HDFS

fs.default.name: default (hdfs://emrDeterminedIP:8020)
dfs.datanode.address: default (0.0.0.0:50010)
dfs.datanode.http.address: default (0.0.0.0:50075)
dfs.datanode.https.address: default (0.0.0.0:50475)
dfs.datanode.ipc.address: default (0.0.0.0:50020)
dfs.http.address: default (0.0.0.0:50070)
dfs.https.address: default (0.0.0.0:50470)
dfs.secondary.http.address: default (0.0.0.0:50090)
yarn.nodemanager.address: default (${yarn.nodemanager.hostname}:0)
yarn.nodemanager.localizer.address: default (${yarn.nodemanager.hostname}:8040)
yarn.nodemanager.webapp.address: default (${yarn.nodemanager.hostname}:8042)
yarn.resourcemanager.address: default (${yarn.resourcemanager.hostname}:8032)
yarn.resourcemanager.admin.address: default (${yarn.resourcemanager.hostname}:8033)
yarn.resourcemanager.resource-tracker.address: default (${yarn.resourcemanager.hostname}:8031)
yarn.resourcemanager.scheduler.address: default (${yarn.resourcemanager.hostname}:8030)
yarn.resourcemanager.webapp.address: default (${yarn.resourcemanager.hostname}:8088)
yarn.web-proxy.address: default (no-value)
yarn.resourcemanager.hostname: emrDeterminedIP

Note

The term emrDeterminedIP refers to an IP address generated by the Amazon EMR control plane. In newer versions, this convention has been eliminated, except for the yarn.resourcemanager.hostname and fs.default.name settings.
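
If you need to override one of these values, supply it through the corresponding configuration classification, as described earlier in this section. A minimal sketch that moves the NameNode web UI to another port (50080 is an arbitrary example value, not an EMR default):

[
  {
    "Classification": "hdfs-site",
    "Properties": {
      "dfs.http.address": "0.0.0.0:50080"
    }
  }
]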

Application users

Applications run processes as their own user. For example, Hive JVMs run as user hive, MapReduce JVMs run as mapred, and so on. The following process status output demonstrates this:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
hive      6452  0.2  0.7 853684 218520 ?       Sl   16:32   0:13 /usr/lib/jvm/java-openjdk/bin/java -Xmx256m -Dhive.log.dir=/var/log/hive -Dhive.log.file=hive-metastore.log -Dhive.log.threshold=INFO -Dhadoop.log.dir=/usr/lib/hadoop
hive      6557  0.2  0.6 849508 202396 ?       Sl   16:32   0:09 /usr/lib/jvm/java-openjdk/bin/java -Xmx256m -Dhive.log.dir=/var/log/hive -Dhive.log.file=hive-server2.log -Dhive.log.threshold=INFO -Dhadoop.log.dir=/usr/lib/hadoop/l
hbase     6716  0.1  1.0 1755516 336600 ?      Sl   Jun21   2:20 /usr/lib/jvm/java-openjdk/bin/java -Dproc_master -XX:OnOutOfMemoryError=kill -9 %p -Xmx1024m -ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -Dhbase.log.dir=/var/
hbase     6871  0.0  0.7 1672196 237648 ?      Sl   Jun21   0:46 /usr/lib/jvm/java-openjdk/bin/java -Dproc_thrift -XX:OnOutOfMemoryError=kill -9 %p -Xmx1024m -ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -Dhbase.log.dir=/var/
hdfs      7491  0.4  1.0 1719476 309820 ?      Sl   16:32   0:22 /usr/lib/jvm/java-openjdk/bin/java -Dproc_namenode -Xmx1000m -Dhadoop.log.dir=/var/log/hadoop-hdfs -Dhadoop.log.file=hadoop-hdfs-namenode-ip-10-71-203-213.log -Dhadoo
yarn      8524  0.1  0.6 1626164 211300 ?      Sl   16:33   0:05 /usr/lib/jvm/java-openjdk/bin/java -Dproc_proxyserver -Xmx1000m -Dhadoop.log.dir=/var/log/hadoop-yarn -Dyarn.log.dir=/var/log/hadoop-yarn -Dhadoop.log.file=yarn-yarn-
yarn      8646  1.0  1.2 1876916 385308 ?      Sl   16:33   0:46 /usr/lib/jvm/java-openjdk/bin/java -Dproc_resourcemanager -Xmx1000m -Dhadoop.log.dir=/var/log/hadoop-yarn -Dyarn.log.dir=/var/log/hadoop-yarn -Dhadoop.log.file=yarn-y
mapred    9265  0.2  0.8 1666628 260484 ?      Sl   16:33   0:12 /usr/lib/jvm/java-openjdk/bin/java -Dproc_historyserver -Xmx1000m -Dhadoop.log.dir=/usr/lib/hadoop/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop