Amazon EMR Release Guide

Configuring Applications

You can override the default configurations for the applications you install by supplying a configuration object when you specify the applications to install at cluster creation time. A configuration object consists of a classification, properties, and optional nested configurations. A classification refers to an application-specific configuration file, and properties are the settings you want to change in that file. You typically supply configurations in a JSON list, which lets you edit multiple configuration files in a single request.

Example JSON for a list of configurations is provided below:

[
  {
    "Classification": "core-site",
    "Properties": {
      "hadoop.security.groups.cache.secs": "250"
    }
  },
  {
    "Classification": "mapred-site",
    "Properties": {
      "mapred.tasktracker.map.tasks.maximum": "2",
      "mapreduce.map.sort.spill.percent": "0.90",
      "mapreduce.tasktracker.reduce.tasks.maximum": "5"
    }
  }
]

The classification usually specifies the file name that you want modified. There are exceptions. One is the deprecated bootstrap action configure-daemons, which was used to set environment parameters such as --namenode-heap-size. Options like these are now subsumed into the hadoop-env and yarn-env classifications, each with its own nested export classification. Another exception is s3get, which was used to place a customer EncryptionMaterialsProvider object on each node in a cluster for use in client-side encryption; an option was added to the emrfs-site classification for this purpose (sketched below). For information about the relationship between bootstrap actions and configurations, see Configurations Replace Predefined Bootstrap Actions.
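The following is a hypothetical sketch of that emrfs-site option for client-side encryption. The provider class name and JAR location are placeholder values, and the fs.s3.cse.* property names should be confirmed against the EMRFS client-side encryption documentation:

[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.cse.enabled": "true",
      "fs.s3.cse.encryptionMaterialsProvider": "com.mycompany.MyEncryptionMaterialsProvider",
      "fs.s3.cse.encryptionMaterialsProvider.uri": "s3://mybucket/MyEncryptionMaterialsProvider.jar"
    }
  }
]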

An example of the hadoop-env classification is provided below:

[
  {
    "Classification": "hadoop-env",
    "Properties": {},
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "HADOOP_DATANODE_HEAPSIZE": "2048",
          "HADOOP_NAMENODE_OPTS": "-XX:GCTimeRatio=19"
        },
        "Configurations": []
      }
    ]
  }
]

An example of the yarn-env classification is provided below:

[
  {
    "Classification": "yarn-env",
    "Properties": {},
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "YARN_RESOURCEMANAGER_OPTS": "-Xdebug -Xrunjdwp:transport=dt_socket"
        },
        "Configurations": []
      }
    ]
  }
]

The following settings do not belong to a configuration file but are used by Amazon EMR to set multiple settings on your behalf.

Amazon EMR-curated Settings

Application: Spark
Release label classification: spark
Valid properties: maximizeResourceAllocation
When to use: Configure executors to use the maximum resources of each node.
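For example, the following configuration turns the setting on:

[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  }
]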

The following are all configuration classifications for this release:

Configuration Classifications

capacity-scheduler: Change values in Hadoop's capacity-scheduler.xml file.
core-site: Change values in Hadoop's core-site.xml file.
emrfs-site: Change EMRFS settings.
hadoop-env: Change values in the Hadoop environment for all Hadoop components.
hadoop-log4j: Change values in Hadoop's log4j.properties file.
hbase-env: Change values in HBase's environment.
hbase-log4j: Change values in HBase's hbase-log4j.properties file.
hbase-metrics: Change values in HBase's hadoop-metrics2-hbase.properties file.
hbase-policy: Change values in HBase's hbase-policy.xml file.
hbase-site: Change values in HBase's hbase-site.xml file.
hdfs-encryption-zones: Configure HDFS encryption zones.
hdfs-site: Change values in HDFS's hdfs-site.xml file.
hcatalog-env: Change values in HCatalog's environment.
hcatalog-server-jndi: Change values in HCatalog's jndi.properties file.
hcatalog-server-proto-hive-site: Change values in HCatalog's proto-hive-site.xml file.
hcatalog-webhcat-env: Change values in HCatalog WebHCat's environment.
hcatalog-webhcat-log4j: Change values in HCatalog WebHCat's log4j.properties file.
hcatalog-webhcat-site: Change values in HCatalog WebHCat's webhcat-site.xml file.
hive-env: Change values in the Hive environment.
hive-exec-log4j: Change values in Hive's hive-exec-log4j.properties file.
hive-log4j: Change values in Hive's hive-log4j.properties file.
hive-site: Change values in Hive's hive-site.xml file.
hue-ini: Change values in Hue's hue.ini file.
httpfs-env: Change values in the HTTPFS environment.
httpfs-site: Change values in Hadoop's httpfs-site.xml file.
hadoop-kms-acls: Change values in Hadoop's kms-acls.xml file.
hadoop-kms-env: Change values in the Hadoop KMS environment.
hadoop-kms-log4j: Change values in Hadoop's kms-log4j.properties file.
hadoop-kms-site: Change values in Hadoop's kms-site.xml file.
mapred-env: Change values in the MapReduce application's environment.
mapred-site: Change values in the MapReduce application's mapred-site.xml file.
oozie-env: Change values in Oozie's environment.
oozie-log4j: Change values in Oozie's oozie-log4j.properties file.
oozie-site: Change values in Oozie's oozie-site.xml file.
pig-properties: Change values in Pig's pig.properties file.
pig-log4j: Change values in Pig's log4j.properties file.
presto-log: Change values in Presto's log.properties file.
presto-config: Change values in Presto's config.properties file.
presto-connector-hive: Change values in Presto's hive.properties file.
spark: Amazon EMR-curated settings for Apache Spark.
spark-defaults: Change values in Spark's spark-defaults.conf file.
spark-env: Change values in the Spark environment.
spark-log4j: Change values in Spark's log4j.properties file.
spark-metrics: Change values in Spark's metrics.properties file.
sqoop-env: Change values in Sqoop's environment.
sqoop-oraoop-site: Change values in Sqoop OraOop's oraoop-site.xml file.
sqoop-site: Change values in Sqoop's sqoop-site.xml file.
yarn-env: Change values in the YARN environment.
yarn-site: Change values in YARN's yarn-site.xml file.
zeppelin-env: Change values in the Zeppelin environment.
zookeeper-config: Change values in ZooKeeper's zoo.cfg file.
zookeeper-log4j: Change values in ZooKeeper's log4j.properties file.


Example Supplying a Configuration in the Console

To supply a configuration, navigate to the Create cluster page and choose Edit software settings. You can then enter the configuration directly in the console, either as JSON or in the shorthand syntax demonstrated in shadow text, or provide an Amazon S3 URI for a file containing a JSON Configurations object.
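For reference, the console's shorthand resembles the following; the classification and key names here are placeholders rather than the console's exact shadow text:

classification=config-file-name,properties=[myKey1=myValue1,myKey2=myValue2]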


Example Supplying a Configuration Using the CLI

You can provide a configuration to create-cluster by supplying a path to a JSON file stored locally or in Amazon S3:

aws emr create-cluster --release-label emr-4.6.0 --instance-type m3.xlarge \
--instance-count 2 --applications Name=Hive \
--configurations https://s3.amazonaws.com/mybucket/myfolder/myConfig.json
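
If the file is stored locally rather than in Amazon S3, you can reference it with the AWS CLI's file:// prefix; the path below is a placeholder:

aws emr create-cluster --release-label emr-4.6.0 --instance-type m3.xlarge \
--instance-count 2 --applications Name=Hive \
--configurations file://./myConfig.json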

Example Supplying a Configuration Using the Java SDK

The following program excerpt shows how to supply a configuration using the AWS SDK for Java:


import java.util.HashMap;
import java.util.Map;

import com.amazonaws.services.elasticmapreduce.model.Application;
import com.amazonaws.services.elasticmapreduce.model.Configuration;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;

// Install Hive on the cluster.
Application hive = new Application().withName("Hive");

// Settings to override in hive-site.xml.
Map<String, String> hiveProperties = new HashMap<String, String>();
hiveProperties.put("hive.join.emit.interval", "1000");
hiveProperties.put("hive.merge.mapfiles", "true");

// Attach the properties to the hive-site classification.
Configuration myHiveConfig = new Configuration()
    .withClassification("hive-site")
    .withProperties(hiveProperties);

RunJobFlowRequest request = new RunJobFlowRequest()
    .withName("Create cluster with ReleaseLabel")
    .withReleaseLabel("emr-4.6.0")
    .withApplications(hive)
    .withConfigurations(myHiveConfig)
    .withServiceRole("EMR_DefaultRole")
    .withJobFlowRole("EMR_EC2_DefaultRole")
    .withInstances(new JobFlowInstancesConfig()
        .withEc2KeyName("myKey")
        .withInstanceCount(1)
        .withKeepJobFlowAliveWhenNoSteps(true)
        .withMasterInstanceType("m3.xlarge")
        .withSlaveInstanceType("m3.xlarge"));
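
The excerpt builds the request but does not submit it. The following is a minimal sketch of the submission step, assuming credentials and region come from your default provider chain:

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;

AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();
// runJobFlow launches the cluster; the result carries the new cluster (job flow) ID.
RunJobFlowResult result = emr.runJobFlow(request);
System.out.println("Started cluster " + result.getJobFlowId());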

Configuring Applications to Use Java 8

You set JAVA_HOME for an application by supplying the setting to its environment classification, application-env (for example, hadoop-env or spark-env). For Hadoop and Hive, this would look like the following:

[
    {
        "Classification": "hadoop-env", 
        "Configurations": [
            {
                "Classification": "export", 
                "Configurations": [], 
                "Properties": {
                    "JAVA_HOME": "/usr/lib/jvm/java-1.8.0"
                }
            }
        ], 
        "Properties": {}
    }
]

For Spark, if you are writing a driver for submission in cluster mode, the driver uses Java 7, but setting the environment can ensure that the executors use Java 8. To do this, we recommend setting both the hadoop-env and spark-env classifications:

[
    {
        "Classification": "hadoop-env", 
        "Configurations": [
            {
                "Classification": "export", 
                "Configurations": [], 
                "Properties": {
                    "JAVA_HOME": "/usr/lib/jvm/java-1.8.0"
                }
            }
        ], 
        "Properties": {}
    }, 
    {
        "Classification": "spark-env", 
        "Configurations": [
            {
                "Classification": "export", 
                "Configurations": [], 
                "Properties": {
                    "JAVA_HOME": "/usr/lib/jvm/java-1.8.0"
                }
            }
        ], 
        "Properties": {}
    }
]