Amazon Elastic MapReduce
Amazon EMR Release Guide

What's New?

This chapter gives an overview of features and issues resolved in the current release as well as the historical record of this information for previous releases.

Release 4.6.0


The following features are available in this release:

Issue Affecting Throughput Optimized HDD (st1) EBS Volume Types

An issue in Linux kernel versions 4.2 and above significantly affects performance on Throughput Optimized HDD (st1) EBS volumes for EMR. This release (emr-4.6.0) uses kernel version 4.4.5 and hence is impacted. Therefore, we advise customers not to use emr-4.6.0 if they wish to use st1 EBS volumes. Customers can use emr-4.5.0 or prior EMR releases with st1 without impact. In addition, we will provide the fix with future releases.

Python Defaults

Python 3.4 is now installed by default, but Python 2.7 remains the system default. Customers may configure Python 3.4 as the system default using a bootstrap action. Alternatively, you can use the configuration API to set the PYSPARK_PYTHON export to /usr/bin/python3.4 in the spark-env classification, which affects the Python version used by PySpark.
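
As a sketch of the configuration API's classification format (the nested "export" classification mirrors how environment variables are set in spark-env), the PySpark interpreter could be pointed at Python 3.4 with a configuration object like the following:

```json
[
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "/usr/bin/python3.4"
        }
      }
    ]
  }
]
```

A JSON array like this is typically passed with the --configurations option of aws emr create-cluster.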

Java 8

Except for Presto, OpenJDK 1.7 is the default JDK used for all applications. However, both OpenJDK 1.7 and 1.8 are installed. For information about how to set JAVA_HOME for applications, see Configuring Applications to Use Java 8.
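
As a hedged sketch, JAVA_HOME could be set for Hadoop processes through the hadoop-env classification's export sub-classification; the JDK path shown is the typical OpenJDK 1.8 location on EMR instances and should be verified on your cluster:

```json
[
  {
    "Classification": "hadoop-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "JAVA_HOME": "/usr/lib/jvm/java-1.8.0"
        }
      }
    ]
  }
]
```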

Known Issues Resolved from Previous Releases

  • Fixed an issue where application provisioning would sometimes randomly fail due to a generated password.

  • Previously, mysqld was installed on all nodes. Now, it is only installed on the master instance and only if the chosen application includes mysql-server as a component. Currently, the following applications include the mysql-server component: HCatalog, Hive, Hue, Presto-Sandbox, and Sqoop-Sandbox.

  • Changed yarn.scheduler.maximum-allocation-vcores from the default of 32 to 80. This fixes an issue introduced in emr-4.4.0 that mainly occurs with Spark while using the maximizeResourceAllocation option in a cluster whose core instance type has YARN vcores set higher than 32; namely, c4.8xlarge, cc2.8xlarge, hs1.8xlarge, i2.8xlarge, m2.4xlarge, r3.8xlarge, d2.8xlarge, and m4.10xlarge were affected by this issue.

  • s3-dist-cp now uses EMRFS for all Amazon S3 destinations and no longer stages to a temporary HDFS directory.

  • Fixed an issue with exception handling for client-side encryption multipart uploads.

  • Added an option to allow users to change the Amazon S3 storage class. By default this setting is STANDARD. The emrfs-site configuration classification setting is fs.s3.storageClass and the possible values are STANDARD, STANDARD_IA, and REDUCED_REDUNDANCY. For more information about storage classes, see Storage Classes in the Amazon Simple Storage Service Developer Guide.
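
For example, the storage class could be changed with a configuration object such as the following (a sketch of the emrfs-site classification; STANDARD_IA is chosen here purely for illustration):

```json
[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.storageClass": "STANDARD_IA"
    }
  }
]
```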

Previous Releases

Release 4.5.0


The following features are available in this release:

Known Issues Resolved from Previous Releases

  • Fixed an issue where MySQL and Apache servers would not start after a node was rebooted.

  • Fixed an issue where IMPORT did not work correctly with non-partitioned tables stored in Amazon S3.

  • Fixed an issue with Presto where it requires the staging directory to be /mnt/tmp rather than /tmp when writing to Hive tables.

Release 4.4.0


The following features are available in this release:

  • Added HCatalog 1.0.0

  • Added Sqoop-Sandbox 1.4.6

  • Upgraded to Presto 0.136

  • Upgraded to Zeppelin 0.5.6

  • Upgraded to Mahout 0.11.1

  • Enabled dynamicResourceAllocation by default.

  • Added a table of all configuration classifications for the release. See the Configuration Classifications table in Configuring Applications.

Known Issues Resolved from Previous Releases

  • Fixed an issue where the maximizeResourceAllocation setting would not reserve enough memory for YARN ApplicationMaster daemons.

  • Fixed an issue encountered with a custom DNS. If any entries in resolv.conf preceded the custom entries provided, then the custom entries were not resolvable. This behavior affected clusters in a VPC where the default VPC nameserver is inserted as the top entry in resolv.conf.

  • Fixed an issue where the default Python moved to version 2.7 and boto was not installed for that version.

  • Fixed an issue where YARN containers and Spark applications would each generate a unique Ganglia round robin database (rrd) file, which resulted in the first disk attached to the instance filling up. As a result of this fix, YARN container-level metrics and Spark application-level metrics have been disabled.

  • Fixed an issue in log pusher where it would delete all empty log folders. The effect was that the Hive CLI was not able to log because log pusher was removing the empty user folder under /var/log/hive.

  • Fixed an issue affecting Hive imports, which affected partitioning and resulted in an error during import.

  • Fixed an issue where EMRFS and s3-dist-cp did not properly handle bucket names which contain periods.

  • Changed a behavior in EMRFS so that in versioning-enabled buckets the _$folder$ marker file is not continuously created, which may contribute to improved performance for versioning-enabled buckets.

  • Changed the behavior in EMRFS such that it does not use instruction files with the exception of cases where client-side encryption is enabled. If you wish to delete instruction files while using client-side encryption, you can set the emrfs-site.xml property, fs.s3.cse.cryptoStorageMode.deleteInstructionFiles.enabled, to true.

  • Changed YARN log aggregation to retain logs at the aggregation destination for two days. The default destination is your cluster's HDFS storage. If you wish to change this duration, change the value of yarn.log-aggregation.retain-seconds using the yarn-site configuration classification when you create your cluster. As always, you can save your application logs to Amazon S3 using the log-uri parameter when you create your cluster.
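
To make the two settings above concrete, both could be supplied together as configuration classifications when creating a cluster (a sketch of the configuration API's JSON format; the retention value of 604800 seconds, that is seven days, is only an example):

```json
[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.cse.cryptoStorageMode.deleteInstructionFiles.enabled": "true"
    }
  },
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.log-aggregation.retain-seconds": "604800"
    }
  }
]
```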

Patches Applied

The following patches from open source projects were included in this release:

Release 4.3.0


The following features are available in this release:

  • Upgraded to Hadoop 2.7.1

  • Upgraded to Spark 1.6.0

  • Upgraded Ganglia to 3.7.2

  • Upgraded Presto to 0.130

Amazon EMR made some changes to the behavior of spark.dynamicAllocation.enabled when it is set to true; it is false by default. When set to true, this affects the defaults set by the maximizeResourceAllocation setting:

  • If spark.dynamicAllocation.enabled is set to true, spark.executor.instances is not set by maximizeResourceAllocation.

  • The spark.driver.memory setting is now configured based on the instance types in the cluster in a similar way to how spark.executor.memory is set. However, because the Spark driver application may run on either the master or one of the core instances (for example, in YARN client and cluster modes, respectively), the spark.driver.memory setting is set based on the smaller instance type between these two instance groups.

  • The spark.default.parallelism setting is now set at twice the number of CPU cores available for YARN containers. In previous releases, this was half that value.

  • The calculations for the memory overhead reserved for Spark YARN processes were adjusted to be more accurate, resulting in a small increase in the total amount of memory available to Spark (that is, spark.executor.memory).
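
A minimal sketch of enabling both settings discussed above through the configuration API (classification names follow the Configuration Classifications table; the values are examples, not recommendations):

```json
[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.dynamicAllocation.enabled": "true"
    }
  }
]
```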

Known Issues Resolved from Previous Releases

  • YARN log aggregation is now enabled by default.

  • Fixed an issue where logs would not be pushed to a cluster's Amazon S3 logs bucket when YARN log aggregation was enabled.

  • YARN container sizes now have a new minimum of 32 across all node types.

  • Fixed an issue with Ganglia which caused excessive disk I/O on the master node in large clusters.

  • Fixed an issue that prevented application logs from being pushed to Amazon S3 when a cluster is shutting down.

  • Fixed an issue in EMRFS CLI that caused certain commands to fail.

  • Fixed an issue with Zeppelin that prevented dependencies from being loaded in the underlying SparkContext.

  • Fixed an issue that occurred when a resize request was issued to add instances.

  • Fixed an issue in Hive where CREATE TABLE AS SELECT makes excessive list calls to Amazon S3.

  • Fixed an issue where large clusters would not provision properly when Hue, Oozie, and Ganglia are installed.

  • Fixed an issue in s3-dist-cp where it would return a zero exit code even if it failed with an error.

Patches Applied

The following patches from open source projects were included in this release:

Release 4.2.0


The following features are available in this release:

  • Added Ganglia support

  • Upgraded to Spark 1.5.2

  • Upgraded to Presto 0.125

  • Upgraded Oozie to 4.2.0

  • Upgraded Zeppelin to 0.5.5

  • Upgraded the AWS Java SDK to 1.10.27

Known Issues Resolved from Previous Releases

  • Fixed an issue with the EMRFS CLI where it did not use the default metadata table name.

  • Fixed an issue encountered when using ORC-backed tables in Amazon S3.

  • Fixed an issue encountered with a Python version mismatch in the Spark configuration.

  • Fixed an issue where a YARN node status failed to report because of DNS issues for clusters in a VPC.

  • Fixed an issue encountered when YARN decommissioned nodes, resulting in hung applications or the inability to schedule new applications.

  • Fixed an issue encountered when clusters terminated with status TIMED_OUT_STARTING.

  • Fixed an issue encountered when including the EMRFS Scala dependency in other builds. The Scala dependency has been removed.

Links to All Release Guides

The following are links to all versions of Amazon EMR Release Guide: