Amazon EMR Release Guide

What's New?

This chapter gives an overview of features and issues resolved in the current release of Amazon EMR as well as the historical record of this information for previous releases.

Release 5.0.0

The following release notes include information for the EMR 5.0.0 release. Changes are relative to the EMR 4.7.2 release.

Please see Previous Releases for more information about other EMR fixes and features.

Upgrades

The following upgrades are available in this release:

  • Upgraded to Hive 2.1

  • Upgraded to Presto 0.150

  • Upgraded to Spark 2.0

  • Upgraded to Hue 3.10.0

  • Upgraded to Pig 0.16.0

  • Upgraded to Tez 0.8.4

  • Upgraded to Zeppelin 0.6.1

Changes and Enhancements

The following are changes made to EMR releases for release label emr-5.0.0 or greater:

  • EMR supports the latest open-source versions of Hive (version 2.1) and Pig (version 0.16.0). If you have used Hive or Pig on EMR in the past, this may affect some use cases. For more information, see Apache Hive and Apache Pig.

  • The default execution engine for Hive and Pig is now Tez. To change this, edit the appropriate values in the hive-site and pig-properties configuration classifications, respectively.
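As a sketch, reverting both engines to MapReduce could look like the following configuration JSON (the format accepted by the EMR --configurations option). hive.execution.engine is the standard Hive property; the Pig exectype property name is an assumption here.

```python
import json

# Configuration classifications that switch the execution engine back to
# MapReduce. "mr" selects MapReduce in Hive; the Pig property name below
# ("exectype") is assumed, not confirmed by this guide.
configurations = [
    {
        "Classification": "hive-site",
        "Properties": {"hive.execution.engine": "mr"},
    },
    {
        "Classification": "pig-properties",
        "Properties": {"exectype": "mapreduce"},
    },
]

print(json.dumps(configurations, indent=2))
```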

  • An enhanced step debugging feature was added, which allows you to see the root cause of step failures if the service can determine the cause. For more information, see Enhanced Step Debugging in the Amazon EMR Management Guide.

  • Applications that previously ended with "-Sandbox" no longer have that suffix. This may break your automation, for example, if you are using scripts to launch clusters with these applications. The following table shows application names in EMR 4.7.2 versus EMR 5.0.0.

    Application name changes

    EMR 4.7.2          EMR 5.0.0
    Oozie-Sandbox      Oozie
    Presto-Sandbox     Presto
    Sqoop-Sandbox      Sqoop
    Zeppelin-Sandbox   Zeppelin
    ZooKeeper-Sandbox  ZooKeeper
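If your automation still passes the EMR 4.7.2 application names, a small helper built from the table above can translate them; this is a convenience sketch, not part of EMR itself.

```python
# Mapping of EMR 4.7.2 application names to their EMR 5.0.0 equivalents,
# taken from the "Application name changes" table.
NAME_CHANGES = {
    "Oozie-Sandbox": "Oozie",
    "Presto-Sandbox": "Presto",
    "Sqoop-Sandbox": "Sqoop",
    "Zeppelin-Sandbox": "Zeppelin",
    "ZooKeeper-Sandbox": "ZooKeeper",
}

def emr5_app_name(name: str) -> str:
    """Return the EMR 5.0.0 name for an application, dropping any -Sandbox suffix."""
    return NAME_CHANGES.get(name, name)
```

Names not in the table (for example, Spark) pass through unchanged, so the helper is safe to apply to an entire application list.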

  • Spark is now compiled for Scala 2.11.

  • Java 8 is now the default JVM. All applications run using the Java 8 runtime. There are no changes to any application’s bytecode target; most applications continue to target Java 7.

  • Zeppelin now includes authentication features. For more information, see Apache Zeppelin.

Previous Releases

Release 4.7.2

The following release notes include information for EMR 4.7.2. Please see Previous Releases for more information about previous EMR fixes and features.

Features

The following features are available in this release:

  • Upgraded to Mahout 0.12.2

  • Upgraded to Presto 0.148

  • Upgraded to Spark 1.6.2

  • You can now create an AWSCredentialsProvider for use with EMRFS using a URI as a parameter. For more information, see Creating an AWSCredentialsProvider for EMRFS.

  • EMRFS now allows users to configure a custom DynamoDB endpoint for their Consistent View metadata using the fs.s3.consistent.dynamodb.endpoint property in emrfs-site.xml.

  • Added a script in /usr/bin called spark-example, which wraps /usr/lib/spark/bin/run-example so you can run examples directly. For instance, to run the SparkPi example that comes with the Spark distribution, you can run spark-example SparkPi 100 from the command line or using command-runner.jar as a step in the API.
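The fs.s3.consistent.dynamodb.endpoint setting mentioned above is supplied through the emrfs-site classification. The following is a minimal sketch; the endpoint URL is illustrative only, and the fs.s3.consistent flag is included here on the assumption that consistent view is being enabled.

```python
import json

# emrfs-site classification pointing EMRFS consistent view at a custom
# DynamoDB endpoint. The endpoint URL below is an example, not a requirement.
emrfs_site = {
    "Classification": "emrfs-site",
    "Properties": {
        "fs.s3.consistent": "true",
        "fs.s3.consistent.dynamodb.endpoint": "https://dynamodb.us-west-2.amazonaws.com",
    },
}

print(json.dumps([emrfs_site], indent=2))
```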

Known Issues Resolved from Previous Releases

  • Fixed an issue where spark-assembly.jar was not in the correct location for Oozie when Spark was also installed, which resulted in failure to launch Spark applications with Oozie.

  • Fixed an issue with Spark Log4j-based logging in YARN containers.

Release 4.7.1

Known Issues Resolved from Previous Releases

  • Fixed an issue that extended the startup time of clusters launched in a VPC with private subnets. The bug only impacted clusters launched with the EMR 4.7.0 release.

  • Fixed an issue that improperly handled listing of files in Amazon EMR for clusters launched with the EMR 4.7.0 release.

Release 4.7.0

Important

EMR 4.7.0 is deprecated. Please use EMR 4.7.1 or later instead.

Features

The following features are available in this release:

  • Added Apache Phoenix 4.7.0

  • Added Apache Tez 0.8.3

  • Upgraded to HBase 1.2.1

  • Upgraded to Mahout 0.12.0

  • Upgraded to Presto 0.147

  • Upgraded the AWS Java SDK to 1.10.75

  • The final flag was removed from the mapreduce.cluster.local.dir property in mapred-site.xml to allow users to run Pig in local mode.

Amazon Redshift JDBC Drivers Available on Cluster

Amazon Redshift JDBC drivers are now included at /usr/share/aws/redshift/jdbc. /usr/share/aws/redshift/jdbc/RedshiftJDBC41.jar is the JDBC 4.1-compatible Amazon Redshift driver and /usr/share/aws/redshift/jdbc/RedshiftJDBC4.jar is the JDBC 4.0-compatible Amazon Redshift driver. For more information, see Configure a JDBC Connection in the Amazon Redshift Cluster Management Guide.

Java 8

Except for Presto, OpenJDK 1.7 is the default JDK used for all applications. However, both OpenJDK 1.7 and 1.8 are installed. For information about how to set JAVA_HOME for applications, see Configuring Applications to Use Java 8.

Known Issues Resolved from Previous Releases

  • Fixed a kernel issue that significantly affected performance on Throughput Optimized HDD (st1) EBS volumes for Amazon EMR in emr-4.6.0.

  • Fixed an issue where a cluster would fail if any HDFS encryption zone were specified without choosing Hadoop as an application.

  • Changed the default HDFS write policy from RoundRobin to AvailableSpaceVolumeChoosingPolicy. Some volumes were not properly utilized with the RoundRobin configuration, which resulted in failed core nodes and an unreliable HDFS.

  • Fixed an issue with the EMRFS CLI, which would cause an exception when creating the default DynamoDB metadata table for consistent views.

  • Fixed a deadlock issue in EMRFS that potentially occurred during multipart rename and copy operations.

  • Fixed an issue with EMRFS that caused the CopyPart size default to be 5 MB. The default is now properly set at 128 MB.

  • Fixed an issue with the Zeppelin upstart configuration which potentially prevented customers from stopping the service.

  • Fixed an issue with Spark and Zeppelin, which prevented customers from using the s3a:// URI scheme because /usr/lib/hadoop/hadoop-aws.jar was not properly loaded in their respective classpath.

  • Backported HUE-2484.

  • Backported a commit from Hue 3.9.0 (no JIRA exists) to fix an issue with the HBase browser sample.

  • Backported HIVE-9073.

Release 4.6.0

Features

The following features are available in this release:

Issue Affecting Throughput Optimized HDD (st1) EBS Volume Types

An issue in Linux kernel versions 4.2 and above significantly affects performance on Throughput Optimized HDD (st1) EBS volumes for EMR. This release (emr-4.6.0) uses kernel version 4.4.5 and is therefore impacted. We advise customers not to use emr-4.6.0 if they wish to use st1 EBS volumes. Customers can use emr-4.5.0 or prior EMR releases with st1 without impact. In addition, we will provide the fix with future releases.

Python Defaults

Python 3.4 is now installed by default, but Python 2.7 remains the system default. You can configure Python 3.4 as the system default using a bootstrap action. To affect the Python version used by PySpark, use the configuration API to set the PYSPARK_PYTHON export to /usr/bin/python3.4 in the spark-env classification.
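A minimal sketch of that spark-env classification follows; spark-env takes environment variables through a nested export classification, which is why the structure is deeper than a flat Properties map.

```python
import json

# spark-env classification setting PYSPARK_PYTHON through the nested
# "export" classification, so PySpark picks up Python 3.4.
spark_env = {
    "Classification": "spark-env",
    "Properties": {},
    "Configurations": [
        {
            "Classification": "export",
            "Properties": {"PYSPARK_PYTHON": "/usr/bin/python3.4"},
        }
    ],
}

print(json.dumps([spark_env], indent=2))
```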

Java 8

Except for Presto, OpenJDK 1.7 is the default JDK used for all applications. However, both OpenJDK 1.7 and 1.8 are installed. For information about how to set JAVA_HOME for applications, see Configuring Applications to Use Java 8.

Known Issues Resolved from Previous Releases

  • Fixed an issue where application provisioning would sometimes randomly fail due to a generated password.

  • Previously, mysqld was installed on all nodes. Now, it is only installed on the master instance and only if the chosen application includes mysql-server as a component. Currently, the following applications include the mysql-server component: HCatalog, Hive, Hue, Presto-Sandbox, and Sqoop-Sandbox.

  • Changed yarn.scheduler.maximum-allocation-vcores from the default of 32 to 80. This fixes an issue introduced in emr-4.4.0 that mainly occurs with Spark when using the maximizeResourceAllocation option in a cluster whose core instance type has YARN vcores set higher than 32; the c4.8xlarge, cc2.8xlarge, hs1.8xlarge, i2.8xlarge, m2.4xlarge, r3.8xlarge, d2.8xlarge, and m4.10xlarge instance types were affected by this issue.

  • s3-dist-cp now uses EMRFS for all Amazon S3 destinations and no longer stages to a temporary HDFS directory.

  • Fixed an issue with exception handling for client-side encryption multipart uploads.

  • Added an option to allow users to change the Amazon S3 storage class. By default this setting is STANDARD. The emrfs-site configuration classification setting is fs.s3.storageClass and the possible values are STANDARD, STANDARD_IA, and REDUCED_REDUNDANCY. For more information about storage classes, see Storage Classes in the Amazon Simple Storage Service Developer Guide.
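The storage-class option above is set through the emrfs-site classification. The following sketch builds that classification and guards against values outside the three documented storage classes; the helper function itself is a convenience, not part of EMR.

```python
import json

# The three storage classes documented for fs.s3.storageClass.
VALID_STORAGE_CLASSES = {"STANDARD", "STANDARD_IA", "REDUCED_REDUNDANCY"}

def storage_class_config(storage_class: str) -> dict:
    """Build the emrfs-site classification for fs.s3.storageClass."""
    if storage_class not in VALID_STORAGE_CLASSES:
        raise ValueError(f"unsupported storage class: {storage_class}")
    return {
        "Classification": "emrfs-site",
        "Properties": {"fs.s3.storageClass": storage_class},
    }

print(json.dumps([storage_class_config("STANDARD_IA")], indent=2))
```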

Release 4.5.0

Features

The following features are available in this release:

  • Upgraded to Spark 1.6.1

  • Upgraded to Hadoop 2.7.2

  • Upgraded to Presto 0.140

  • Added AWS KMS support for Amazon S3 server-side encryption.

Known Issues Resolved from Previous Releases

  • Fixed an issue where MySQL and Apache servers would not start after a node was rebooted.

  • Fixed an issue where IMPORT did not work correctly with non-partitioned tables stored in Amazon S3.

  • Fixed an issue with Presto where it requires the staging directory to be /mnt/tmp rather than /tmp when writing to Hive tables.

Release 4.4.0

Features

The following features are available in this release:

  • Added HCatalog 1.0.0

  • Added Sqoop-Sandbox 1.4.6

  • Upgraded to Presto 0.136

  • Upgraded to Zeppelin 0.5.6

  • Upgraded to Mahout 0.11.1

  • Enabled dynamicResourceAllocation by default.

  • Added a table of all configuration classifications for the release. See the Configuration Classifications table in Configuring Applications.

Known Issues Resolved from Previous Releases

  • Fixed an issue where the maximizeResourceAllocation setting would not reserve enough memory for YARN ApplicationMaster daemons.

  • Fixed an issue encountered with a custom DNS. If any entries in resolv.conf precede the custom entries provided, then the custom entries are not resolvable. This behavior was exhibited by clusters in a VPC where the default VPC name server is inserted as the top entry in resolv.conf.

  • Fixed an issue where the default Python moved to version 2.7 and boto was not installed for that version.

  • Fixed an issue where YARN containers and Spark applications would each generate a unique Ganglia round robin database (rrd) file, which resulted in the first disk attached to the instance filling up. As a result of this fix, both YARN container-level metrics and Spark application-level metrics have been disabled.

  • Fixed an issue in log pusher where it would delete all empty log folders. The effect was that the Hive CLI was not able to log because log pusher was removing the empty user folder under /var/log/hive.

  • Fixed an issue affecting Hive imports, which affected partitioning and resulted in an error during import.

  • Fixed an issue where EMRFS and s3-dist-cp did not properly handle bucket names which contain periods.

  • Changed a behavior in EMRFS so that in versioning-enabled buckets the _$folder$ marker file is not continuously created, which may contribute to improved performance for versioning-enabled buckets.

  • Changed the behavior in EMRFS such that it does not use instruction files with the exception of cases where client-side encryption is enabled. If you wish to delete instruction files while using client-side encryption, you can set the emrfs-site.xml property, fs.s3.cse.cryptoStorageMode.deleteInstructionFiles.enabled, to true.

  • Changed YARN log aggregation to retain logs at the aggregation destination for two days. The default destination is your cluster's HDFS storage. If you wish to change this duration, change the value of yarn.log-aggregation.retain-seconds using the yarn-site configuration classification when you create your cluster. As always, you can save your application logs to Amazon S3 using the log-uri parameter when you create your cluster.
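As a quick sanity check on the retention setting above, two days is 172,800 seconds. The sketch below computes that value and builds a yarn-site classification that instead retains logs for seven days; the seven-day figure is purely an example.

```python
import json

# The new default retains aggregated logs for two days.
DEFAULT_RETAIN_SECONDS = 2 * 24 * 60 * 60  # 172800

# yarn-site classification overriding the retention window to seven days
# (an illustrative value). YARN expects the value as a string of seconds.
yarn_site = {
    "Classification": "yarn-site",
    "Properties": {
        "yarn.log-aggregation.retain-seconds": str(7 * 24 * 60 * 60),
    },
}

print(json.dumps([yarn_site], indent=2))
```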

Patches Applied

The following patches from open source projects were included in this release:

Release 4.3.0

Features

The following features are available in this release:

  • Upgraded to Hadoop 2.7.1

  • Upgraded to Spark 1.6.0

  • Upgraded Ganglia to 3.7.2

  • Upgraded Presto to 0.130

Amazon EMR made some changes to Spark defaults that apply when spark.dynamicAllocation.enabled is set to true (it is false by default). When set to true, this affects the defaults set by the maximizeResourceAllocation setting:

  • If spark.dynamicAllocation.enabled is set to true, spark.executor.instances is not set by maximizeResourceAllocation.

  • The spark.driver.memory setting is now configured based on the instance types in the cluster in a similar way to how spark.executor.memory is set. However, because the Spark driver application may run on either the master or one of the core instances (for example, in YARN client and cluster modes, respectively), spark.driver.memory is based on the smaller of the instance types in these two instance groups.

  • The spark.default.parallelism setting is now set at twice the number of CPU cores available for YARN containers. In previous releases, this was half that value.

  • The calculations for the memory overhead reserved for Spark YARN processes was adjusted to be more accurate, resulting in a small increase in the total amount of memory available to Spark (that is, spark.executor.memory).
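The spark.default.parallelism change above is simple arithmetic: twice the YARN-container cores instead of half. The sketch below illustrates it with a hypothetical vcore count; 16 is an assumed figure, not a value from this guide.

```python
# Illustration of the spark.default.parallelism change described above.
# yarn_vcores is a hypothetical count of CPU cores available for YARN containers.
yarn_vcores = 16

parallelism_pre_430 = yarn_vcores // 2  # earlier releases: half the cores
parallelism_430 = yarn_vcores * 2       # emr-4.3.0 and later: twice the cores

# The new default is four times the old one for the same cluster.
assert parallelism_430 == 4 * parallelism_pre_430
```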

Known Issues Resolved from Previous Releases

  • YARN log aggregation is now enabled by default.

  • Fixed an issue where logs would not be pushed to a cluster's Amazon S3 logs bucket when YARN log aggregation was enabled.

  • YARN container sizes now have a new minimum of 32 across all node types.

  • Fixed an issue with Ganglia which caused excessive disk I/O on the master node in large clusters.

  • Fixed an issue that prevented application logs from being pushed to Amazon S3 when a cluster is shutting down.

  • Fixed an issue in EMRFS CLI that caused certain commands to fail.

  • Fixed an issue with Zeppelin that prevented dependencies from being loaded in the underlying SparkContext.

  • Fixed an issue that resulted from issuing a resize request to add instances.

  • Fixed an issue in Hive where CREATE TABLE AS SELECT makes excessive list calls to Amazon S3.

  • Fixed an issue where large clusters would not provision properly when Hue, Oozie, and Ganglia are installed.

  • Fixed an issue in s3-dist-cp where it would return a zero exit code even if it failed with an error.

Patches Applied

The following patches from open source projects were included in this release:

Release 4.2.0

Features

The following features are available in this release:

  • Added Ganglia support

  • Upgraded to Spark 1.5.2

  • Upgraded to Presto 0.125

  • Upgraded Oozie to 4.2.0

  • Upgraded Zeppelin to 0.5.5

  • Upgraded the AWS Java SDK to 1.10.27

Known Issues Resolved from Previous Releases

  • Fixed an issue with the EMRFS CLI where it did not use the default metadata table name.

  • Fixed an issue encountered when using ORC-backed tables in Amazon S3.

  • Fixed an issue encountered with a Python version mismatch in the Spark configuration.

  • Fixed an issue when a YARN node status fails to report because of DNS issues for clusters in a VPC.

  • Fixed an issue encountered when YARN decommissioned nodes, resulting in hanged applications or the inability to schedule new applications.

  • Fixed an issue encountered when clusters terminated with status TIMED_OUT_STARTING.

  • Fixed an issue encountered when including the EMRFS Scala dependency in other builds. The Scala dependency has been removed.