Amazon EMR Release Guide

What's New?

This documentation is for versions 4.x and 5.x of Amazon EMR. For information about Amazon EMR AMI versions 2.x and 3.x, see the Amazon EMR Developer Guide (PDF).

This chapter gives an overview of features and issues resolved in the current release of Amazon EMR as well as the historical record of this information for earlier-version releases back to version 4.2.0.

Release 5.7.0 (Most Recent)

The following release notes include information for the Amazon EMR 5.7.0 release. Changes are relative to the Amazon EMR 5.6.0 release.

Upgrades

  • Flink 1.3.0

  • Phoenix 4.11.0

  • Zeppelin 0.7.2

New Features

  • Added the ability to specify a custom Amazon Linux AMI when you create a cluster. For more information, see Using a Custom AMI.

Changes, Enhancements, and Resolved Issues

  • HBase

  • Presto: added the ability to configure node.properties.

  • YARN: added the ability to configure container-log4j.properties.

  • Sqoop: backported SQOOP-2880, which introduces an argument that allows you to set the Sqoop temporary directory.

Release 5.6.0

The following release notes include information for the Amazon EMR 5.6.0 release. Changes are relative to the Amazon EMR 5.5.0 release.

Upgrades

  • Flink 1.2.1

  • HBase 1.3.1

  • Mahout 0.13.0. This is the first version of Mahout to support Spark 2.x in Amazon EMR version 5.0 and later.

  • Spark 2.1.1

Changes, Enhancements, and Resolved Issues

  • Presto

    • Added the ability to enable SSL/TLS secured communication between Presto nodes by enabling in-transit encryption using a security configuration. For more information, see In-transit Data Encryption.

    • Backported Presto 7661, which adds the VERBOSE option to the EXPLAIN ANALYZE statement to report more detailed, low-level statistics about a query plan.

Release 5.5.0

The following release notes include information for the Amazon EMR 5.5.0 release. Changes are relative to the Amazon EMR 5.4.0 release.

Upgrades

  • Hue 3.12

  • Presto 0.170

  • Zeppelin 0.7.1

  • ZooKeeper 3.4.10

Changes, Enhancements, and Resolved Issues

  • Spark

  • Flink

    • Flink is now built with Scala 2.11. If you use the Scala API and libraries, we recommend that you use Scala 2.11 in your projects.

    • Addressed an issue where HADOOP_CONF_DIR and YARN_CONF_DIR defaults were not properly set, so start-scala-shell.sh failed to work. Also added the ability to set these values using env.hadoop.conf.dir and env.yarn.conf.dir in /etc/flink/conf/flink-conf.yaml or the flink-conf configuration classification.

    • Introduced a new EMR-specific command, flink-scala-shell, as a wrapper for start-scala-shell.sh. We recommend using this command instead of start-scala-shell.sh. The new command simplifies execution. For example, flink-scala-shell -n 2 starts a Flink Scala shell with a task parallelism of 2.

    • Introduced a new EMR-specific command, flink-yarn-session, as a wrapper for yarn-session.sh. We recommend using this command instead of yarn-session.sh. The new command simplifies execution. For example, flink-yarn-session -n 2 -d starts a long-running Flink session in a detached state with two task managers.

    • Addressed FLINK-6125, an issue where Commons httpclient is no longer shaded in Flink 1.2.

  • Presto

    • Added support for LDAP authentication. Using LDAP with Presto on Amazon EMR requires that you enable HTTPS access for the Presto coordinator (http-server.https.enabled=true in config.properties). For configuration details, see LDAP Authentication in the Presto documentation.

    • Added support for SHOW GRANTS.

  • Amazon EMR Base Linux AMI

    • Amazon EMR releases are now based on Amazon Linux 2017.03. For more information, see Amazon Linux AMI 2017.03 Release Notes.

    • Removed Python 2.6 from the Amazon EMR base Linux image. Python 2.7 and 3.4 are installed by default. You can install Python 2.6 manually if necessary.
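As an illustration of the Flink configuration change listed above, the following Python sketch builds a flink-conf configuration classification that sets env.hadoop.conf.dir and env.yarn.conf.dir. The helper name flink_conf_classification and the directory value /etc/hadoop/conf are assumptions made for this example, not values taken from the release notes; substitute the paths used on your cluster.

```python
import json


def flink_conf_classification(hadoop_conf_dir, yarn_conf_dir):
    """Return a flink-conf classification as a plain dictionary."""
    return {
        "Classification": "flink-conf",
        "Properties": {
            "env.hadoop.conf.dir": hadoop_conf_dir,
            "env.yarn.conf.dir": yarn_conf_dir,
        },
    }


# Assumed paths, used here only for illustration.
flink_configurations = [
    flink_conf_classification("/etc/hadoop/conf", "/etc/hadoop/conf")
]
print(json.dumps(flink_configurations, indent=2))
```

The printed JSON is the kind of structure you would supply as a configuration object when creating a cluster.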

Release 5.4.0

The following release notes include information for the Amazon EMR 5.4.0 release. Changes are relative to the Amazon EMR 5.3.0 release.

Upgrades

The following upgrades are available in this release:

  • Upgraded to Flink 1.2.0

  • Upgraded to HBase 1.3.0

  • Upgraded to Phoenix 4.9.0

    Note

    If you upgrade from an earlier version of Amazon EMR to Amazon EMR version 5.4.0 or later and use secondary indexing, upgrade local indexes as described in the Apache Phoenix documentation. Amazon EMR removes the required configurations from the hbase-site classification, but indexes need to be repopulated. Both online and offline index upgrades are supported. Online upgrades are the default, which means indexes are repopulated while initializing from Phoenix clients of version 4.8.0 or greater. To specify offline upgrades, set the phoenix.client.localIndexUpgrade configuration to false in the phoenix-site classification, and then SSH to the master node to run psql [zookeeper] -1.

  • Upgraded to Presto 0.166

  • Upgraded to Zeppelin 0.7.0
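The offline index upgrade setting described in the Phoenix note above can be written as a configuration classification. This Python sketch builds that structure as a plain dictionary. The variable name is illustrative, and how you pass the result to your cluster-creation tooling is left open.

```python
import json

# phoenix-site classification that switches local index upgrades to offline
# mode, using the property named in the Phoenix upgrade note.
phoenix_offline_upgrade = {
    "Classification": "phoenix-site",
    "Properties": {
        "phoenix.client.localIndexUpgrade": "false",
    },
}

print(json.dumps([phoenix_offline_upgrade], indent=2))
```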

Changes and Enhancements

The following are changes made to Amazon EMR releases for release label emr-5.4.0:

Release 5.3.0

The following release notes include information for the Amazon EMR 5.3.0 release. Changes are relative to the Amazon EMR 5.2.1 release.

Upgrades

The following upgrades are available in this release:

  • Upgraded to Hive 2.1.1

  • Upgraded to Hue 3.11.0

  • Upgraded to Spark 2.1.0

  • Upgraded to Oozie 4.3.0

  • Upgraded to Flink 1.1.4

Changes and Enhancements

The following are changes made to Amazon EMR releases for release label emr-5.3.0:

  • Added a patch to Hue that allows you to use the interpreters_shown_on_wheel setting to configure which interpreters to show first on the Notebook selection wheel, regardless of their ordering in the hue.ini file.

  • Added the hive-parquet-logging configuration classification, which you can use to configure values in Hive's parquet-logging.properties file.

Release 5.2.2

The following release notes include information for the Amazon EMR 5.2.2 release. Changes are relative to the Amazon EMR 5.2.1 release.

Known Issues Resolved from the Previous Releases

  • Backported SPARK-19459, which addresses an issue where reading from an ORC table with char/varchar columns can fail.

Release 5.2.1

The following release notes include information for the Amazon EMR 5.2.1 release. Changes are relative to the Amazon EMR 5.2.0 release.

Upgrades

The following upgrades are available in this release:

  • Upgraded to Presto 0.157.1. For more information, see Presto Release Notes in the Presto documentation.

  • Upgraded to ZooKeeper 3.4.9. For more information, see ZooKeeper Release Notes in the Apache ZooKeeper documentation.

Changes and Enhancements

The following are changes made to Amazon EMR releases for release label emr-5.2.1:

  • Added support for the Amazon EC2 m4.16xlarge instance type in Amazon EMR version 4.8.3 and later, excluding 5.0.0, 5.0.3, and 5.2.0.

  • Amazon EMR releases are now based on Amazon Linux 2016.09. For more information, see https://aws.amazon.com/amazon-linux-ami/2016.09-release-notes/.

  • The locations of the Flink and YARN configuration paths are now set by default in /etc/default/flink, so you don't need to set the environment variables FLINK_CONF_DIR and HADOOP_CONF_DIR when running the flink or yarn-session.sh driver scripts to launch Flink jobs.

  • Added support for the FlinkKinesisConsumer class.

Known Issues Resolved from the Previous Releases

  • Fixed an issue in Hadoop where the ReplicationMonitor thread could get stuck for a long time because of a race between replication and deletion of the same file in a large cluster.

  • Fixed an issue where ControlledJob#toString failed with a null pointer exception (NPE) when job status was not successfully updated.

Release 5.2.0

The following release notes include information for the Amazon EMR 5.2.0 release. Changes are relative to the Amazon EMR 5.1.0 release.

Changes and Enhancements

The following changes and enhancements are available in this release:

  • Added Amazon S3 storage mode for HBase.

  • You can now specify an Amazon S3 location for the HBase rootdir. For more information, see HBase on Amazon S3.
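As a rough sketch of the HBase rootdir setting, the following Python snippet builds an hbase-site classification that points hbase.rootdir at an Amazon S3 location. The bucket and prefix are hypothetical, and the helper name is an assumption. A complete HBase on Amazon S3 setup may need additional settings that are described in HBase on Amazon S3 and are not shown here.

```python
import json


def hbase_rootdir_classification(s3_location):
    """Return an hbase-site classification that sets hbase.rootdir."""
    if not s3_location.startswith("s3://"):
        raise ValueError("expected an s3:// location, got %r" % s3_location)
    return {
        "Classification": "hbase-site",
        "Properties": {"hbase.rootdir": s3_location},
    }


# Hypothetical bucket and prefix, for illustration only.
hbase_site = hbase_rootdir_classification("s3://my-example-bucket/hbase-root")
print(json.dumps([hbase_site], indent=2))
```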

Upgrades

The following upgrades are available in this release:

  • Upgraded to Spark 2.0.2

Known Issues Resolved from the Previous Releases

  • Fixed an issue with /mnt being constrained to 2 TB on EBS-only instance types.

  • Fixed an issue with instance-controller and logpusher logs being output to their corresponding .out files instead of to their normal log4j-configured .log files, which rotate hourly. The .out files don't rotate, so this would eventually fill up the /emr partition. This issue only affects hardware virtual machine (HVM) instance types.

Release 5.1.0

The following release notes include information for the Amazon EMR 5.1.0 release. Changes are relative to the Amazon EMR 5.0.0 release.

Changes and Enhancements

The following changes and enhancements are available in this release:

  • Added support for Flink 1.1.3.

  • Presto has been added as an option in the notebook section of Hue.

Upgrades

The following upgrades are available in this release:

  • Upgraded to HBase 1.2.3

  • Upgraded to Zeppelin 0.6.2

Known Issues Resolved from the Previous Releases

  • Fixed an issue where Tez queries on Amazon S3 with ORC files did not perform as well as in earlier Amazon EMR 4.x versions.

Release 5.0.3

The following release notes include information for the Amazon EMR 5.0.3 release. Changes are relative to the Amazon EMR 5.0.0 release.

Upgrades

The following upgrades are available in this release:

  • Upgraded to Hadoop 2.7.3

  • Upgraded to Presto 0.152.3, which includes support for the Presto web interface. You can access the Presto web interface on the Presto coordinator using port 8889. For more information about the Presto web interface, see Web Interface in the Presto documentation.

  • Upgraded to Spark 2.0.1

  • Amazon EMR releases are now based on Amazon Linux 2016.09. For more information, see https://aws.amazon.com/amazon-linux-ami/2016.09-release-notes/.

Release 5.0.0

Upgrades

The following upgrades are available in this release:

  • Upgraded to Hive 2.1

  • Upgraded to Presto 0.150

  • Upgraded to Spark 2.0

  • Upgraded to Hue 3.10.0

  • Upgraded to Pig 0.16.0

  • Upgraded to Tez 0.8.4

  • Upgraded to Zeppelin 0.6.1

Changes and Enhancements

The following are changes made to Amazon EMR releases for release label emr-5.0.0 or greater:

  • Amazon EMR supports the latest open-source versions of Hive (version 2.1) and Pig (version 0.16.0). If you have used Hive or Pig on Amazon EMR in the past, this may affect some use cases. For more information, see Hive and Pig.

  • The default execution engine for Hive and Pig is now Tez. To change this, you would edit the appropriate values in the hive-site and pig-properties configuration classifications, respectively.

  • An enhanced step debugging feature was added, which allows you to see the root cause of step failures if the service can determine the cause. For more information, see Enhanced Step Debugging in the Amazon EMR Management Guide.

  • Applications that previously ended with "-Sandbox" no longer have that suffix. This may break your automation, for example, if you are using scripts to launch clusters with these applications. The following table shows application names in Amazon EMR 4.7.2 versus Amazon EMR 5.0.0.

    Application name changes

    Amazon EMR 4.7.2     Amazon EMR 5.0.0
    Oozie-Sandbox        Oozie
    Presto-Sandbox       Presto
    Sqoop-Sandbox        Sqoop
    Zeppelin-Sandbox     Zeppelin
    ZooKeeper-Sandbox    ZooKeeper

  • Spark is now compiled for Scala 2.11.

  • Java 8 is now the default JVM. All applications run using the Java 8 runtime. There are no changes to any application’s byte code target. Most applications continue to target Java 7.

  • Zeppelin now includes authentication features. For more information, see Zeppelin.

  • Added support for security configurations, which allow you to create and apply encryption options more easily. For more information, see Data Encryption.
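Because the renamed applications can break launch scripts, a small translation step may help when migrating automation from Amazon EMR 4.x to 5.0.0. This Python sketch encodes the application name changes listed above and rewrites a list of application names. The function name is hypothetical; names that are not in the mapping pass through unchanged.

```python
# Mapping taken from the "Application name changes" table.
SANDBOX_RENAMES = {
    "Oozie-Sandbox": "Oozie",
    "Presto-Sandbox": "Presto",
    "Sqoop-Sandbox": "Sqoop",
    "Zeppelin-Sandbox": "Zeppelin",
    "ZooKeeper-Sandbox": "ZooKeeper",
}


def to_emr5_application_names(names):
    """Replace 4.x "-Sandbox" application names with their 5.0.0 names."""
    return [SANDBOX_RENAMES.get(name, name) for name in names]


emr5_names = to_emr5_application_names(["Hive", "Presto-Sandbox", "Zeppelin-Sandbox"])
print(emr5_names)
```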

Release 4.9.2

The following release notes include information for the Amazon EMR 4.9.2 release. Changes are relative to the Amazon EMR 4.9.1 release.

Minor changes, bug fixes, and enhancements were made in this release.

Release 4.9.1

The following release notes include information for the Amazon EMR 4.9.1 release. Changes are relative to the Amazon EMR 4.8.4 release.

Known Issues Resolved from the Previous Releases

  • Backported HIVE-9976 and HIVE-10106.

  • Fixed an issue in YARN where a large number of nodes (greater than 2,000) and containers (greater than 5,000) would cause an out of memory error, for example: "Exception in thread 'main' java.lang.OutOfMemoryError".

Changes and Enhancements

The following are changes made to Amazon EMR releases for release label emr-4.9.1:

Release 4.8.4

The following release notes include information for the Amazon EMR 4.8.4 release. Changes are relative to the Amazon EMR 4.8.3 release.

Minor changes, bug fixes, and enhancements were made in this release.

Release 4.8.3

The following release notes include information for the Amazon EMR 4.8.3 release. Changes are relative to the Amazon EMR 4.8.2 release.

Upgrades

The following upgrades are available in this release:

  • Upgraded to Presto 0.157.1. For more information, see Presto Release Notes in the Presto documentation.

  • Upgraded to Spark 1.6.3. For more information, see Spark Release Notes in the Apache Spark documentation.

  • Upgraded to ZooKeeper 3.4.9. For more information, see ZooKeeper Release Notes in the Apache ZooKeeper documentation.

Changes and Enhancements

The following are changes made to Amazon EMR releases for release label emr-4.8.3:

Known Issues Resolved from the Previous Releases

  • Fixed an issue in Hadoop where the ReplicationMonitor thread could get stuck for a long time because of a race between replication and deletion of the same file in a large cluster.

  • Fixed an issue where ControlledJob#toString failed with a null pointer exception (NPE) when job status was not successfully updated.

Release 4.8.2

The following release notes include information for the Amazon EMR 4.8.2 release. Changes are relative to the Amazon EMR 4.8.0 release.

Upgrades

The following upgrades are available in this release:

  • Upgraded to Hadoop 2.7.3

  • Upgraded to Presto 0.152.3, which includes support for the Presto web interface. You can access the Presto web interface on the Presto coordinator using port 8889. For more information about the Presto web interface, see Web Interface in the Presto documentation.

  • Amazon EMR releases are now based on Amazon Linux 2016.09. For more information, see https://aws.amazon.com/amazon-linux-ami/2016.09-release-notes/.

Release 4.8.0

Upgrades

The following upgrades are available in this release:

  • Upgraded to HBase 1.2.2

  • Upgraded to Presto-Sandbox 0.151

  • Upgraded to Tez 0.8.4

  • Upgraded to Zeppelin-Sandbox 0.6.1

Changes and Enhancements

The following are changes made to Amazon EMR releases for release label emr-4.8.0:

  • Fixed an issue in YARN where the ApplicationMaster would attempt to clean up containers that no longer exist because their instances have been terminated.

  • Corrected the hive-server2 URL for Hive2 actions in the Oozie examples.

  • Added support for additional Presto catalogs.

  • Backported patches: HIVE-8948, HIVE-12679, HIVE-13405, PHOENIX-3116, HADOOP-12689

  • Added support for security configurations, which allow you to create and apply encryption options more easily. For more information, see Data Encryption.

Release 4.7.2

The following release notes include information for Amazon EMR 4.7.2.

Features

The following features are available in this release:

  • Upgraded to Mahout 0.12.2

  • Upgraded to Presto 0.148

  • Upgraded to Spark 1.6.2

  • You can now create an AWSCredentialsProvider for use with EMRFS using a URI as a parameter. For more information, see Create an AWSCredentialsProvider for EMRFS.

  • EMRFS now allows users to configure a custom DynamoDB endpoint for their Consistent View metadata using the fs.s3.consistent.dynamodb.endpoint property in emrfs-site.xml.

  • Added a script in /usr/bin called spark-example, which wraps /usr/lib/spark/spark/bin/run-example so you can run examples directly. For instance, to run the SparkPi example that comes with the Spark distribution, you can run spark-example SparkPi 100 from the command line or using command-runner.jar as a step in the API.
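To illustrate running spark-example through command-runner.jar as a step, this Python sketch builds a step definition as a plain dictionary. The step name and the ActionOnFailure value are assumptions chosen for the example, and the helper function is hypothetical. The result would be passed to whatever client you use to add steps.

```python
import json


def spark_example_step(example_name, *example_args):
    """Build a step that runs spark-example through command-runner.jar."""
    return {
        "Name": "Run %s example" % example_name,  # hypothetical step name
        "ActionOnFailure": "CONTINUE",            # assumed failure behavior
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            # Equivalent to running "spark-example SparkPi 100" on the command line.
            "Args": ["spark-example", example_name]
            + [str(arg) for arg in example_args],
        },
    }


spark_pi_step = spark_example_step("SparkPi", 100)
print(json.dumps(spark_pi_step, indent=2))
```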

Known Issues Resolved from Previous Releases

  • Fixed an issue in Oozie where spark-assembly.jar was not in the correct location when Spark was also installed, which resulted in failure to launch Spark applications with Oozie.

  • Fixed an issue with Spark Log4j-based logging in YARN containers.

Release 4.7.1

Known Issues Resolved from Previous Releases

  • Fixed an issue that extended the startup time of clusters launched in a VPC with private subnets. The bug only impacted clusters launched with the Amazon EMR 4.7.0 release.

  • Fixed an issue that improperly handled listing of files in Amazon EMR for clusters launched with the Amazon EMR 4.7.0 release.

Release 4.7.0

Important

Amazon EMR 4.7.0 is deprecated. Use Amazon EMR 4.7.1 or later instead.

Features

The following features are available in this release:

  • Added Apache Phoenix 4.7.0

  • Added Apache Tez 0.8.3

  • Upgraded to HBase 1.2.1

  • Upgraded to Mahout 0.12.0

  • Upgraded to Presto 0.147

  • Upgraded the AWS SDK for Java to 1.10.75

  • The final flag was removed from the mapreduce.cluster.local.dir property in mapred-site.xml to allow users to run Pig in local mode.

Amazon Redshift JDBC Drivers Available on Cluster

Amazon Redshift JDBC drivers are now included at /usr/share/aws/redshift/jdbc. /usr/share/aws/redshift/jdbc/RedshiftJDBC41.jar is the JDBC 4.1-compatible Amazon Redshift driver and /usr/share/aws/redshift/jdbc/RedshiftJDBC4.jar is the JDBC 4.0-compatible Amazon Redshift driver. For more information, see Configure a JDBC Connection in the Amazon Redshift Cluster Management Guide.

Java 8

Except for Presto, OpenJDK 1.7 is the default JDK used for all applications. However, both OpenJDK 1.7 and 1.8 are installed. For information about how to set JAVA_HOME for applications, see Configuring Applications to Use Java 8.

Known Issues Resolved from Previous Releases

  • Fixed a kernel issue that significantly affected performance on Throughput Optimized HDD (st1) EBS volumes for Amazon EMR in emr-4.6.0.

  • Fixed an issue where a cluster would fail if any HDFS encryption zone were specified without choosing Hadoop as an application.

  • Changed the default HDFS write policy from RoundRobin to AvailableSpaceVolumeChoosingPolicy. Some volumes were not properly utilized with the RoundRobin configuration, which resulted in failed core nodes and an unreliable HDFS.

  • Fixed an issue with the EMRFS CLI, which would cause an exception when creating the default DynamoDB metadata table for consistent views.

  • Fixed a deadlock issue in EMRFS that potentially occurred during multipart rename and copy operations.

  • Fixed an issue with EMRFS that caused the CopyPart size default to be 5 MB. The default is now properly set at 128 MB.

  • Fixed an issue with the Zeppelin upstart configuration that potentially prevented you from stopping the service.

  • Fixed an issue with Spark and Zeppelin, which prevented you from using the s3a:// URI scheme because /usr/lib/hadoop/hadoop-aws.jar was not properly loaded in their respective classpath.

  • Backported HUE-2484.

  • Backported a commit from Hue 3.9.0 (no JIRA exists) to fix an issue with the HBase browser sample.

  • Backported HIVE-9073.

Release 4.6.0

Features

The following features are available in this release:

Issue Affecting Throughput Optimized HDD (st1) EBS Volume Types

An issue in Linux kernel versions 4.2 and above significantly affects performance on Throughput Optimized HDD (st1) EBS volumes for EMR. This release (emr-4.6.0) uses kernel version 4.4.5 and is therefore affected. We recommend not using emr-4.6.0 if you want to use st1 EBS volumes. You can use emr-4.5.0 or earlier Amazon EMR releases with st1 without impact. The fix is provided in later releases.

Python Defaults

Python 3.4 is now installed by default, but Python 2.7 remains the system default. You can configure Python 3.4 as the system default using a bootstrap action. You can also use the configuration API to set the PYSPARK_PYTHON export to /usr/bin/python3.4 in the spark-env classification, which affects the Python version used by PySpark.
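The PYSPARK_PYTHON setting can be sketched as a configuration object. The example below assumes the usual layout in which environment exports are declared in a nested export classification inside spark-env. The variable name is illustrative.

```python
import json

# spark-env classification with a nested "export" classification that sets
# PYSPARK_PYTHON so that PySpark uses Python 3.4.
pyspark_python_config = [
    {
        "Classification": "spark-env",
        "Configurations": [
            {
                "Classification": "export",
                "Properties": {"PYSPARK_PYTHON": "/usr/bin/python3.4"},
            }
        ],
    }
]

print(json.dumps(pyspark_python_config, indent=2))
```

This changes only the interpreter used by PySpark. The system default Python is unaffected.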

Java 8

Except for Presto, OpenJDK 1.7 is the default JDK used for all applications. However, both OpenJDK 1.7 and 1.8 are installed. For information about how to set JAVA_HOME for applications, see Configuring Applications to Use Java 8.

Known Issues Resolved from Previous Releases

  • Fixed an issue where application provisioning would sometimes randomly fail due to a generated password.

  • Previously, mysqld was installed on all nodes. Now, it is only installed on the master instance and only if the chosen application includes mysql-server as a component. Currently, the following applications include the mysql-server component: HCatalog, Hive, Hue, Presto-Sandbox, and Sqoop-Sandbox.

  • Changed yarn.scheduler.maximum-allocation-vcores to 80 from the default of 32. This fixes an issue introduced in emr-4.4.0 that mainly occurs with Spark when using the maximizeResourceAllocation option in a cluster whose core instance type is one of a few large instance types that have YARN vcores set higher than 32. The affected instance types are c4.8xlarge, cc2.8xlarge, hs1.8xlarge, i2.8xlarge, m2.4xlarge, r3.8xlarge, d2.8xlarge, and m4.10xlarge.

  • s3-dist-cp now uses EMRFS for all Amazon S3 nominations and no longer stages to a temporary HDFS directory.

  • Fixed an issue with exception handling for client-side encryption multipart uploads.

  • Added an option to allow users to change the Amazon S3 storage class. By default this setting is STANDARD. The emrfs-site configuration classification setting is fs.s3.storageClass and the possible values are STANDARD, STANDARD_IA, and REDUCED_REDUNDANCY. For more information about storage classes, see Storage Classes in the Amazon Simple Storage Service Developer Guide.
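The storage class option above can be sketched as an emrfs-site classification. This Python example validates the value against the three storage classes named in the release note before building the dictionary. The helper name is hypothetical.

```python
# Values listed in the release note for fs.s3.storageClass.
ALLOWED_STORAGE_CLASSES = ("STANDARD", "STANDARD_IA", "REDUCED_REDUNDANCY")


def emrfs_storage_class_classification(storage_class="STANDARD"):
    """Return an emrfs-site classification that sets fs.s3.storageClass."""
    if storage_class not in ALLOWED_STORAGE_CLASSES:
        raise ValueError(
            "unsupported storage class %r; expected one of %s"
            % (storage_class, ", ".join(ALLOWED_STORAGE_CLASSES))
        )
    return {
        "Classification": "emrfs-site",
        "Properties": {"fs.s3.storageClass": storage_class},
    }


storage_class_config = emrfs_storage_class_classification("STANDARD_IA")
print(storage_class_config)
```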

Release 4.5.0

Features

The following features are available in this release:

  • Upgraded to Spark 1.6.1

  • Upgraded to Hadoop 2.7.2

  • Upgraded to Presto 0.140

  • Added AWS KMS support for Amazon S3 server-side encryption.

Known Issues Resolved from Previous Releases

  • Fixed an issue where MySQL and Apache servers would not start after a node was rebooted.

  • Fixed an issue where IMPORT did not work correctly with non-partitioned tables stored in Amazon S3.

  • Fixed an issue with Presto where it requires the staging directory to be /mnt/tmp rather than /tmp when writing to Hive tables.

Release 4.4.0

Features

The following features are available in this release:

  • Added HCatalog 1.0.0

  • Added Sqoop-Sandbox 1.4.6

  • Upgraded to Presto 0.136

  • Upgraded to Zeppelin 0.5.6

  • Upgraded to Mahout 0.11.1

  • Enabled dynamicResourceAllocation by default.

  • Added a table of all configuration classifications for the release. For more information, see the Configuration Classifications table in Configuring Applications.

Known Issues Resolved from Previous Releases

  • Fixed an issue where the maximizeResourceAllocation setting would not reserve enough memory for YARN ApplicationMaster daemons.

  • Fixed an issue encountered with a custom DNS. If any entries in resolv.conf precede the custom entries provided, then the custom entries are not resolvable. This behavior affected clusters in a VPC where the default VPC name server is inserted as the top entry in resolv.conf.

  • Fixed an issue where the default Python moved to version 2.7 and boto was not installed for that version.

  • Fixed an issue where YARN containers and Spark applications would generate a unique Ganglia round robin database (rrd) file, which resulted in the first disk attached to the instance filling up. Because of this fix, YARN container level metrics have been disabled and Spark application level metrics have been disabled.

  • Fixed an issue in log pusher where it would delete all empty log folders. The effect was that the Hive CLI was not able to log because log pusher was removing the empty user folder under /var/log/hive.

  • Fixed an issue affecting Hive imports, which affected partitioning and resulted in an error during import.

  • Fixed an issue where EMRFS and s3-dist-cp did not properly handle bucket names that contain periods.

  • Changed a behavior in EMRFS so that in versioning-enabled buckets the _$folder$ marker file is not continuously created, which may contribute to improved performance for versioning-enabled buckets.

  • Changed the behavior in EMRFS such that it does not use instruction files except for cases where client-side encryption is enabled. If you want to delete instruction files while using client-side encryption, you can set the emrfs-site.xml property, fs.s3.cse.cryptoStorageMode.deleteInstructionFiles.enabled, to true.

  • Changed YARN log aggregation to retain logs at the aggregation destination for two days. The default destination is your cluster's HDFS storage. If you want to change this duration, change the value of yarn.log-aggregation.retain-seconds using the yarn-site configuration classification when you create your cluster. As always, you can save your application logs to Amazon S3 using the log-uri parameter when you create your cluster.
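The YARN log retention setting is expressed in seconds, so changing the two-day default means converting days to seconds first. This Python sketch does the arithmetic and builds a yarn-site classification. The helper names and the seven-day value are illustrative assumptions.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_to_seconds(days):
    """Convert a retention period in whole days to seconds."""
    return days * SECONDS_PER_DAY


def yarn_log_retention_classification(days):
    """Return a yarn-site classification that retains aggregated logs for `days`."""
    return {
        "Classification": "yarn-site",
        "Properties": {
            # Property values are passed as strings.
            "yarn.log-aggregation.retain-seconds": str(days_to_seconds(days)),
        },
    }


# The two-day default described above corresponds to 172800 seconds.
default_retention_seconds = days_to_seconds(2)

# Hypothetical example: keep aggregated logs for seven days instead.
retention_config = yarn_log_retention_classification(7)
print(default_retention_seconds, retention_config)
```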

Patches Applied

The following patches from open source projects were included in this release:

Release 4.3.0

Features

The following features are available in this release:

  • Upgraded to Hadoop 2.7.1

  • Upgraded to Spark 1.6.0

  • Upgraded Ganglia to 3.7.2

  • Upgraded Presto to 0.130

Amazon EMR made some changes to spark.dynamicAllocation.enabled when it is set to true; it is false by default. When set to true, this affects the defaults set by the maximizeResourceAllocation setting:

  • If spark.dynamicAllocation.enabled is set to true, spark.executor.instances is not set by maximizeResourceAllocation.

  • The spark.driver.memory setting is now configured based on the instance types in the cluster in a similar way to how spark.executor.memory is set. However, because the Spark driver application may run on either the master or one of the core instances (for example, in YARN client and cluster modes, respectively), the spark.driver.memory setting is set based on the smaller of the instance types in these two instance groups.

  • The spark.default.parallelism setting is now set at twice the number of CPU cores available for YARN containers. In previous releases, this was half that value.

  • The calculations for the memory overhead reserved for Spark YARN processes were adjusted to be more accurate, resulting in a small increase in the total amount of memory available to Spark (that is, spark.executor.memory).
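The spark.default.parallelism rule above is simple arithmetic: twice the number of CPU cores available for YARN containers. The following Python sketch shows the calculation for a hypothetical cluster. The function name and core counts are assumptions for illustration, and the sketch does not reproduce how Amazon EMR determines the number of cores available to YARN.

```python
def default_parallelism(yarn_container_cores):
    """Return twice the number of CPU cores available for YARN containers."""
    if yarn_container_cores < 1:
        raise ValueError("yarn_container_cores must be at least 1")
    return 2 * yarn_container_cores


# Hypothetical cluster: 4 core nodes, each offering 8 cores to YARN containers.
total_cores = 4 * 8
parallelism = default_parallelism(total_cores)
print(parallelism)
```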

Known Issues Resolved from the Previous Releases

  • YARN log aggregation is now enabled by default.

  • Fixed an issue where logs would not be pushed to a cluster's Amazon S3 logs bucket when YARN log aggregation was enabled.

  • YARN container sizes now have a new minimum of 32 across all node types.

  • Fixed an issue with Ganglia that caused excessive disk I/O on the master node in large clusters.

  • Fixed an issue that prevented application logs from being pushed to Amazon S3 when a cluster is shutting down.

  • Fixed an issue in EMRFS CLI that caused certain commands to fail.

  • Fixed an issue with Zeppelin that prevented dependencies from being loaded in the underlying SparkContext.

  • Fixed an issue that resulted from issuing a resize attempting to add instances.

  • Fixed an issue in Hive where CREATE TABLE AS SELECT makes excessive list calls to Amazon S3.

  • Fixed an issue where large clusters would not provision properly when Hue, Oozie, and Ganglia are installed.

  • Fixed an issue in s3-dist-cp where it would return a zero exit code even if it failed with an error.

Patches Applied

The following patches from open source projects were included in this release:

Release 4.2.0

Features

The following features are available in this release:

  • Added Ganglia support

  • Upgraded to Spark 1.5.2

  • Upgraded to Presto 0.125

  • Upgraded Oozie to 4.2.0

  • Upgraded Zeppelin to 0.5.5

  • Upgraded the AWS SDK for Java to 1.10.27

Known Issues Resolved from the Previous Releases

  • Fixed an issue with the EMRFS CLI where it did not use the default metadata table name.

  • Fixed an issue encountered when using ORC-backed tables in Amazon S3.

  • Fixed an issue encountered with a Python version mismatch in the Spark configuration.

  • Fixed an issue where YARN node status failed to report because of DNS issues for clusters in a VPC.

  • Fixed an issue encountered when YARN decommissioned nodes, resulting in hung applications or the inability to schedule new applications.

  • Fixed an issue encountered when clusters terminated with status TIMED_OUT_STARTING.

  • Fixed an issue encountered when including the EMRFS Scala dependency in other builds. The Scala dependency has been removed.