What's New? - Amazon EMR

What's New?

This topic covers features and issues resolved in the current release of Amazon EMR 6.x series and 5.x series. These release notes are also available on the Release 6.0.0 Tab and Release 5.30.1 Tab, along with the application versions, component versions, and available configuration classifications for this release.

Subscribe to the RSS feed for Amazon EMR release notes at https://docs.aws.amazon.com/emr/latest/ReleaseGuide/amazon-emr-release-notes.rss to receive updates when a new Amazon EMR release version is available.

For earlier release notes going back to release version 4.2.0, see Amazon EMR What's New History.

Note

Twenty-five previous Amazon EMR release versions now use AWS Signature Version 4 to authenticate requests to Amazon S3. The use of AWS Signature version 2 is being phased out and new S3 buckets created after June 24, 2020 will not support Signature Version 2 signed requests. Existing buckets will continue to support Signature Version 2. We recommend migrating to an Amazon EMR release that supports Signature Version 4 so you can continue accessing new S3 buckets and avoid any potential interruption to your workloads.

The following EMR releases are now available that supports Signature Version 4: emr-4.7.4, emr-4.8.5, emr-4.9.6, emr-4.10.1, emr-5.1.1, emr-5.2.3, emr-5.3.2, emr-5.4.1, emr-5.5.4, emr-5.6.1, emr-5.7.1, emr-5.8.3, emr-5.9.1, emr-5.10.1, emr-5.11.4, emr-5.12.3, emr-5.13.1, emr-5.14.2, emr-5.15.1, emr-5.16.1, emr-5.17.2, emr-5.18.1, emr-5.19.1, emr-5.20.1, and emr-5.21.2. EMR version 5.22.0 and later already support Signature Version 4.

You do not need to change your application code to use Signature Version 4 if you are using Amazon EMR applications, such as Apache Spark, Apache Hive, Presto, etc. If you are using custom applications, which are not included with Amazon EMR, you may need to update your code to use Signature Version 4. For more information about what updates may be required, see Moving from Signature Version 2 to Signature Version 4.

Release 6.0.0 (Latest version of Amazon EMR 6.x series)

New Amazon EMR release versions are made available in different regions over a period of several days, beginning with the first region on the initial release date. The latest release version may not be available in your region during this period.

The following release notes include information for Amazon EMR release version 6.0.0.

Initial release date: March 10, 2020

Supported Applications

  • AWS SDK for Java version 1.11.711

  • Ganglia version 3.7.2

  • Hadoop version 3.2.1

  • HBase version 2.2.3

  • HCatalog version 3.1.2

  • Hive version 3.1.2

  • Hudi version 0.5.0-incubating

  • Hue version 4.4.0

  • JupyterHub version 1.0.0

  • Livy version 0.6.0

  • MXNet version 1.5.1

  • Oozie version 5.1.0

  • Phoenix version 5.0.0

  • Presto version 0.230

  • Spark version 2.4.4

  • TensorFlow version 1.14.0

  • Zeppelin version 0.9.0-SNAPSHOT

  • Zookeeper version 3.4.14

  • Connectors and drivers: DynamoDB Connector 4.14.0

Note

Flink, Sqoop, Pig, and Mahout are not available in Amazon EMR version 6.0.0.

New Features

  • YARN Docker Runtime Support - YARN applications, such as Spark jobs, can now run in the context of a Docker container. This allows you to easily define dependencies in a Docker image without the need to install custom libraries on your Amazon EMR cluster. For more information, see Configure Docker Integration and Run Spark applications with Docker using Amazon EMR 6.0.0.

  • Hive LLAP Support - Hive now supports the LLAP execution mode for improved query performance. For more information, see Using Hive LLAP.

Changes, Enhancements, and Resolved Issues

  • Amazon Linux

    • Amazon Linux 2 is the operating system for the EMR 6.x release series.

    • systemd is used for service management instead of upstart used in Amazon Linux 1.

  • Java Development Kit (JDK)

    • Coretto JDK 8 is the default JDK for the EMR 6.x release series.

  • Scala

    • Scala 2.12 is used with Apache Spark and Apache Livy.

  • Python 3

    • Python 3 is now the default version of Python in EMR.

  • YARN node labels

    • Beginning with Amazon EMR 6.x release series, the YARN node labels feature is disabled by default. The application master processes can run on both core and task nodes by default. You can enable the YARN node labels feature by configuring following properties: yarn.node-labels.enabled and yarn.node-labels.am.default-node-label-expression. For more information, see Understanding Master, Core, and Task Nodes.

Known Issues

  • Spark interactive shell, including PySpark, SparkR, and spark-shell, does not support using Docker with additional libraries.

  • To use Python 3 with Amazon EMR version 6.0.0, you must add PATH to yarn.nodemanager.env-whitelist.

  • The Live Long and Process (LLAP) functionality is not supported when you use the AWS Glue Data Catalog as the metastore for Hive.

  • When using Amazon EMR 6.0.0 with Spark and Docker integration, you need to configure the instances in your cluster with the same instance type and the same amount of EBS volumes to avoid failure when submitting a Spark job with Docker runtime.

  • In Amazon EMR 6.0.0, HBase on Amazon S3 storage mode is impacted by the HBASE-24286. issue. HBase master cannot initialize when the cluster is created using existing S3 data.

Release 5.30.1 (Latest version of Amazon EMR 5.x series)

New Amazon EMR release versions are made available in different regions over a period of several days, beginning with the first region on the initial release date. The latest release version may not be available in your region during this period.

The following release notes include information for Amazon EMR release version 5.30.1. Changes are relative to 5.30.0.

Initial release date: June 30, 2020

Changes, Enhancements, and Resolved Issues

  • Fixed issue where instance controller process spawned infinite number of processes.

  • Fixed issue where Hue was unable to run an Hive query, showing a "database is locked" message and preventing the execution of queries.

  • Fixed a Spark issue to enable more tasks to run concurrently on the EMR cluster.

  • Fixed a Jupyter notebook issue causing a "too many files open error" in the Jupyter server.

  • Fixed an issue with cluster start times.

New Features

  • Tez UI and YARN timeline server persistent application interfaces are available with Amazon EMR versions 6.x, and EMR version 5.30.1 and later. One-click link access to persistent application history lets you quickly access job history without setting up a web proxy through an SSH connection. Logs for active and terminated clusters are available for 30 days after the application ends. For more information, see View Persistent Application User Interfaces in the Amazon EMR Management Guide.

Known Issues

The feature that allows you to install additional Python libraries and kernels on the master node of the cluster is disabled by default on EMR version 5.30.1. To enable the feature on an EMR 5.30.1 cluster, do these two things:

  • Make sure that the EMR notebook service role includes the following permission:

    elasticmapreduce:ListSteps

    For more information, see Service Role for EMR Notebooks.

  • Run an EMR notebook setup step on the 5.30.1 cluster. See the command line interface (CLI) code below to add the step. For more information about the kernel on cluster feature, see Installing Kernels and Python Libraries on a Cluster Master Node. For information about adding a step using the command line interface, see Adding Steps to a Cluster Using the AWS CLI.

    Use the command line syntax below to run a step to enable installing Python libraries and kernels on a 5.30.1 cluster.

    aws emr add-steps --cluster-id <cluster id> --steps 'Type=CUSTOM_JAR,Name=EMRNotebooksSetup,ActionOnFailure=CONTINUE,Jar=s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://awssupportdatasvcs.com/bootstrap-actions/EMRNotebooksSetup/emr-notebooks-setup.sh"]'