What's New? - Amazon EMR

What's New?

This topic covers features and issues resolved in the current release of Amazon EMR 6.x series and 5.x series. These release notes are also available on the Release 6.1.0 Tab and Release 5.30.1 Tab, along with the application versions, component versions, and available configuration classifications for this release.

Subscribe to the RSS feed for Amazon EMR release notes at https://docs.aws.amazon.com/emr/latest/ReleaseGuide/amazon-emr-release-notes.rss to receive updates when a new Amazon EMR release version is available.

For earlier release notes going back to release version 4.2.0, see Amazon EMR What's New History.

Note

Twenty-five previous Amazon EMR release versions now use AWS Signature Version 4 to authenticate requests to Amazon S3. The use of AWS Signature version 2 is being phased out and new S3 buckets created after June 24, 2020 will not support Signature Version 2 signed requests. Existing buckets will continue to support Signature Version 2. We recommend migrating to an Amazon EMR release that supports Signature Version 4 so you can continue accessing new S3 buckets and avoid any potential interruption to your workloads.

The following EMR releases are now available that supports Signature Version 4: emr-4.7.4, emr-4.8.5, emr-4.9.6, emr-4.10.1, emr-5.1.1, emr-5.2.3, emr-5.3.2, emr-5.4.1, emr-5.5.4, emr-5.6.1, emr-5.7.1, emr-5.8.3, emr-5.9.1, emr-5.10.1, emr-5.11.4, emr-5.12.3, emr-5.13.1, emr-5.14.2, emr-5.15.1, emr-5.16.1, emr-5.17.2, emr-5.18.1, emr-5.19.1, emr-5.20.1, and emr-5.21.2. EMR version 5.22.0 and later already support Signature Version 4.

You do not need to change your application code to use Signature Version 4 if you are using Amazon EMR applications, such as Apache Spark, Apache Hive, Presto, etc. If you are using custom applications, which are not included with Amazon EMR, you may need to update your code to use Signature Version 4. For more information about what updates may be required, see Moving from Signature Version 2 to Signature Version 4.

Release 6.1.0 (Latest version of Amazon EMR 6.x series)

New Amazon EMR release versions are made available in different regions over a period of several days, beginning with the first region on the initial release date. The latest release version may not be available in your region during this period.

The following release notes include information for Amazon EMR release version 6.1.0. Changes are relative to 6.0.0.

Initial release date: Sept 04, 2020

Supported Applications

  • AWS SDK for Java version 1.11.828

  • Ganglia version 3.7.2

  • Hadoop version 3.2.1-amzn-1

  • HBase version 2.2.5

  • HBase-operator-tools 1.0.0

  • HCatalog version 3.1.2-amzn-0

  • Hive version 3.1.2-amzn-1

  • Hudi version 0.5.2-incubating

  • Hue version 4.7.1

  • JupyterHub version 1.1.0

  • Livy version 0.7.0

  • MXNet version 1.6.0

  • Oozie version 5.2.0

  • Phoenix version 5.0.0

  • Presto version 0.232

  • PrestoSQL version 338

  • Spark version 3.0.0

  • TensorFlow version 2.1.0

  • Zeppelin version 0.9.0-preview1

  • Zookeeper version 3.4.14

  • Connectors and drivers: DynamoDB Connector 4.14.0

New Features

  • Managed Scaling – With Amazon EMR version 6.1.0, you can enable EMR managed scaling to automatically increase or decrease the number of instances or units in your cluster based on workload. EMR continuously evaluates cluster metrics to make scaling decisions that optimize your clusters for cost and speed. Managed Scaling is also available on Amazon EMR version 5.30.0 and later, except 6.0.0. For more information, see Scaling Cluster Resources in the Amazon EMR Management Guide.

  • PrestoSQL version 338 is supported with EMR 6.1.0. For more information, see Configure Docker Integration.

    • PrestoSQL is supported on EMR 6.1.0 and later versions only, not on EMR 6.0.0 or EMR 5.x.

    • The application name, Presto continues to be used to install PrestoDB on clusters. To install PrestoSQL on clusters, use the application name PrestoSQL.

    • You can install either PrestoDB or PrestoSQL, but you cannot install both on a single cluster. If both PrestoDB and PresoSQL are specified when attempting to create a cluster, a validation error occurs and the cluster creation request fails.

    • PrestoSQL is supported on both single-master and muti-master clusters. On multi-master clusters, an external Hive metastore is required to run PrestoSQL or PrestoDB. See Supported Applications in an EMR Cluster with Multiple Master Nodes.

  • ECR auto authentication support on Apache Hadoop and Apache Spark with Docker: Spark users can use Docker images from Docker Hub and Amazon Elastic Container Registry (Amazon ECR) to define environment and library dependencies.

    Configure Docker and Run Spark Applications with Docker Using Amazon EMR 6.x.

  • EMR supports Apache Hive ACID transactions: Amazon EMR 6.1.0 adds support for Hive ACID transactions so it complies with the ACID properties of a database. With this feature, you can run INSERT, UPDATE, DELETE, and MERGE operations in Hive managed tables with data in Amazon Simple Storage Service (Amazon S3). This is a key feature for use cases like streaming ingestion, data restatement, bulk updates using MERGE, and slowly changing dimensions. For more information, including configuration examples and use cases, see Amazon EMR supports Apache Hive ACID transactions.

Changes, Enhancements, and Resolved Issues

  • Apache Flink is not supported on EMR 6.0.0, but it is supported on EMR 6.1.0 with Flink 1.11.0. This is the first version of Flink to officially support Hadoop 3. See Apache Flink 1.11.0 Release Announcement.

  • Ganglia has been removed from default EMR 6.1.0 package bundles.

Known Issues

  • If you set custom garbage collection configuration with spark.driver.extraJavaOptions and spark.executor.extraJavaOptions, this will result in driver/executor launch failure with EMR 6.1 due to conflicting garbage collection configuration. With EMR Release 6.1.0, you should specify custom Spark garbage collection configuration for drivers and executors with the properties spark.driver.defaultJavaOptions and spark.executor.defaultJavaOptions instead. Read more in Apache Spark Runtime Environment and Configuring Spark Garbage Collection on Amazon EMR 6.1.0.

  • Using Pig with Oozie (and within Hue, since Hue uses Oozie actions to run Pig scripts), generates an error that a native-lzo library cannot be loaded. This error message is informational and does not block Pig from running.

  • Hudi Concurrency Support: Currently Hudi doesn’t support concurrent writes to a single Hudi table. In addition, Hudi rolls back any changes being done by in-progress writers before allowing a new writer to start. Concurrent writes can interfere with this mechanism and introduce race conditions, which can lead to data corruption. You should ensure that as part of your data processing workflow, there is only a single Hudi writer operating against a Hudi table at any time. Hudi does support multiple concurrent readers operating against the same Hudi table.

Release 5.30.1 (Latest version of Amazon EMR 5.x series)

New Amazon EMR release versions are made available in different regions over a period of several days, beginning with the first region on the initial release date. The latest release version may not be available in your region during this period.

The following release notes include information for Amazon EMR release version 5.30.1. Changes are relative to 5.30.0.

Initial release date: June 30, 2020

Last updated date: August 24, 2020

Changes, Enhancements, and Resolved Issues

  • Fixed issue where instance controller process spawned infinite number of processes.

  • Fixed issue where Hue was unable to run an Hive query, showing a "database is locked" message and preventing the execution of queries.

  • Fixed a Spark issue to enable more tasks to run concurrently on the EMR cluster.

  • Fixed a Jupyter notebook issue causing a "too many files open error" in the Jupyter server.

  • Fixed an issue with cluster start times.

New Features

  • Tez UI and YARN timeline server persistent application interfaces are available with Amazon EMR versions 6.x, and EMR version 5.30.1 and later. One-click link access to persistent application history lets you quickly access job history without setting up a web proxy through an SSH connection. Logs for active and terminated clusters are available for 30 days after the application ends. For more information, see View Persistent Application User Interfaces in the Amazon EMR Management Guide.

  • EMR Notebook execution APIs are available to execute EMR notebooks via a script or command line. The ability to start, stop, list, and describe EMR notebook executions without the AWS console enables you programmatically control an EMR notebook. Using a parameterized notebook cell, you can pass different parameter values to a notebook without having to create a copy of the notebook for each new set of paramter values. See EMR API Actions. For sample code, see Sample commands to execute EMR Notebooks programmatically.

Known Issues

The feature that allows you to install additional Python libraries and kernels on the master node of the cluster is disabled by default on EMR version 5.30.1. To enable the feature on an EMR 5.30.1 cluster, do these two things:

  • Make sure that the EMR notebook service role includes the following permission:

    elasticmapreduce:ListSteps

    For more information, see Service Role for EMR Notebooks.

  • Run an EMR notebook setup step on the 5.30.1 cluster. See the command line interface (CLI) code below to add the step. For more information about the kernel on cluster feature, see Installing Kernels and Python Libraries on a Cluster Master Node. For information about adding a step using the command line interface, see Adding Steps to a Cluster Using the AWS CLI.

    Use the command line syntax below to run a step to enable installing Python libraries and kernels on a 5.30.1 cluster.

    aws emr add-steps --cluster-id <cluster id> --steps 'Type=CUSTOM_JAR,Name=EMRNotebooksSetup,ActionOnFailure=CONTINUE,Jar=s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://awssupportdatasvcs.com/bootstrap-actions/EMRNotebooksSetup/emr-notebooks-setup.sh"]'