What's new? - Amazon EMR

What's new?

This page describes the changes and functionality available in the latest releases of Amazon EMR 6.x and Amazon EMR 5.x. These release notes are also available on the Amazon EMR release 6.9.0 page and Amazon EMR release 5.36.0 page, along with the application versions, component versions, and available configuration classifications for each release.

Subscribe to the RSS feed for Amazon EMR release notes at https://docs.aws.amazon.com/emr/latest/ReleaseGuide/amazon-emr-release-notes.rss to receive updates when a new Amazon EMR release is available.

For release notes from prior releases, see the Amazon EMR archive of release notes.

Note

Amazon EMR releases now use AWS Signature Version 4 (SigV4) to authenticate requests to Amazon S3. We recommend that you use an Amazon EMR release that supports SigV4 so that you can access new S3 buckets and avoid interruption to your workloads. For more information and a list of Amazon EMR releases that support SigV4, see Amazon EMR and AWS Signature Version 4.

Amazon EMR 6.9.0 (latest release of 6.x series)

New Amazon EMR releases are made available in different Regions over a period of several days, beginning with the first Region on the initial release date. The latest release version may not be available in your Region during this period.

The following release notes include information for Amazon EMR release 6.9.0. Changes are relative to Amazon EMR release 6.8.0. For information on the release timeline, see the change log.

New Features
  • Amazon EMR release 6.9.0 supports Apache Spark RAPIDS 22.08.0, Apache Hudi 0.12.1, Apache Iceberg 0.14.1, Trino 398, and Tez 0.10.2.

  • Amazon EMR release 6.9.0 includes a new open-source application, Delta Lake 2.1.0.

  • The Amazon Redshift integration for Apache Spark is included in Amazon EMR releases 6.9.0 and later. Previously an open-source tool, the native integration is a Spark connector that you can use to build Apache Spark applications that read from and write to data in Amazon Redshift and Amazon Redshift Serverless. For more information, see Using Amazon Redshift integration for Apache Spark with Amazon EMR.

  • Amazon EMR release 6.9.0 adds support for archiving logs to Amazon S3 during cluster scale-down. Previously, you could only archive log files to Amazon S3 during cluster termination. The new capability ensures that log files generated on the cluster persist on Amazon S3 even after the node is terminated. For more information, see Configure cluster logging and debugging.

  • To support long-running queries, Trino now includes a fault-tolerant execution mechanism. Fault-tolerant execution mitigates query failures by retrying failed queries or their component tasks. For more information, see Fault-tolerant execution in Trino.

  • You can use Apache Flink on Amazon EMR for unified BATCH and STREAM processing of Apache Hive tables or metadata of any Flink table source, such as Iceberg, Kinesis, or Kafka. You can specify the AWS Glue Data Catalog as the metastore for Flink using the AWS Management Console, the AWS CLI, or the Amazon EMR API. For more information, see Configuring Flink.

  • You can now specify AWS Identity and Access Management (IAM) runtime roles and AWS Lake Formation-based access control for Apache Spark, Apache Hive, and Presto queries on Amazon EMR on EC2 clusters with Amazon SageMaker Studio. For more information, see Configure runtime roles for Amazon EMR steps.
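
    As a rough illustration of attaching a runtime role to a step, here is a boto3-style sketch. The cluster ID, role ARN, bucket, and script names are hypothetical placeholders, and the API call itself is left commented out; the `ExecutionRoleArn` parameter of the `AddJobFlowSteps` API carries the runtime role:

    ```python
    # Hypothetical identifiers -- substitute your own cluster, role, and script.
    cluster_id = "j-XXXXXXXXXXXXX"
    runtime_role_arn = "arn:aws:iam::111122223333:role/my-emr-runtime-role"

    params = {
        "JobFlowId": cluster_id,
        # The IAM runtime role that the submitted steps run as.
        "ExecutionRoleArn": runtime_role_arn,
        "Steps": [
            {
                "Name": "spark-step-with-runtime-role",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "s3://amzn-s3-demo-bucket/scripts/job.py"],
                },
            }
        ],
    }

    # To submit the step, uncomment the following lines on a machine with AWS credentials:
    # import boto3
    # boto3.client("emr").add_job_flow_steps(**params)
    ```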

Known Issues
  • For Amazon EMR release 6.9.0, Trino does not work on clusters enabled for Apache Ranger. If you need to use Trino with Ranger, contact AWS Support.

  • If you use the Amazon Redshift integration for Apache Spark and have a time, timetz, timestamp, or timestamptz with microsecond precision in Parquet format, the connector rounds the time values to the nearest millisecond. As a workaround, set the unload_s3_format parameter to the text unload format.

  • When you use Spark with Hive partition location formatting to read data in Amazon S3, and you run Spark on Amazon EMR releases 5.30.0 to 5.36.0, and 6.2.0 to 6.9.0, you might encounter an issue that prevents your cluster from reading data correctly. This can happen if your partitions have all of the following characteristics:

    • Two or more partitions are scanned from the same table.

    • At least one partition directory path is a prefix of at least one other partition directory path, for example, s3://bucket/table/p=a is a prefix of s3://bucket/table/p=a b.

    • The first character that follows the prefix in the other partition directory has a UTF-8 value that's less than the / character (U+002F). For example, the space character (U+0020) that occurs between a and b in s3://bucket/table/p=a b falls into this category. Note that there are 14 other non-control characters: !"#$%&'()*+,-. For more information, see UTF-8 encoding table and Unicode characters.

    As a workaround to this issue, set the spark.sql.sources.fastS3PartitionDiscovery.enabled configuration to false in the spark-defaults classification.
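
    The ordering behind this issue can be illustrated outside of Spark: fast S3 partition discovery compares key prefixes, and any printable character that sorts below / (U+002F) can make one partition path appear inside another's listing range. A minimal sketch in Python, using hypothetical partition paths:

    ```python
    # Printable characters whose UTF-8 code point is below that of '/' (U+002F):
    # the space character (U+0020) plus 14 other non-control characters.
    below_slash = [chr(c) for c in range(0x20, 0x2F)]  # space, then !"#$%&'()*+,-.

    # Hypothetical partition directory paths: p=a is a prefix of "p=a b".
    p1 = "s3://bucket/table/p=a/"
    p2 = "s3://bucket/table/p=a b/"

    # Because ' ' (U+0020) sorts before '/' (U+002F), the second path sorts
    # ahead of the first path's subtree in a lexicographic S3 listing, which
    # is what trips up fast S3 partition discovery.
    listing = sorted([p1, p2])
    print(listing[0])  # s3://bucket/table/p=a b/
    ```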

  • Connections to Amazon EMR clusters from Amazon SageMaker Studio may intermittently fail with a 403 Forbidden response code. This error happens when setup of the IAM role on the cluster takes longer than 60 seconds. As a workaround, you can install an Amazon EMR patch to enable retries and increase the timeout to a minimum of 300 seconds. Use the following steps to apply the bootstrap action when you launch your cluster.

    1. Download the bootstrap script and RPM files from Amazon S3 using the following URIs. Replace regionName with the AWS Region where you plan to launch the cluster.

      s3://emr-data-access-control-regionName/customer-bootstrap-actions/gcsc/replace-rpms.sh
      s3://emr-data-access-control-regionName/customer-bootstrap-actions/gcsc/emr-secret-agent-1.18.0-SNAPSHOT20221121212949.noarch.rpm
    2. Upload the files from the previous step to an Amazon S3 bucket that you own. The bucket must be in the same AWS Region where you plan to launch the cluster.

    3. Include the following bootstrap action when you launch your EMR cluster. Replace bootstrap_URI and RPM_URI with the corresponding URIs from Amazon S3.

      --bootstrap-actions "Path=bootstrap_URI,Args=[RPM_URI]"
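
      The three steps above amount to substituting your Region into the download URIs and pointing the bootstrap action at your own copies. A minimal sketch in Python; the Region value and destination bucket name are hypothetical placeholders:

      ```python
      # Step 1: source URIs for the patch, with the Region substituted in.
      region = "us-east-1"  # hypothetical; use the Region where you launch the cluster
      prefix = f"s3://emr-data-access-control-{region}/customer-bootstrap-actions/gcsc"
      script_uri = f"{prefix}/replace-rpms.sh"
      rpm_uri = f"{prefix}/emr-secret-agent-1.18.0-SNAPSHOT20221121212949.noarch.rpm"

      # Steps 2-3: after copying both files to a bucket you own (hypothetical
      # name below), pass your copies to the cluster as a bootstrap action.
      my_script_uri = "s3://amzn-s3-demo-bucket/gcsc/replace-rpms.sh"
      my_rpm_uri = "s3://amzn-s3-demo-bucket/gcsc/emr-secret-agent-1.18.0-SNAPSHOT20221121212949.noarch.rpm"
      bootstrap_arg = f"Path={my_script_uri},Args=[{my_rpm_uri}]"
      print(bootstrap_arg)  # value for the --bootstrap-actions option
      ```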
  • Apache Flink provides Native S3 FileSystem and Hadoop FileSystem connectors, which let applications create a FileSink and write data to Amazon S3. This FileSink fails with one of the following two exceptions.

    java.lang.UnsupportedOperationException: Recoverable writers on Hadoop are only supported for HDFS
    Caused by: java.lang.NoSuchMethodError: org.apache.hadoop.io.retry.RetryPolicies.retryOtherThanRemoteAndSaslException(Lorg/apache/hadoop/io/retry/RetryPolicy;Ljava/util/Map;)Lorg/apache/hadoop/io/retry/RetryPolicy; at org.apache.hadoop.yarn.client.RMProxy.createRetryPolicy(RMProxy.java:302) ~[hadoop-yarn-common-3.3.3-amzn-0.jar:?]

    As a workaround, you can install an Amazon EMR patch that fixes the above issue in Flink. To apply the bootstrap action when you launch your cluster, complete the following steps.

    1. Download the Flink RPM to your Amazon S3 bucket. Your RPM path is s3://DOC-EXAMPLE-BUCKET/rpms/flink/.

    2. Download the bootstrap script and RPM files from Amazon S3 using the following URI. Replace regionName with the AWS Region where you plan to launch the cluster.

      s3://emr-data-access-control-regionName/customer-bootstrap-actions/gcsc/replace-rpms.sh
Changes, Enhancements, and Resolved Issues
  • When you use the DynamoDB connector with Spark on Amazon EMR versions 6.6.0, 6.7.0, and 6.8.0, all reads from your table return an empty result, even though the input split references non-empty data. Amazon EMR release 6.9.0 fixes this issue.

  • With Amazon EMR release 6.6.0 and later, when you launch new Amazon EMR clusters with the default Amazon Linux (AL) AMI option, Amazon EMR automatically uses the latest Amazon Linux AMI. In earlier releases, Amazon EMR does not update the Amazon Linux AMIs after the initial release. See Using the default Amazon Linux AMI for Amazon EMR.

    OsReleaseLabel (Amazon Linux version) | Amazon Linux kernel version | Available date | Supported Regions
    2.0.20221210.1 | 4.14.301 | January 12, 2023 | us-east-1, us-east-2, us-west-1, us-west-2, eu-north-1, eu-west-1, eu-west-2, eu-west-3, eu-central-1, eu-south-1, ap-east-1, ap-south-1, ap-southeast-3, ap-northeast-1, ap-northeast-2, ap-northeast-3, ap-southeast-1, ap-southeast-2, af-south-1, sa-east-1, me-south-1, ca-central-1
    2.0.20221103.3 | 4.14.296 | December 5, 2022 | us-east-1, us-east-2, us-west-1, us-west-2, eu-north-1, eu-west-1, eu-west-2, eu-west-3, eu-central-1, eu-south-1, ap-east-1, ap-south-1, ap-southeast-3, ap-northeast-1, ap-northeast-2, ap-northeast-3, ap-southeast-1, ap-southeast-2, af-south-1, sa-east-1, me-south-1, ca-central-1

Amazon EMR 5.36.0 (latest release of 5.x series)

New Amazon EMR releases are made available in different Regions over a period of several days, beginning with the first Region on the initial release date. The latest release version may not be available in your Region during this period.

The following release notes include information for Amazon EMR release 5.36.0. Changes are relative to 5.35.0.

Initial release date: June 15, 2022

New Features
  • Amazon EMR release 5.36.0 adds support for data definition language (DDL) with Apache Spark on Apache Ranger enabled clusters. This allows you to use Apache Ranger to manage access for operations like creating, altering, and dropping databases and tables from an Amazon EMR cluster.

  • Amazon EMR 5.36.0 supports automatic Amazon Linux updates for clusters using a default AMI. See Using the default Amazon Linux AMI for Amazon EMR.

    OsReleaseLabel (Amazon Linux Version) | Amazon Linux Kernel Version | Available Date | Supported Regions
    2.0.20221210.1 | 4.14.301 | January 12, 2023 | us-east-1, us-east-2, us-west-1, us-west-2, eu-north-1, eu-west-1, eu-west-2, eu-west-3, eu-central-1, eu-south-1, ap-east-1, ap-south-1, ap-southeast-3, ap-northeast-1, ap-northeast-2, ap-northeast-3, ap-southeast-1, ap-southeast-2, af-south-1, sa-east-1, me-south-1, ca-central-1
    2.0.20221103.3 | 4.14.296 | December 5, 2022 | us-east-1, us-east-2, us-west-1, us-west-2, eu-north-1, eu-west-1, eu-west-2, eu-west-3, eu-central-1, eu-south-1, ap-east-1, ap-south-1, ap-southeast-3, ap-northeast-1, ap-northeast-2, ap-northeast-3, ap-southeast-1, ap-southeast-2, af-south-1, sa-east-1, me-south-1, ca-central-1
    2.0.20221004.0 | 4.14.294 | November 2, 2022 | us-east-1, us-east-2, us-west-1, us-west-2, eu-north-1, eu-west-1, eu-west-2, eu-west-3, eu-central-1, eu-south-1, ap-east-1, ap-south-1, ap-southeast-3, ap-northeast-1, ap-northeast-2, ap-northeast-3, ap-southeast-1, ap-southeast-2, af-south-1, sa-east-1, me-south-1, ca-central-1
    2.0.20220912.1 | 4.14.291 | October 7, 2022 | us-east-1, us-east-2, us-west-1, us-west-2, eu-north-1, eu-west-1, eu-west-2, eu-west-3, eu-central-1, eu-south-1, ap-east-1, ap-south-1, ap-southeast-3, ap-northeast-1, ap-northeast-2, ap-northeast-3, ap-southeast-1, ap-southeast-2, af-south-1, sa-east-1, me-south-1, ca-central-1
    2.0.20220719.0 | 4.14.287 | August 10, 2022 | us-west-1, eu-west-3, eu-north-1, eu-central-1, ap-south-1, me-south-1
    2.0.20220426.0 | 4.14.281 | June 14, 2022 | us-east-1, us-east-2, us-west-1, us-west-2, eu-north-1, eu-west-1, eu-west-2, eu-west-3, eu-central-1, eu-south-1, ap-east-1, ap-south-1, ap-southeast-3, ap-northeast-1, ap-northeast-2, ap-northeast-3, ap-southeast-1, ap-southeast-2, af-south-1, sa-east-1, me-south-1, ca-central-1
Changes, Enhancements, and Resolved Issues
  • Amazon EMR 5.36.0 now supports the following component upgrades: aws-sdk 1.12.206, Hadoop 2.10.1-amzn-4, Hive 2.3.9-amzn-2, Hudi 0.10.1-amzn-1, Spark 2.4.8-amzn-2, Presto 0.267-amzn-1, AWS Glue connector 1.18.0, and EMRFS 2.51.0.

Known issues
  • When you use Spark with Hive partition location formatting to read data in Amazon S3, and you run Spark on Amazon EMR releases 5.30.0 to 5.36.0, and 6.2.0 to 6.9.0, you might encounter an issue that prevents your cluster from reading data correctly. This can happen if your partitions have all of the following characteristics:

    • Two or more partitions are scanned from the same table.

    • At least one partition directory path is a prefix of at least one other partition directory path, for example, s3://bucket/table/p=a is a prefix of s3://bucket/table/p=a b.

    • The first character that follows the prefix in the other partition directory has a UTF-8 value that's less than the / character (U+002F). For example, the space character (U+0020) that occurs between a and b in s3://bucket/table/p=a b falls into this category. Note that there are 14 other non-control characters: !"#$%&'()*+,-. For more information, see UTF-8 encoding table and Unicode characters.

    As a workaround to this issue, set the spark.sql.sources.fastS3PartitionDiscovery.enabled configuration to false in the spark-defaults classification.
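
    The workaround can be supplied as a configuration classification when you create the cluster. The following is a minimal sketch of the spark-defaults classification; only the one property is shown, so merge it with any other Spark defaults you set:

    ```json
    [
      {
        "Classification": "spark-defaults",
        "Properties": {
          "spark.sql.sources.fastS3PartitionDiscovery.enabled": "false"
        }
      }
    ]
    ```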

Amazon EMR and AWS Signature Version 4

Amazon EMR releases now use AWS Signature Version 4 (SigV4) to authenticate requests to Amazon S3. Buckets created in Amazon S3 after June 24, 2020 don't support requests signed by Signature Version 2 (SigV2). Buckets created on or before June 24, 2020 will continue to support SigV2. We recommend that you migrate to an Amazon EMR release that supports SigV4 so that you can access new S3 buckets and avoid interruption to your workloads.

If you use applications that are included with Amazon EMR such as Apache Spark, Apache Hive, and Presto, you don't need to change your application code to use SigV4. If you use custom applications that are not included with Amazon EMR, you might need to update your code to use SigV4. For more information, see Moving from Signature Version 2 to Signature Version 4 in the Amazon S3 User Guide.

The following Amazon EMR releases support SigV4: emr-4.7.4, emr-4.8.5, emr-4.9.6, emr-4.10.1, emr-5.1.1, emr-5.2.3, emr-5.3.2, emr-5.4.1, emr-5.5.4, emr-5.6.1, emr-5.7.1, emr-5.8.3, emr-5.9.1, emr-5.10.1, emr-5.11.4, emr-5.12.3, emr-5.13.1, emr-5.14.2, emr-5.15.1, emr-5.16.1, emr-5.17.2, emr-5.18.1, emr-5.19.1, emr-5.20.1, emr-5.21.2, and emr-5.22.0 and later.
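
When scripting against release labels, the list above can be encoded as a simple check. This helper is purely illustrative: the function name is ours, and the label set is copied from the list above.

```python
# Patch releases of the 4.x and 5.x lines that support SigV4, taken from
# the list above. Releases emr-5.22.0 and later (including all 6.x
# releases) also support SigV4.
SIGV4_RELEASES = {
    "emr-4.7.4", "emr-4.8.5", "emr-4.9.6", "emr-4.10.1",
    "emr-5.1.1", "emr-5.2.3", "emr-5.3.2", "emr-5.4.1", "emr-5.5.4",
    "emr-5.6.1", "emr-5.7.1", "emr-5.8.3", "emr-5.9.1", "emr-5.10.1",
    "emr-5.11.4", "emr-5.12.3", "emr-5.13.1", "emr-5.14.2", "emr-5.15.1",
    "emr-5.16.1", "emr-5.17.2", "emr-5.18.1", "emr-5.19.1", "emr-5.20.1",
    "emr-5.21.2",
}

def supports_sigv4(label: str) -> bool:
    """Return True if an EMR release label appears in the SigV4 list above."""
    if label in SIGV4_RELEASES:
        return True
    # emr-5.22.0 and later support SigV4.
    version = tuple(int(p) for p in label.removeprefix("emr-").split("."))
    return version >= (5, 22, 0)

print(supports_sigv4("emr-5.21.2"))  # True
print(supports_sigv4("emr-5.21.0"))  # False
print(supports_sigv4("emr-6.9.0"))   # True
```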