Upload data to Amazon S3 Express One Zone
Overview
With Amazon EMR 6.15.0 and higher, you can use Amazon EMR with Apache Spark in conjunction with the Amazon S3 Express One Zone storage class for improved performance on your Spark jobs. Amazon EMR releases 7.2.0 and higher also support HBase, Flink, and Hive, so you can also benefit from S3 Express One Zone if you use these applications. S3 Express One Zone is an S3 storage class for applications that frequently access data with hundreds of thousands of requests per second. At the time of its release, S3 Express One Zone delivered the lowest-latency, highest-performance cloud object storage in Amazon S3.
Prerequisites
- S3 Express One Zone permissions – When S3 Express One Zone initially performs an action like GET, LIST, or PUT on an S3 object, the storage class calls CreateSession on your behalf. Your IAM policy must allow the s3express:CreateSession permission so that the S3A connector can invoke the CreateSession API. For an example policy with this permission, see Getting started with Amazon S3 Express One Zone.
- S3A connector – To configure your Spark cluster to access data from an Amazon S3 bucket that uses the S3 Express One Zone storage class, you must use the Apache Hadoop connector S3A. To use the connector, ensure all S3 URIs use the s3a scheme. If they don't, you can change the filesystem implementation that you use for the s3 and s3n schemes.
To change the s3 scheme, specify the following cluster configurations:

[
  {
    "Classification": "core-site",
    "Properties": {
      "fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
      "fs.AbstractFileSystem.s3.impl": "org.apache.hadoop.fs.s3a.S3A"
    }
  }
]
To change the s3n scheme, specify the following cluster configurations:

[
  {
    "Classification": "core-site",
    "Properties": {
      "fs.s3n.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
      "fs.AbstractFileSystem.s3n.impl": "org.apache.hadoop.fs.s3a.S3A"
    }
  }
]
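Because the s3 and s3n remappings differ only in the scheme name, you can generate both classification blocks programmatically before passing them to the EMR Configurations API. The following sketch is illustrative; the helper name is hypothetical, but the property keys and values match the configurations above.

```python
import json

def s3a_scheme_remap(schemes=("s3", "s3n")):
    """Build core-site properties that remap the given URI schemes to the S3A connector."""
    props = {}
    for scheme in schemes:
        # Per-scheme filesystem implementation, as shown in the configurations above.
        props[f"fs.{scheme}.impl"] = "org.apache.hadoop.fs.s3a.S3AFileSystem"
        props[f"fs.AbstractFileSystem.{scheme}.impl"] = "org.apache.hadoop.fs.s3a.S3A"
    return [{"Classification": "core-site", "Properties": props}]

print(json.dumps(s3a_scheme_remap(), indent=2))
```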
Getting started with Amazon S3 Express One Zone
Create a permission policy
Before you can create a cluster that uses Amazon S3 Express One Zone, you must create an IAM policy to attach to the Amazon EC2 instance profile for the cluster. The policy must have permissions to access the S3 Express One Zone storage class. The following example policy shows how to grant the required permission. After you create the policy, attach the policy to the instance profile role that you use to create your EMR cluster, as described in the Create and configure your cluster section.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Resource": "arn:aws:s3express:region-code:account-id:bucket/amzn-s3-demo-bucket",
      "Action": [
        "s3express:CreateSession"
      ]
    }
  ]
}
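If you create this policy from a script rather than the console, you can build the policy document in code and pass it to IAM. The following is a minimal sketch; the function name is hypothetical, and the Region, account ID, and bucket name in the usage line are placeholder values.

```python
import json

def s3express_create_session_policy(region, account_id, bucket):
    """Build an IAM policy document that grants s3express:CreateSession on one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Directory-bucket ARNs use the s3express service prefix.
                "Resource": f"arn:aws:s3express:{region}:{account_id}:bucket/{bucket}",
                "Action": ["s3express:CreateSession"],
            }
        ],
    }

# Placeholder values for illustration only.
policy = s3express_create_session_policy("us-east-1", "111122223333", "amzn-s3-demo-bucket")
print(json.dumps(policy, indent=2))
```

You could then attach the resulting document with, for example, the IAM CreatePolicy API, and attach that policy to the instance profile role for your cluster.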
Create and configure your cluster
Next, create a cluster that runs Spark, HBase, Flink, or Hive with S3 Express One Zone. The following steps provide a high-level overview of creating a cluster in the AWS Management Console:
- Navigate to the Amazon EMR console and select Clusters from the sidebar. Then choose Create cluster.
- If you use Spark, select Amazon EMR release emr-6.15.0 or higher. If you use HBase, Flink, or Hive, select emr-7.2.0 or higher.
- Select the applications that you want to include on your cluster, such as Spark, HBase, or Flink.
- To enable Amazon S3 Express One Zone, enter a configuration similar to the following example in the Software settings section. The configurations and recommended values are described in the Configurations overview section that follows this procedure.

  [
    {
      "Classification": "core-site",
      "Properties": {
        "fs.s3a.aws.credentials.provider": "software.amazon.awssdk.auth.credentials.InstanceProfileCredentialsProvider",
        "fs.s3a.change.detection.mode": "none",
        "fs.s3a.endpoint.region": "aa-example-1",
        "fs.s3a.select.enabled": "false"
      }
    },
    {
      "Classification": "spark-defaults",
      "Properties": {
        "spark.sql.sources.fastS3PartitionDiscovery.enabled": "false"
      }
    }
  ]

- In the EC2 instance profile for Amazon EMR section, choose an existing role that has the policy you created in the Create a permission policy section attached.
- Configure the rest of your cluster settings as appropriate for your application, and then select Create cluster.
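The same cluster can also be created programmatically instead of through the console. The following sketch builds the configuration list from the procedure above and shows, in commented form, how it could be passed to the EMR RunJobFlow API via boto3. The cluster name, instance types, role names, and Region are placeholder values; aa-example-1 stands in for your bucket's Region, as in the example configuration above.

```python
import json

# Configuration list from the console procedure above.
configurations = [
    {
        "Classification": "core-site",
        "Properties": {
            "fs.s3a.aws.credentials.provider": "software.amazon.awssdk.auth.credentials.InstanceProfileCredentialsProvider",
            "fs.s3a.change.detection.mode": "none",
            "fs.s3a.endpoint.region": "aa-example-1",  # placeholder: your bucket's Region
            "fs.s3a.select.enabled": "false",
        },
    },
    {
        "Classification": "spark-defaults",
        "Properties": {"spark.sql.sources.fastS3PartitionDiscovery.enabled": "false"},
    },
]

# Hedged sketch of the API call (requires AWS credentials, so it is commented out):
# import boto3
# emr = boto3.client("emr", region_name="aa-example-1")
# emr.run_job_flow(
#     Name="s3-express-spark",              # placeholder cluster name
#     ReleaseLabel="emr-6.15.0",            # emr-7.2.0 or higher for HBase, Flink, or Hive
#     Applications=[{"Name": "Spark"}],
#     Configurations=configurations,
#     JobFlowRole="EMR_EC2_DefaultRole",    # instance profile with s3express:CreateSession
#     ServiceRole="EMR_DefaultRole",
#     Instances={
#         "InstanceCount": 3,
#         "MasterInstanceType": "m5.xlarge",
#         "SlaveInstanceType": "m5.xlarge",
#     },
# )

print(json.dumps(configurations, indent=2))
```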
Configurations overview
The following tables describe the configurations and suggested values that you should specify when you set up a cluster that uses S3 Express One Zone with Amazon EMR, as described in the Create and configure your cluster section.
S3A configurations
| Parameter | Default value | Suggested value | Explanation |
|---|---|---|---|
| fs.s3a.aws.credentials.provider | If not specified, uses the S3A default credential provider chain | software.amazon.awssdk.auth.credentials.InstanceProfileCredentialsProvider | The Amazon EMR instance profile role should have the policy that allows the S3A filesystem to call CreateSession. |
| fs.s3a.endpoint.region | null | The AWS Region where you created the bucket | Region resolution logic doesn't work with the S3 Express One Zone storage class. |
| fs.s3a.select.enabled | true | false | Amazon S3 Select isn't supported by the S3 Express One Zone storage class. |
| fs.s3a.change.detection.mode | server | none | Change detection by S3A works by checking MD5-based ETags, which the S3 Express One Zone storage class doesn't support. |
Spark configurations
| Parameter | Default value | Suggested value | Explanation |
|---|---|---|---|
| spark.sql.sources.fastS3PartitionDiscovery.enabled | true | false | The internal optimization uses an S3 API parameter that the S3 Express One Zone storage class doesn't support. |
Considerations
Consider the following when you integrate Apache Spark on Amazon EMR with the S3 Express One Zone storage class:
- The S3A connector is required to use S3 Express One Zone with Amazon EMR. Only the S3A connector has the features required to interact with the S3 Express One Zone storage class. For steps to set up the connector, see Prerequisites.
- The Amazon S3 Express One Zone storage class is only supported with Spark on an Amazon EMR cluster that runs on Amazon EC2.
- The Amazon S3 Express One Zone storage class only supports SSE-S3 encryption. For more information, see Server-side encryption with Amazon S3 managed keys (SSE-S3).
- The Amazon S3 Express One Zone storage class does not support writes with the S3A FileOutputCommitter. Writes with the S3A FileOutputCommitter on S3 Express One Zone buckets result in the error: InvalidStorageClass: The storage class you specified is not valid.
- Amazon S3 Express One Zone is supported with Amazon EMR releases 6.15.0 and higher on Amazon EMR on EC2. Additionally, it's supported on Amazon EMR releases 7.2.0 and higher on Amazon EMR on EKS and on Amazon EMR Serverless.