Upload data to Amazon S3 - Amazon EMR

Upload data to Amazon S3

For information on how to upload objects to Amazon S3, see Add an object to your bucket in the Amazon Simple Storage Service User Guide. For more information about using Amazon S3 with Hadoop, see http://wiki.apache.org/hadoop/AmazonS3.

Create and configure an Amazon S3 bucket

Amazon EMR uses the AWS SDK for Java with Amazon S3 to store input data, log files, and output data. Amazon S3 refers to these storage locations as buckets. Buckets have certain restrictions and limitations to conform with Amazon S3 and DNS requirements. For more information, see Bucket restrictions and limitations in the Amazon Simple Storage Service User Guide.
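As a rough illustration of the naming restrictions mentioned above, the following sketch checks a candidate bucket name against a few commonly cited rules (3 to 63 characters; lowercase letters, digits, periods, and hyphens; a letter or digit at each end; no adjacent periods; not formatted like an IP address). It is an assumption-laden partial check, not the full rule set, and the function name is hypothetical. Treat the Amazon Simple Storage Service User Guide as the authority.

```python
import re

# Partial, illustrative rules only; the S3 User Guide lists the complete set.
_NAME_RE = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")
_IP_RE = re.compile(r"\d{1,3}(\.\d{1,3}){3}")


def looks_like_valid_bucket_name(name: str) -> bool:
    """Return True if the name passes this sketch's subset of naming rules."""
    if not _NAME_RE.fullmatch(name):
        return False  # wrong length, illegal character, or bad first/last character
    if ".." in name:
        return False  # adjacent periods are not DNS-compatible
    if _IP_RE.fullmatch(name):
        return False  # names formatted like an IP address are rejected
    return True


if __name__ == "__main__":
    # Hypothetical example names.
    for candidate in ["my-emr-logs", "My_Bucket", "ab", "192.168.5.4"]:
        print(candidate, looks_like_valid_bucket_name(candidate))
```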

This section shows you how to use the Amazon S3 console in the AWS Management Console to create an Amazon S3 bucket and then set its permissions. You can also create a bucket and set its permissions using the Amazon S3 API or the AWS CLI, or use curl with modifications that pass the appropriate authentication parameters for Amazon S3.

See the following resources:

Note

Enabling logging for a bucket turns on only bucket access logs, not Amazon EMR cluster logs.

During bucket creation or after, you can set the appropriate permissions to access the bucket depending on your application. Typically, you give yourself (the owner) read and write access and give authenticated users read access.

Required Amazon S3 buckets must exist before you can create a cluster. You must upload any required scripts or data referenced in the cluster to Amazon S3. The following table describes example data, scripts, and log file locations.

Configure multipart upload for Amazon S3

Amazon EMR supports Amazon S3 multipart upload through the AWS SDK for Java. Multipart upload lets you upload a single object as a set of parts. You can upload these object parts independently and in any order. If transmission of any part fails, you can retransmit that part without affecting other parts. After all parts of your object are uploaded, Amazon S3 assembles the parts and creates the object.
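To make the idea of parts concrete, here is a minimal sketch that splits an object of a given size into numbered byte ranges. It performs no upload; it only shows how a fixed part size divides an object and why the last part is usually smaller. The function name is hypothetical, and the example part size of 134217728 bytes is the default shown for fs.s3n.multipart.uploads.split.size in the table later in this section.

```python
def plan_parts(object_size: int, part_size: int) -> list:
    """Return (part_number, start_byte, end_byte_exclusive) for each part."""
    if object_size <= 0 or part_size <= 0:
        raise ValueError("object_size and part_size must be positive")
    parts = []
    start = 0
    part_number = 1  # S3 part numbers start at 1
    while start < object_size:
        end = min(start + part_size, object_size)  # the last part may be short
        parts.append((part_number, start, end))
        start = end
        part_number += 1
    return parts


if __name__ == "__main__":
    # A 300 MiB object split using a 134217728-byte part size.
    for number, start, end in plan_parts(300 * 1024 * 1024, 134217728):
        print(f"part {number}: bytes {start}..{end - 1} ({end - start} bytes)")
```

Because each tuple describes an independent byte range, a failed part can be retried on its own without touching the others, which is the property the paragraph above describes.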

For more information, see Multipart upload overview in the Amazon Simple Storage Service User Guide.

In addition, Amazon EMR offers properties that allow you to more precisely control the clean-up of failed multipart upload parts.

The following table describes the Amazon EMR configuration properties for multipart upload. You can configure these using the core-site configuration classification. For more information, see Configure applications in the Amazon EMR Release Guide.

| Configuration parameter name | Default value | Description |
| --- | --- | --- |
| fs.s3n.multipart.uploads.enabled | true | A Boolean type that indicates whether to enable multipart uploads. When EMRFS consistent view is enabled, multipart uploads are enabled by default and setting this value to false is ignored. |
| fs.s3n.multipart.uploads.split.size | 134217728 | Specifies the maximum size of a part, in bytes, before EMRFS starts a new part upload when multipart uploads are enabled. The minimum value is 5242880 (5 MB). If a lesser value is specified, 5242880 is used. The maximum is 5368709120 (5 GB). If a greater value is specified, 5368709120 is used. If EMRFS client-side encryption is disabled and the Amazon S3 Optimized Committer is also disabled, this value also controls the maximum size that a data file can grow until EMRFS uses multipart uploads rather than a PutObject request to upload the file. For more information, see |
| fs.s3n.ssl.enabled | true | A Boolean type that indicates whether to use HTTPS (true) or HTTP (false). |
| fs.s3.buckets.create.enabled | false | A Boolean type that indicates whether a bucket should be created if it does not exist. Setting to false causes an exception on CreateBucket operations. |
| fs.s3.multipart.clean.enabled | false | A Boolean type that indicates whether to enable background periodic clean-up of incomplete multipart uploads. |
| fs.s3.multipart.clean.age.threshold | 604800 | A long type that specifies the minimum age of a multipart upload, in seconds, before it is considered for cleanup. The default is one week. |
| fs.s3.multipart.clean.jitter.max | 10000 | An integer type that specifies the maximum amount of random jitter delay, in seconds, added to the 15-minute fixed delay before scheduling the next round of clean-up. |
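The following sketch shows one way to assemble these properties into a core-site configuration classification. It also mirrors the clamping behavior described for the split size (values below 5242880 or above 5368709120 are replaced by the nearest limit) so that you can see the effective value before launching a cluster. The helper names are hypothetical, and the property values in the example are illustrative choices, not recommendations.

```python
import json

MIN_PART_SIZE = 5242880      # minimum stated in the table (5 MB)
MAX_PART_SIZE = 5368709120   # maximum stated in the table (5 GB)


def effective_split_size(requested: int) -> int:
    """Clamp a requested part size into the documented range."""
    return max(MIN_PART_SIZE, min(requested, MAX_PART_SIZE))


def core_site_classification(properties: dict) -> list:
    """Wrap properties in a core-site configuration classification.

    Property values are rendered as strings; Booleans become "true"/"false".
    """
    rendered = {}
    for name, value in properties.items():
        if isinstance(value, bool):
            rendered[name] = "true" if value else "false"
        else:
            rendered[name] = str(value)
    return [{"Classification": "core-site", "Properties": rendered}]


if __name__ == "__main__":
    config = core_site_classification({
        "fs.s3n.multipart.uploads.split.size": effective_split_size(64 * 1024 * 1024),
        "fs.s3.multipart.clean.enabled": True,
        "fs.s3.multipart.clean.age.threshold": 3 * 24 * 60 * 60,  # three days, in seconds
    })
    print(json.dumps(config, indent=2))
```

The printed JSON can be saved to a file and supplied as a cluster configuration, in the same way as the myConfig.json file used in the CLI procedure in this section.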

Disable multipart uploads

Console
To disable multipart uploads with the console
  1. Sign in to the AWS Management Console, and open the Amazon EMR console at https://console.aws.amazon.com/emr.

  2. Under EMR on EC2 in the left navigation pane, choose Clusters, and then choose Create cluster.

  3. Under Software settings, enter the following configuration: classification=core-site,properties=[fs.s3n.multipart.uploads.enabled=false].

  4. Choose any other options that apply to your cluster.

  5. To launch your cluster, choose Create cluster.

CLI
To disable multipart upload using the AWS CLI

This procedure explains how to disable multipart upload using the AWS CLI. To disable multipart upload, type the create-cluster command with the --configurations parameter.

  1. Create a file, myConfig.json, with the following contents and save it in the same directory where you run the command:

    [
      {
        "Classification": "core-site",
        "Properties": {
          "fs.s3n.multipart.uploads.enabled": "false"
        }
      }
    ]
  2. Type the following command and replace myKey with the name of your EC2 key pair.

    Note

    Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).

    aws emr create-cluster --name "Test cluster" \
      --release-label emr-7.0.0 --applications Name=Hive Name=Pig \
      --use-default-roles --ec2-attributes KeyName=myKey --instance-type m5.xlarge \
      --instance-count 3 --configurations file://myConfig.json
API
To disable multipart upload using the API
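One possible way to do this programmatically is to pass the same core-site classification in the Configurations field of the RunJobFlow API action. The sketch below uses the AWS SDK for Python (Boto3), which is an assumption of this example rather than something this guide prescribes. The cluster settings mirror the CLI example, the role names are the ones that --use-default-roles normally refers to, and the function names and myKey are placeholders.

```python
def build_run_job_flow_request(key_name: str) -> dict:
    """Build RunJobFlow parameters that disable multipart uploads."""
    return {
        "Name": "Test cluster",
        "ReleaseLabel": "emr-7.0.0",
        "Applications": [{"Name": "Hive"}, {"Name": "Pig"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "Ec2KeyName": key_name,
            "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster running, like the CLI example
        },
        "Configurations": [
            {
                "Classification": "core-site",
                "Properties": {"fs.s3n.multipart.uploads.enabled": "false"},
            }
        ],
        # Assumed default role names; replace them if your account uses others.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def create_cluster(key_name: str) -> str:
    """Launch the cluster and return its ID. Requires boto3 and AWS credentials."""
    import boto3  # imported here so the request builder works without boto3 installed

    emr = boto3.client("emr")
    response = emr.run_job_flow(**build_run_job_flow_request(key_name))
    return response["JobFlowId"]


if __name__ == "__main__":
    # Prints the request only; call create_cluster("myKey") to actually launch.
    print(build_run_job_flow_request("myKey"))
```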

Best practices

The following are recommendations for using Amazon S3 buckets with EMR clusters.

Enable versioning

Versioning is a recommended configuration for your Amazon S3 bucket. By enabling versioning, you ensure that even if data is unintentionally deleted or overwritten it can be recovered. For more information, see Using versioning in the Amazon Simple Storage Service User Guide.
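As a minimal sketch, versioning can be enabled with the PutBucketVersioning action. The example below assumes the AWS SDK for Python (Boto3) and uses a placeholder bucket name; the function names are hypothetical.

```python
def versioning_request(bucket: str) -> dict:
    """Build PutBucketVersioning parameters that turn versioning on."""
    return {
        "Bucket": bucket,
        "VersioningConfiguration": {"Status": "Enabled"},
    }


def enable_versioning(bucket: str) -> None:
    """Apply the request. Requires boto3 and AWS credentials."""
    import boto3  # imported here so the request builder works without boto3 installed

    s3 = boto3.client("s3")
    s3.put_bucket_versioning(**versioning_request(bucket))


if __name__ == "__main__":
    # "amzn-s3-demo-bucket" is a placeholder; substitute your own bucket name.
    print(versioning_request("amzn-s3-demo-bucket"))
```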

Clean up failed multipart uploads

EMR cluster components use multipart uploads via the AWS SDK for Java with Amazon S3 APIs to write log files and output data to Amazon S3 by default. For information about changing properties related to this configuration using Amazon EMR, see Configure multipart upload for Amazon S3. Sometimes the upload of a large file can result in an incomplete Amazon S3 multipart upload. When a multipart upload is unable to complete successfully, the in-progress multipart upload continues to occupy your bucket and incurs storage charges. We recommend the following options to avoid excessive file storage:

  • For buckets that you use with Amazon EMR, use a lifecycle configuration rule in Amazon S3 to remove incomplete multipart uploads three days after the upload initiation date. Lifecycle configuration rules allow you to control the storage class and lifetime of objects. For more information, see Object lifecycle management and Aborting incomplete multipart uploads using a bucket lifecycle policy.

  • Enable Amazon EMR's multipart cleanup feature by setting fs.s3.multipart.clean.enabled to true and tuning the other cleanup parameters. This feature is useful at high volume, at large scale, and with clusters that have limited uptime. In these cases, the DaysAfterInitiation parameter of a lifecycle configuration rule may be too long, even if set to its minimum, causing spikes in Amazon S3 storage. Amazon EMR's multipart cleanup allows more precise control. For more information, see Configure multipart upload for Amazon S3.
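The first option can be sketched as a lifecycle rule that uses AbortIncompleteMultipartUpload with DaysAfterInitiation set to 3. The example assumes the AWS SDK for Python (Boto3); the rule ID, function names, and bucket name are hypothetical. Note that PutBucketLifecycleConfiguration replaces the bucket's whole lifecycle configuration, so any existing rules need to be included in the same call.

```python
def abort_incomplete_uploads_rule(days: int = 3, prefix: str = "") -> dict:
    """Build a lifecycle rule that aborts incomplete multipart uploads."""
    if days < 1:
        raise ValueError("days must be at least 1")
    return {
        "ID": f"abort-incomplete-multipart-uploads-{days}d",  # hypothetical rule ID
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},  # an empty prefix applies the rule to the whole bucket
        "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": days},
    }


def apply_rules(bucket: str, rules: list) -> None:
    """Replace the bucket's lifecycle configuration with the given rules.

    Requires boto3 and AWS credentials. Existing rules that are not passed
    in are removed, because the call overwrites the configuration.
    """
    import boto3  # imported here so the rule builder works without boto3 installed

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={"Rules": rules},
    )


if __name__ == "__main__":
    print(abort_incomplete_uploads_rule())
```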

Manage version markers

We recommend that you enable a lifecycle configuration rule in Amazon S3 to remove expired object delete markers for versioned buckets that you use with Amazon EMR. When you delete an object in a versioned bucket, Amazon S3 creates a delete marker. If all previous versions of the object subsequently expire, an expired object delete marker is left in the bucket. While you are not charged for delete markers, removing expired markers can improve the performance of LIST requests. For more information, see Lifecycle configuration for a bucket with versioning in the Amazon Simple Storage Service User Guide.
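A rule of this kind might look like the following sketch, which sets ExpiredObjectDeleteMarker in the rule's Expiration element. Optionally, it also expires noncurrent versions after a number of days, since a delete marker only becomes "expired" once no earlier versions remain. The rule ID, function name, and the 30-day figure in the example are assumptions for illustration.

```python
from typing import Optional


def expired_delete_marker_rule(noncurrent_days: Optional[int] = None, prefix: str = "") -> dict:
    """Build a lifecycle rule that removes expired object delete markers."""
    rule = {
        "ID": "remove-expired-object-delete-markers",  # hypothetical rule ID
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},  # an empty prefix applies the rule to the whole bucket
        "Expiration": {"ExpiredObjectDeleteMarker": True},
    }
    if noncurrent_days is not None:
        if noncurrent_days < 1:
            raise ValueError("noncurrent_days must be at least 1")
        # Expire old versions so that delete markers can eventually become expired.
        rule["NoncurrentVersionExpiration"] = {"NoncurrentDays": noncurrent_days}
    return rule


if __name__ == "__main__":
    # 30 days is an illustrative value, not a recommendation.
    print(expired_delete_marker_rule(noncurrent_days=30))
```

The resulting dictionary can be passed as one entry in the Rules list of a PutBucketLifecycleConfiguration request, alongside any other rules the bucket needs.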

Performance best practices

Depending on your workloads, specific types of usage of EMR clusters and applications on those clusters can result in a high number of requests against a bucket. For more information, see Request rate and performance considerations in the Amazon Simple Storage Service User Guide.