Amazon EMR
Management Guide

Upload Data to Amazon S3

For information on how to upload objects to Amazon S3, go to Add an Object to Your Bucket in the Amazon Simple Storage Service Getting Started Guide. For more information about using Amazon S3 with Hadoop, go to http://wiki.apache.org/hadoop/AmazonS3.

Create and Configure an Amazon S3 Bucket

Amazon EMR uses the AWS SDK for Java with Amazon S3 to store input data, log files, and output data. Amazon S3 refers to these storage locations as buckets. Buckets have certain restrictions and limitations to conform with Amazon S3 and DNS requirements. For more information, go to Bucket Restrictions and Limitations in the Amazon Simple Storage Service Developer Guide.
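The bucket naming restrictions can be sketched as a simple validator. This is an illustrative check covering a subset of the documented DNS-style rules (length, allowed characters, label boundaries, no IP-address form), not the authoritative list; the helper name is hypothetical.

```python
import re

def is_valid_bucket_name(name):
    """Check a bucket name against a subset of the Amazon S3/DNS naming rules.

    Covers: 3-63 characters; lowercase letters, digits, hyphens, and periods;
    each dot-separated label starts and ends with a letter or digit; not
    formatted like an IP address. See Bucket Restrictions and Limitations
    for the full rules.
    """
    if not 3 <= len(name) <= 63:
        return False
    # Must not be formatted like an IP address (e.g. 192.168.0.1).
    if re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", name):
        return False
    # Each dot-separated label: lowercase alphanumeric, hyphens only inside.
    label = r"[a-z0-9]([a-z0-9-]*[a-z0-9])?"
    return re.fullmatch(rf"{label}(\.{label})*", name) is not None
```

For example, `myawsbucket` and `my.aws-bucket` pass, while `My_Bucket` (uppercase and underscore) and `192.168.0.1` do not.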

This section shows you how to use the Amazon S3 AWS Management Console to create and then set permissions for an Amazon S3 bucket. However, you can also create and set permissions for an Amazon S3 bucket using the Amazon S3 API or the Curl command line tool. For information about Curl, go to Amazon S3 Authentication Tool for Curl. For information about using the Amazon S3 API to create and configure an Amazon S3 bucket, see the Amazon Simple Storage Service API Reference.

To create an Amazon S3 bucket using the console

  1. Sign in to the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/.

  2. Choose Create Bucket.

    The Create a Bucket dialog box opens.

  3. Enter a bucket name, such as myawsbucket.

    Bucket names must be globally unique; you cannot use a name that another bucket already uses.

  4. Select the Region for your bucket. To avoid paying cross-region bandwidth charges, create the Amazon S3 bucket in the same region as your cluster.

    Refer to Choose an AWS Region for guidance on choosing a Region.

  5. Choose Create.

This creates a bucket that you can reference with the URI s3n://myawsbucket/.

Note

If you enable logging in the Create a Bucket wizard, it enables only bucket access logs, not Amazon EMR cluster logs.

Note

For more information on specifying Region-specific buckets, see Buckets and Regions in the Amazon Simple Storage Service Developer Guide and Available Region Endpoints for the AWS SDKs.
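To create the same bucket programmatically, the request parameters for the S3 CreateBucket API can be built as follows. This is a sketch, not the console's implementation: the helper name is hypothetical, and it relies on the S3 API convention that US East (N. Virginia) takes no CreateBucketConfiguration while every other Region requires a matching LocationConstraint.

```python
def create_bucket_params(bucket_name, region):
    """Build keyword arguments for an S3 CreateBucket call (for example,
    to pass to a boto3 client's create_bucket).

    For us-east-1, S3 expects no CreateBucketConfiguration; every other
    Region requires a LocationConstraint matching the Region name.
    """
    params = {"Bucket": bucket_name}
    if region != "us-east-1":
        params["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return params
```

For example, `create_bucket_params("myawsbucket", "us-west-2")` includes the LocationConstraint, while the us-east-1 form omits it.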

After you create your bucket, you can set the appropriate permissions on it. Typically, you give yourself (the owner) read and write access, and give authenticated users read access.

To set permissions on an Amazon S3 bucket using the console

  1. Sign in to the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/.

  2. In the Buckets pane, right-click the bucket that you just created.

  3. Select Properties.

  4. In the Properties pane, select the Permissions tab.

  5. Choose Add more permissions.

  6. Select Authenticated Users in the Grantee field.

  7. To the right of the Grantee drop-down list, select List.

  8. Choose Save.

You have created a bucket and granted authenticated users permission to list its contents.
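The console steps above correspond to an access control list (ACL) grant in the S3 API: the Authenticated Users group with List access maps to a READ grant. The following sketch builds that grant structure; the helper name is hypothetical, and in practice you would include the grant in a PutBucketAcl request alongside the owner's FULL_CONTROL grant.

```python
# Predefined Amazon S3 group URI for all AWS-authenticated users.
AUTHENTICATED_USERS = "http://acs.amazonaws.com/groups/global/AuthenticatedUsers"

def authenticated_read_grant():
    """Build the ACL grant equivalent of the console's
    Authenticated Users + List setting (READ in the S3 ACL API)."""
    return {
        "Grantee": {"Type": "Group", "URI": AUTHENTICATED_USERS},
        "Permission": "READ",
    }
```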

Required Amazon S3 buckets must exist before you can create a cluster. You must upload any required scripts or data referenced in the cluster to Amazon S3. The following table describes example data, scripts, and log file locations.

Best practices and recommended Amazon S3 Bucket Configuration

The following are recommendations for using Amazon S3 buckets with EMR clusters.

Enable Versioning

Versioning is a recommended configuration for your Amazon S3 bucket. By enabling versioning, you ensure that even if data is unintentionally deleted or overwritten it can be recovered. For more information, go to Using Versioning in the Amazon Simple Storage Service Developer Guide.
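As a sketch of what enabling versioning looks like programmatically, the S3 PutBucketVersioning API takes a small VersioningConfiguration payload; the helper name below is hypothetical.

```python
def versioning_config(enabled=True):
    """Build the VersioningConfiguration payload for the S3
    PutBucketVersioning API: "Enabled" to turn versioning on,
    "Suspended" to stop creating new versions."""
    return {"Status": "Enabled" if enabled else "Suspended"}
```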

Lifecycle Management

Amazon S3 lifecycle management lets you create rules that control the storage class and lifetime of your objects. EMR cluster components use multipart uploads via the AWS SDK for Java with Amazon S3 APIs to write log files and output data to Amazon S3, but Amazon EMR does not automatically manage incomplete multipart uploads. Occasionally, the upload of a large file fails, leaving an in-progress multipart upload that continues to occupy space in your bucket and incurs storage charges. For buckets that you use with Amazon EMR, it is recommended that you enable a rule to remove incomplete multipart uploads three days after the upload initiation date.

When you delete an object in a versioned bucket, a delete marker is created. If all previous versions of the object subsequently expire, an expired object delete marker remains in the bucket. Although there is no charge for these delete markers, removing them can improve the performance of LIST requests. For versioned buckets that you use with Amazon EMR, it is recommended that you also enable a rule to remove expired object delete markers. For more information, see Lifecycle Configuration for a Bucket with Versioning in the Amazon Simple Storage Service Console User Guide.
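The two recommended rules can be expressed as an S3 lifecycle configuration. The following sketch builds the payload for a PutBucketLifecycleConfiguration request; the helper name and rule IDs are illustrative choices, not values the service requires.

```python
def emr_lifecycle_rules():
    """Build the lifecycle rules recommended for buckets used with
    Amazon EMR: abort incomplete multipart uploads 3 days after
    initiation, and remove expired object delete markers in
    versioned buckets. Both rules apply bucket-wide (empty prefix).
    """
    return {
        "Rules": [
            {
                "ID": "abort-incomplete-multipart-uploads",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 3},
            },
            {
                "ID": "remove-expired-delete-markers",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Expiration": {"ExpiredObjectDeleteMarker": True},
            },
        ]
    }
```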

Performance best practices

Depending on your workloads, EMR clusters and the applications that run on them can issue a high number of requests against your S3 bucket. For more information, go to Request Rate and Performance Considerations in the Amazon Simple Storage Service Developer Guide.

Configure Multipart Upload for Amazon S3

Important

You are responsible for the lifecycle management of your data stored in Amazon S3. For more information about lifecycle management, see Best practices and recommended Amazon S3 Bucket Configuration.

Amazon EMR supports Amazon S3 multipart upload through the AWS SDK for Java. Multipart upload lets you upload a single object as a set of parts. You can upload these object parts independently and in any order. If transmission of any part fails, you can retransmit that part without affecting other parts. After all parts of your object are uploaded, Amazon S3 assembles the parts and creates the object.

For more information on Amazon S3 multipart uploads, go to Uploading Objects Using Multipart Upload in the Amazon Simple Storage Service Developer Guide.
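The part-splitting arithmetic behind multipart upload can be sketched as follows. This is an illustrative planner (the helper name is hypothetical), using two documented S3 limits: every part except the last must be at least 5 MiB, and an upload may have at most 10,000 parts.

```python
import math

MiB = 1024 * 1024
MIN_PART_SIZE = 5 * MiB   # S3 minimum for every part except the last
MAX_PARTS = 10000         # S3 limit on parts per multipart upload

def plan_parts(object_size, part_size=128 * MiB):
    """Return the number of parts needed to upload object_size bytes
    in chunks of part_size. Each part can be retransmitted on its own
    if it fails, without affecting the others.

    Raises ValueError if part_size is below the 5 MiB minimum or the
    object would need more than 10,000 parts.
    """
    if part_size < MIN_PART_SIZE:
        raise ValueError("part size below the 5 MiB S3 minimum")
    parts = max(1, math.ceil(object_size / part_size))
    if parts > MAX_PARTS:
        raise ValueError("object requires more than 10,000 parts")
    return parts
```

For example, with the default 128 MiB part size, a 300 MiB object uploads as three parts: two full parts and a smaller final part.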

The Amazon EMR configuration parameters for multipart upload are described in the following table.

Configuration Parameter Name      Default Value  Description
fs.s3n.multipart.uploads.enabled  true           A boolean that indicates whether to enable multipart uploads.
fs.s3n.ssl.enabled                true           A boolean that indicates whether to use HTTPS (true) or HTTP (false).
fs.s3.buckets.create.enabled      true           A boolean that indicates whether to create the bucket automatically if it does not exist. Setting this to false causes an exception on CreateBucket operations when the bucket does not exist.

You modify the configuration parameters for multipart uploads by supplying a configuration for the core-site classification when you create the cluster.

Disable Multipart Upload Using the Amazon EMR Console

This procedure explains how to disable multipart upload using the Amazon EMR console.

To disable multipart uploads using the console

  1. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.

  2. Choose Create cluster.

  3. Choose Go to advanced options.

  4. Choose Edit Software Settings and enter the following configuration: classification=core-site,properties=[fs.s3n.multipart.uploads.enabled=false]

  5. Proceed with creating the cluster as described in Plan and Configure Clusters.

Disable Multipart Upload Using the AWS CLI

This procedure explains how to disable multipart upload using the AWS CLI. To disable multipart upload, type the create-cluster command with the --configurations parameter.

To disable multipart upload using the AWS CLI

  1. Create a file, myConfig.json, with the following contents:

    [
      {
        "Classification": "core-site",
        "Properties": {
          "fs.s3n.multipart.uploads.enabled": "false"
        }
      }
    ]
  2. Type the following command and replace myKey with the name of your EC2 key pair.

    aws emr create-cluster --name "Test cluster" --release-label emr-4.0.0 \
    --applications Name=Hive Name=Pig --use-default-roles \
    --ec2-attributes KeyName=myKey --instance-type m3.xlarge \
    --instance-count 3 --configurations file://./myConfig.json

Disable Multipart Upload Using the API

For information on using Amazon S3 multipart uploads programmatically, go to Using the AWS SDK for Java for Multipart Upload in the Amazon Simple Storage Service Developer Guide.

For more information about the AWS SDK for Java, go to the AWS SDK for Java detail page.