Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)
« PreviousNext »
View the PDF for this guide.Go to the AWS Discussion Forum for this product.Go to the Kindle Store to download this guide in Kindle format.Did this page help you?  Yes | No |  Tell us about it...

Distributed Copy Using S3DistCp

Apache DistCp is an open-source tool you can use to copy large amounts of data. DistCp uses MapReduce to copy in a distributed manner—sharing the copy, error handling, recovery, and reporting tasks across several servers. For more information about the Apache DistCp open source project, go to http://hadoop.apache.org/docs/stable/distcp.html.

S3DistCp is an extension of DistCp that is optimized to work with Amazon Web Services (AWS), particularly Amazon Simple Storage Service (Amazon S3). You use S3DistCp by adding it as a step in a cluster. Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS where it can be processed by subsequent steps in your Amazon Elastic MapReduce (Amazon EMR) cluster. You can also use S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon S3.

S3DistCp is stored as a JAR file at /home/hadoop/lib/emr-s3distcp-1.0.jar.

Note

If you are using an AMI version older than 2.3.5 and AWS Identity and Access Management (IAM) roles in your cluster, use the version of S3DistCp at s3://sa-east-1.elasticmapreduce/libs/s3distcp/role/s3distcp.jar instead of the location above. For more information, see Configure IAM Roles for Amazon EMR.

If a file already exists in the destination location, S3DistCp overwrites it. This is true for destinations in both Amazon S3 and HDFS.

If S3DistCp is unable to copy some or all of the specified files, the cluster step fails and returns a non-zero error code. If this occurs, S3DistCp does not clean up partially copied files.

S3DistCp Options

When you call S3DistCp, you can specify options that change how it copies and compresses data. These are described in the following table. The options are added to the step using either the --arg or --args syntax, examples of which are shown following the table.

Option Description Required
--src,LOCATION

Location of the data to copy. This can be either an HDFS or Amazon S3 location.

Example: --src,s3://myawsbucket/logs/j-3GY8JC4179IOJ/node

Yes
--dest,LOCATION

Destination for the data. This can be either an HDFS or Amazon S3 location.

Example: --dest,hdfs:///output

Yes
--srcPattern,PATTERN

A regular expression that filters the copy operation to a subset of the data at --src. If neither --srcPattern nor --groupBy is specified, all data at --src is copied to --dest.

If the regular expression argument contains special characters, such as an asterisk (*), either the regular expression or the entire --args string must be enclosed in single quotes (').

Example: --srcPattern,.*daemons.*-hadoop-.*

No
--groupBy,PATTERN

A regular expression that causes S3DistCp to concatenate files that match the expression. For example, you could use this option to combine all of the log files written in one hour into a single file. The concatenated filename is the value matched by the regular expression for the grouping.

Parentheses indicate how files should be grouped, with all of the items that match the parenthetical statement being combined into a single output file. If the regular expression does not include a parenthetical statement, the cluster will fail on the S3DistCp step and return an error.

If the regular expression argument contains special characters, such as an asterisk (*), either the regular expression or the entire --args string must be enclosed in single quotes (').

When --groupBy is specified, only files that match the specified pattern will be copied. You do not need to specify --groupBy and --srcPattern at the same time.

Example: --groupBy,.*subnetid.*([0-9]+-[0-9]+-[0-9]+-[0-9]+).*

No
--targetSize,SIZE

The size, in mebibytes (MiB), of the files to create based on the --groupBy option. This value must be an integer. When --targetSize is set, S3DistCp will attempt to match this size; the actual size of the copied files may be larger or smaller than this value.

If the files concatenated by --groupBy are larger than the value of --targetSize, they will be broken up into part files, which will be named sequentially with a numeric value appended to the end. For example, a file concatenated into myfile.gz would be broken into parts as: myfile0.gz, myfile1.gz, etc.

Example: --targetSize,2

No
--outputCodec,CODEC

Specifies the compression codec to use for the copied files. This can take the values: gzip, lzo, snappy, or none. You can use this option, for example, to convert input files compressed with Gzip into output files with LZO compression, or to uncompress the files as part of the copy operation. If you do not specify a value for --outputCodec the files are copied over with no change in their compression.

Example: --outputCodec,lzo

No
--s3ServerSideEncryption

Ensures that the target data is transferred using SSL and automatically encrypted in Amazon S3 using an AWS service-side key. When retrieving data using s3distcp, the objects are automatically unencrypted. If you attempt to copy an unencrypted object to an encryption-required Amazon S3 bucket, the operation will fail. For more information, see Using Data Encryption.

Example: --s3ServerSideEncryption

No
--deleteOnSuccess

If the copy operation is successful, this option causes S3DistCp to delete the copied files from the source location. This is useful if you are copying output files, such as log files, from one location to another as a scheduled task, and you don't want to copy the same files twice.

Example: --deleteOnSuccess

No
--disableMultipartUpload

Disables the use of multipart upload. For more information about multipart upload, see Configure Multipart Upload for Amazon S3.

Example: --disableMultipartUpload

No
--multipartUploadChunkSize,SIZE

The size, in MiB, of the multipart upload part size. By default it will use multipart upload when writing to Amazon S3. The default chunk size is 16 MiB.

Example: --multipartUploadChunkSize,32

No
--numberFiles

Prepends output files with sequential numbers. The count starts at 0 unless a different value is specified by --startingIndex.

Example: --numberFiles

No
--startingIndex,INDEX

Used with --numberFiles to specify the first number in the sequence.

Example: --startingIndex,1

No
--outputManifest,FILENAME

Creates a text file, compressed with Gzip, that contains a list of all the files that were copied by S3DistCp.

Example: --outputManifest,manifest-1.gz

No
--previousManifest,PATH

Reads a manifest file that was created during a previous call to S3DistCp using the --outputManifest flag. When the --previousManifest flag is set, S3DistCp excludes the files listed in the manifest from the copy operation. If --outputManifest is specified along with --previousManifest, files listed in the previous manifest will also appear in the new manifest file, although the files will not be recopied.

Example: --previousManifest,/usr/bin/manifest-1.gz

No
--copyFromManifest

Reverses the behavior of --previousManifest to cause S3DistCp to use the specified manifest file as a list of files to copy, instead of a list of files to exclude from copying.

Example: --copyFromManifest --previousManifest,/usr/bin/manifest-1.gz

No
--s3Endpoint ENDPOINT

Specifies the Amazon S3 endpoint to use when uploading a file. This option sets the endpoint for both the source and destination. If not set, the default endpoint is s3.amazonaws.com. For a list of the Amazon S3 endpoints, see Regions and Endpoints.

Example: --s3Endpoint s3-eu-west-1.amazonaws.com

No

In addition to the options above, S3DistCp implements the Tool interface which means that it supports the generic options.

Adding S3DistCp as a Step in a Cluster

You can call S3DistCp by adding it as a step in your cluster.

To add a S3DistCp step to a cluster using the CLI

  • Add a step to the cluster that calls S3DistCp, passing in the parameters that specify how S3DistCp should perform the copy operation. For more information about adding steps to a cluster, see Add Steps to a Cluster.

    The following example copies daemon logs from Amazon S3 to hdfs:///output.

    In this CLI command:

    • --jobflow specifies the cluster to add the copy step to.

    • --jar is the location of the S3DistCp JAR file.

    • --args is a comma-separated list of the option name-value pairs to pass in to S3DistCp. For a complete list of the available options, see S3DistCp Options. You can also specify the options singly, using multiple --arg parameters. Both forms are shown in examples below.

    You can use either the --args or --arg syntax to pass options into the cluster step. The --args parameter is a convenient way to pass in several --arg parameters at one time. It splits the string passed in on comma (,) characters to parse them into arguments. This syntax is shown in the following example. Note that the value passed in by --args is enclosed in single quotes ('). This prevents asterisks (*) and any other special characters in any regular expressions from being expanded by the Linux shell.

    In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information about the Amazon EMR CLI, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --jobflow JobFlowID --jar \
      /home/hadoop/lib/emr-s3distcp-1.0.jar \
      --args 'S3DistCp-OptionName1,S3DistCp-OptionValue1, \
      S3DistCp-OptionName2,S3DistCp-OptionValue2,\
      S3DistCp-OptionName3,S3DistCp-OptionValue3'
    • Windows users:

      ruby elastic-mapreduce --jobflow JobFlowID --jar 
      /home/hadoop/lib/emr-s3distcp-1.0.jar 
      --args 'S3DistCp-OptionName1,S3DistCp-OptionValue1, 
      S3DistCp-OptionName2,S3DistCp-OptionValue2,
      S3DistCp-OptionName3,S3DistCp-OptionValue3'

    If the value of a S3DistCp option contains a comma, you cannot use --args, and must use instead individual --arg parameters to pass in the S3DistCp option names and values. Only the --src and --dest arguments are required. Note that the option values are enclosed in single quotes ('). This prevents asterisks (*) and any other special characters in any regular expressions from being expanded by the Linux shell.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --jobflow JobFlowID --jar \
      /home/hadoop/lib/emr-s3distcp-1.0.jar \
      --arg S3DistCp-OptionName1 --arg 'S3DistCp-OptionValue1' \
      --arg S3DistCp-OptionName2 --arg 'S3DistCp-OptionValue2' \
      --arg S3DistCp-OptionName3 --arg 'S3DistCp-OptionValue3' 
    • Windows users:

      ruby elastic-mapreduce --jobflow JobFlowID --jar /home/hadoop/lib/emr-s3distcp-1.0.jar --arg S3DistCp-OptionName1 --arg 'S3DistCp-OptionValue1' --arg S3DistCp-OptionName2 --arg 'S3DistCp-OptionValue2' --arg S3DistCp-OptionName3 --arg 'S3DistCp-OptionValue3' 

Example Specify an option value that contains a comma

In this example, --srcPattern is set to '.*[a-zA-Z,]+'. The inclusion of a comma in the --srcPattern regular expression requires the use of individual --arg parameters.

  • Linux, UNIX, and Mac OS X users:

    ./elastic-mapreduce --jobflow j-3GY8JC4179IOJ --jar \
    /home/hadoop/lib/emr-s3distcp-1.0.jar \
    --arg --s3Endpoint --arg 's3-eu-west-1.amazonaws.com' \
    --arg --src --arg 's3://myawsbucket/logs/j-3GY8JC4179IOJ/node/' \
    --arg --dest --arg 'hdfs:///output' \
    --arg --srcPattern --arg '.*[a-zA-Z,]+'
  • Windows users:

    ruby elastic-mapreduce --jobflow j-3GY8JC4179IOJ --jar /home/hadoop/lib/emr-s3distcp-1.0.jar --arg --s3Endpoint --arg 's3-eu-west-1.amazonaws.com' --arg --src --arg 's3://myawsbucket/logs/j-3GY8JC4179IOJ/node/' --arg --dest --arg 'hdfs:///output' --arg --srcPattern --arg '.*[a-zA-Z,]+'

Example Copy log files from Amazon S3 to HDFS

This example illustrates how to copy log files stored in an Amazon S3 bucket into HDFS. In this example the --srcPattern option is used to limit the data copied to the daemon logs.

  • Linux, UNIX, and Mac OS X users:

    ./elastic-mapreduce --jobflow j-3GY8JC4179IOJ --jar \
    /home/hadoop/lib/emr-s3distcp-1.0.jar \
    --args '--src,s3://myawsbucket/logs/j-3GY8JC4179IOJ/node/,\
    --dest,hdfs:///output,\
    --srcPattern,.*daemons.*-hadoop-.*'
  • Windows users:

    ruby elastic-mapreduce --jobflow j-3GY8JC4179IOJ --jar /home/hadoop/lib/emr-s3distcp-1.0.jar --args '--src,s3://myawsbucket/logs/j-3GY8JC4179IOJ/node/,--dest,hdfs:///output,--srcPattern,.*daemons.*-hadoop-.*'

Example Load Amazon CloudFront logs into HDFS

This example loads Amazon CloudFront logs into HDFS. In the process it changes the compression format from Gzip (the CloudFront default) to LZO. This is useful because data compressed using LZO can be split into multiple maps as it is decompressed, so you don't have to wait until the compression is complete, as you do with Gzip. This provides better performance when you analyze the data using Amazon EMR. This example also improves performance by using the regular expression specified in the --groupBy option to combine all of the logs for a given hour into a single file. Amazon EMR clusters are more efficient when processing a few, large, LZO-compressed files than when processing many, small, Gzip-compressed files.

In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information about the Amazon EMR CLI, see the Command Line Interface Reference for Amazon EMR.

  • Linux, UNIX, and Mac OS X users:

    ./elastic-mapreduce --jobflow j-3GY8JC4179IOK --jar \
    /home/hadoop/lib/emr-s3distcp-1.0.jar \
    --args '--src,s3://myawsbucket/cf,\
    --dest,hdfs:///local,\
    --groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*,\
    --targetSize,128,\
    --outputCodec,lzo,--deleteOnSuccess'
  • Windows users:

    ruby elastic-mapreduce --jobflow j-3GY8JC4179IOK --jar /home/hadoop/lib/emr-s3distcp-1.0.jar --args '--src,s3://myawsbucket/cf,--dest,hdfs:///local,--groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*,--targetSize,128,--outputCodec,lzo,--deleteOnSuccess'

Consider the case in which the preceding example is run over the following CloudFront log files.

s3://myawsbucket/cf/XABCD12345678.2012-02-23-01.HLUS3JKx.gz
s3://myawsbucket/cf/XABCD12345678.2012-02-23-01.I9CNAZrg.gz
s3://myawsbucket/cf/XABCD12345678.2012-02-23-02.YRRwERSA.gz
s3://myawsbucket/cf/XABCD12345678.2012-02-23-02.dshVLXFE.gz
s3://myawsbucket/cf/XABCD12345678.2012-02-23-02.LpLfuShd.gz
		

S3DistCp copies, concatenates, and compresses the files into the following two files, where the file name is determined by the match made by the regular expression.

hdfs:///local/2012-02-23-01.lzo
hdfs:///local/2012-02-23-02.lzo
		

To add a S3DistCp step to a cluster using the API

  • Send a request similar to the following example, where the arguments specified by Steps.member.1.HadoopJarStep.Args.member alternate the argument name and value, where the value is URL encoded.

    https://elasticmapreduce.amazonaws.com?
    JobFlowId=JobFlowID&
    Steps.member.1.Name="S3DistCp Step"&
    Steps.member.1.ActionOnFailure=CONTINUE&
    Steps.member.1.HadoopJarStep.Jar=/home/hadoop/lib/emr-s3distcp-1.0.jar&
    Steps.member.1.HadoopJarStep.Args.member.1=--src&
    Steps.member.1.HadoopJarStep.Args.member.2=s3%3A%2F%2Fbucket%2Fcf&	
    Steps.member.1.HadoopJarStep.Args.member.3=--dest&
    Steps.member.1.HadoopJarStep.Args.member.4=hdfs%3A%2F%2F%2Flocal&
    Steps.member.1.HadoopJarStep.Args.member.5=--srcPattern&
    Steps.member.1.HadoopJarStep.Args.member.6=.%2A%5Ba-zA-Z%5D%2B&
    Steps.member.1.HadoopJarStep.Args.member.7=--groupBy&
    Steps.member.1.HadoopJarStep.Args.member.8=.%2A%5Ba-zA-Z%5D%2B&					
    Operation=AddJobFlowSteps&
    AWSAccessKeyId=access-key-identifier&
    SignatureVersion=2&
    SignatureMethod=HmacSHA256&
    Timestamp=2011-12-28T21%3A51%3A51.000Z&
    Signature=calculated-value		
    				

S3DistCp Versions Supported in Amazon EMR

Amazon EMR supports the following versions of S3DistCp.

VersionDescriptionRelease Date
1.0.7Adds the --s3ServerSideEncryption option.2 May 2013
1.0.6Adds the --s3Endpoint option.6 August 2012
1.0.5Improves the ability to specify which version of S3DistCp to run.27 June 2012
1.0.4Improves the --deleteOnSuccess option.19 June 2012
1.0.3Adds support for the --numberFiles and --startingIndex options.12 June 2012
1.0.2Improves file naming when using groups.6 June 2012
1.0.1Initial release of S3DistCp.19 January 2012