S3DistCp (s3-dist-cp)
Apache DistCp is an open-source tool you can use to copy large amounts of data.
S3DistCp is similar to DistCp, but optimized to work with AWS, particularly Amazon S3.
The command for S3DistCp in Amazon EMR version 4.0 and later is s3-dist-cp, which you add as a step in a cluster or run at the command line.
Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS, where it can be processed by subsequent steps in your Amazon EMR cluster.
You can also use S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon S3.
S3DistCp is also a more scalable and efficient way to copy large numbers of objects in parallel across buckets and across AWS accounts.
For specific commands that demonstrate the flexibility of S3DistCp in real-world scenarios, see Seven tips for using S3DistCp.
Like DistCp, S3DistCp uses MapReduce to copy in a distributed manner. It shares the copy, error handling, recovery, and reporting tasks across several servers. For more information about the Apache DistCp open source project, see the DistCp guide.
If S3DistCp is unable to copy some or all of the specified files, the cluster step fails and returns a non-zero error code. If this occurs, S3DistCp does not clean up partially copied files.
Important
S3DistCp does not support Amazon S3 bucket names that contain the underscore character.
S3DistCp does not support concatenation for Parquet files. Use PySpark instead. For more information, see Concatenating parquet files in Amazon EMR.
To avoid copy errors when using S3DistCp to copy a single file (instead of a directory) from Amazon S3 to HDFS, use Amazon EMR version 5.33.0 or later, or Amazon EMR version 6.3.0 or later.
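The underscore restriction above is easy to check before a step is submitted. The following is a minimal Python sketch, not part of S3DistCp; the function name and the sample locations are hypothetical and exist only for illustration.

```python
from urllib.parse import urlparse


def unsupported_bucket(uri):
    """Return True when an s3:// URI names a bucket that contains an underscore."""
    parsed = urlparse(uri)
    # Only Amazon S3 locations carry a bucket name; HDFS paths are left alone.
    return parsed.scheme == "s3" and "_" in parsed.netloc


# Hypothetical locations, invented for illustration.
for location in ("s3://amzn-s3-demo-bucket/logs/", "s3://my_logs/input/", "hdfs:///output"):
    status = "unsupported bucket name" if unsupported_bucket(location) else "ok"
    print(f"{location}: {status}")
```

Only the bucket portion of the URI is inspected, so an underscore elsewhere in the object key does not trigger the check.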
S3DistCp options
Though similar to DistCp, S3DistCp supports a different set of options to change how it copies and compresses data.
When you call S3DistCp, you can specify the options described in the following table. The options are added to the step using the arguments list. Examples of the S3DistCp arguments are shown in the following table.
Option | Description | Required |
---|---|---|
--src=LOCATION | Location of the data to copy. This can be either an HDFS or Amazon S3 location. Important: S3DistCp does not support Amazon S3 bucket names that contain the underscore character. Example: --src=s3://amzn-s3-demo-bucket/logs/j-3GYXXXXXX9IOJ/node/ | Yes |
--dest=LOCATION | Destination for the data. This can be either an HDFS or Amazon S3 location. Important: S3DistCp does not support Amazon S3 bucket names that contain the underscore character. Example: --dest=hdfs:///output | Yes |
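--srcPattern=PATTERN | A regular expression that filters the copy operation to a subset of the data at --src. If the regular expression argument contains special characters, such as an asterisk (*), enclose either the regular expression or the entire argument string in single quotes ('). Example: --srcPattern=.*daemons.*-hadoop-.* | No |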
--groupBy=PATTERN | A regular expression that causes S3DistCp to concatenate files that match the expression. Parentheses indicate how files should be grouped, with all of the items that match the parenthetical statement being combined into a single output file. If the regular expression does not include a parenthetical statement, the cluster fails on the S3DistCp step and returns an error. If the regular expression argument contains special characters, such as an asterisk (*), enclose either the regular expression or the entire argument string in single quotes ('). | No |
--targetSize=SIZE | The size, in mebibytes (MiB), of the files to create based on the --groupBy option. If the files concatenated by --groupBy are larger than this size, they are broken up into part files. | No |
--appendToLastFile | Specifies the behavior of S3DistCp when it copies files from Amazon S3 to HDFS that are already present at the destination: it appends new file data to the existing files. | No |
--outputCodec=CODEC | Specifies the compression codec to use for the copied files. This can take the values gzip, gz, lzo, snappy, or none. | No |
--s3ServerSideEncryption | Ensures that the target data is transferred using SSL and automatically encrypted in Amazon S3 using an AWS service-side key. When retrieving data using S3DistCp, the objects are automatically decrypted. If you attempt to copy an unencrypted object to an encryption-required Amazon S3 bucket, the operation fails. For more information, see Using data encryption. Example: --s3ServerSideEncryption | No |
--deleteOnSuccess | If the copy operation is successful, this option causes S3DistCp to delete the copied files from the source location. This is useful if you are copying output files, such as log files, from one location to another as a scheduled task, and you don't want to copy the same files twice. Example: --deleteOnSuccess | No |
--disableMultipartUpload | Disables the use of multipart upload. Example: --disableMultipartUpload | No |
--multipartUploadChunkSize=SIZE | The size, in MiB, of each part in an Amazon S3 multipart upload. S3DistCp uses multipart upload when it copies data larger than this size. | No |
--numberFiles | Prepends output files with sequential numbers. The count starts at 0 unless a different value is specified by --startingIndex. Example: --numberFiles | No |
--startingIndex=INDEX | Used with --numberFiles to specify the first number in the sequence. | No |
--outputManifest=FILENAME | Creates a text file, compressed with Gzip, that contains a list of all the files copied by S3DistCp. | No |
--previousManifest=PATH | Reads a manifest file that was created during a previous call to S3DistCp using the --outputManifest option. S3DistCp excludes the files listed in that manifest from the copy operation. | No |
--requirePreviousManifest | Requires a previous manifest created during a previous call to S3DistCp. If this is set to false, no error is generated when a previous manifest is not specified. The default is true. | No |
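--copyFromManifest | Reverses the behavior of --previousManifest: S3DistCp uses the specified manifest file as the list of files to copy, instead of a list of files to exclude from copying. | No |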
--s3Endpoint=ENDPOINT | Specifies the Amazon S3 endpoint to use when uploading a file. This option sets the endpoint for both the source and destination. If not set, S3DistCp uses a default endpoint. Example: --s3Endpoint=s3.amazonaws.com | No |
--storageClass=CLASS | The storage class to use when the destination is Amazon S3. Valid values are STANDARD and REDUCED_REDUNDANCY. If this option is not specified, S3DistCp tries to preserve the storage class. Example: --storageClass=STANDARD | No |
--srcPrefixesFile=PATH | A text file in Amazon S3 (s3://), HDFS (hdfs:///), or the local file system (file:/) that contains a list of source prefixes. | No |
In addition to the options above, S3DistCp implements the Tool interface, which means that it also supports the Hadoop generic options.
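The grouping rule described for the groupBy option in the table above can be illustrated offline. The following is a minimal Python sketch under stated assumptions: the pattern and the object keys are hypothetical, and Python's re module stands in for whatever regular expression engine S3DistCp uses, so edge-case behavior may differ.

```python
import re
from collections import defaultdict

# Hypothetical grouping pattern: the parenthesized group captures the
# date-and-hour portion of each log file name.
GROUP_BY = r".*/app-(\d{4}-\d{2}-\d{2}-\d{2})-\d+\.log"

# Hypothetical source keys, invented for illustration only.
KEYS = [
    "logs/app-2024-01-01-10-001.log",
    "logs/app-2024-01-01-10-002.log",
    "logs/app-2024-01-01-11-001.log",
    "logs/readme.txt",
]


def group_keys(pattern, keys):
    """Map each captured group value to the keys that share it."""
    compiled = re.compile(pattern)
    if compiled.groups == 0:
        # Mirrors the documented rule: a pattern without parentheses is an error.
        raise ValueError("groupBy pattern must contain a parenthesized group")
    groups = defaultdict(list)
    for key in keys:
        match = compiled.fullmatch(key)
        if match:  # in this sketch, non-matching keys belong to no group
            name = "".join(part or "" for part in match.groups())
            groups[name].append(key)
    return dict(groups)


groups = group_keys(GROUP_BY, KEYS)
for name, members in sorted(groups.items()):
    print(name, members)
```

Each dictionary entry corresponds to one set of files that would be combined into a single output file under the rule described in the table.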
Adding S3DistCp as a step in a cluster
You can call S3DistCp by adding it as a step in your cluster. Steps can be added to a cluster at launch or to a running cluster using the console, CLI, or API. The following examples demonstrate adding an S3DistCp step to a running cluster. For more information on adding steps to a cluster, see Submit work to a cluster in the Amazon EMR Management Guide.
To add an S3DistCp step to a running cluster using the AWS CLI
For more information on using Amazon EMR commands in the AWS CLI, see the AWS CLI Command Reference.
To add a step to a cluster that calls S3DistCp, pass the parameters that specify how S3DistCp should perform the copy operation as arguments.
The following example copies daemon logs from Amazon S3 to hdfs:///output. In the following command:
- --cluster-id specifies the cluster.
- Jar is the location of the S3DistCp JAR file. For an example of how to run a command on a cluster using command-runner.jar, see Submit a custom JAR step to run a script or command.
- Args is a comma-separated list of the option name-value pairs to pass in to S3DistCp. For a complete list of the available options, see S3DistCp options.
To add an S3DistCp copy step to a running cluster, put the following in a JSON file saved in Amazon S3 or your local file system as myStep.json for this example. Replace j-3GYXXXXXX9IOK with your cluster ID and replace amzn-s3-demo-bucket with your Amazon S3 bucket name.

    [
      {
        "Name": "S3DistCp step",
        "Args": ["s3-dist-cp", "--s3Endpoint=s3.amazonaws.com", "--src=s3://amzn-s3-demo-bucket/logs/j-3GYXXXXXX9IOJ/node/", "--dest=hdfs:///output", "--srcPattern=.*[a-zA-Z,]+"],
        "ActionOnFailure": "CONTINUE",
        "Type": "CUSTOM_JAR",
        "Jar": "command-runner.jar"
      }
    ]

    aws emr add-steps --cluster-id j-3GYXXXXXX9IOK --steps file://./myStep.json
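If you generate step definitions programmatically, the JSON document above can be produced rather than written by hand. The following is a minimal Python sketch using only the standard library. The helper name build_s3distcp_step is hypothetical; the field names and values are copied from the example in this section.

```python
import json


def build_s3distcp_step(src, dest, src_pattern=None, name="S3DistCp step"):
    """Build one step definition shaped like the myStep.json example."""
    args = [
        "s3-dist-cp",
        "--s3Endpoint=s3.amazonaws.com",
        f"--src={src}",
        f"--dest={dest}",
    ]
    if src_pattern is not None:
        # Optional filter, passed through unchanged.
        args.append(f"--srcPattern={src_pattern}")
    return {
        "Name": name,
        "Args": args,
        "ActionOnFailure": "CONTINUE",
        "Type": "CUSTOM_JAR",
        "Jar": "command-runner.jar",
    }


steps = [
    build_s3distcp_step(
        src="s3://amzn-s3-demo-bucket/logs/j-3GYXXXXXX9IOJ/node/",
        dest="hdfs:///output",
        src_pattern=".*[a-zA-Z,]+",
    )
]
step_json = json.dumps(steps, indent=2)
print(step_json)
```

Saving the printed output as myStep.json gives a file that can be passed to the aws emr add-steps command in this section.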
Example: Copy log files from Amazon S3 to HDFS
This example also illustrates how to copy log files stored in an Amazon S3 bucket into HDFS by adding a step to a running cluster. In this example, the --srcPattern option is used to limit the data copied to the daemon logs.
To copy log files from Amazon S3 to HDFS using the --srcPattern option, put the following in a JSON file saved in Amazon S3 or your local file system as myStep.json for this example. Replace j-3GYXXXXXX9IOK with your cluster ID and replace amzn-s3-demo-bucket with your Amazon S3 bucket name.

    [
      {
        "Name": "S3DistCp step",
        "Args": ["s3-dist-cp", "--s3Endpoint=s3.amazonaws.com", "--src=s3://amzn-s3-demo-bucket/logs/j-3GYXXXXXX9IOJ/node/", "--dest=hdfs:///output", "--srcPattern=.*daemons.*-hadoop-.*"],
        "ActionOnFailure": "CONTINUE",
        "Type": "CUSTOM_JAR",
        "Jar": "command-runner.jar"
      }
    ]
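Before submitting a step, it can help to preview which objects a source pattern would select. The following is a minimal Python sketch under stated assumptions: the object keys are hypothetical, the pattern is assumed to be applied to the whole path, and Python's re module stands in for the regular expression engine that S3DistCp uses.

```python
import re

# The pattern from the example step in this section.
SRC_PATTERN = r".*daemons.*-hadoop-.*"

# Hypothetical object paths, invented for illustration only.
CANDIDATES = [
    "s3://amzn-s3-demo-bucket/logs/j-3GYXXXXXX9IOJ/node/i-example/daemons/example-hadoop-daemon.log.gz",
    "s3://amzn-s3-demo-bucket/logs/j-3GYXXXXXX9IOJ/node/i-example/daemons/instance-state.log.gz",
    "s3://amzn-s3-demo-bucket/logs/j-3GYXXXXXX9IOJ/node/i-example/applications/example-hadoop-app.log.gz",
]

# Assumption: the expression must match the entire path, so fullmatch is used.
selected = [path for path in CANDIDATES if re.fullmatch(SRC_PATTERN, path)]

for path in CANDIDATES:
    marker = "copy" if path in selected else "skip"
    print(f"{marker}: {path}")
```

Only the first path contains both "daemons" and a later "-hadoop-" segment, so it is the only one this sketch marks for copying.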