COPY from Amazon EMR - Amazon Redshift

You can use the COPY command to load data in parallel from an Amazon EMR cluster configured to write text files to the cluster's Hadoop Distributed File System (HDFS) in the form of fixed-width files, character-delimited files, CSV files, JSON-formatted files, or Avro files.

Syntax

FROM 'emr://emr_cluster_id/hdfs_file_path' authorization [ optional_parameters ]

Example

The following example loads data from an Amazon EMR cluster.

copy sales from 'emr://j-SAMPLE2B500FC/myoutput/part-*' iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole';

Parameters

FROM

The source of the data to be loaded.

'emr://emr_cluster_id/hdfs_file_path'

The unique identifier for the Amazon EMR cluster and the HDFS file path that references the data files for the COPY command. The names of the HDFS data files themselves must not contain an asterisk (*) or a question mark (?), because COPY treats those characters in hdfs_file_path as wildcards.

Note

The Amazon EMR cluster must continue running until the COPY operation completes. If any of the HDFS data files are changed or deleted before the COPY operation completes, you might get unexpected results, or the COPY operation might fail.

You can use the wildcard characters asterisk (*) and question mark (?) as part of the hdfs_file_path argument to specify multiple files to be loaded. For example, 'emr://j-SAMPLE2B500FC/myoutput/part*' identifies the files part-0000, part-0001, and so on. If the file path doesn't contain wildcard characters, it is treated as a string literal. If you specify only a folder name, COPY attempts to load all files in the folder.

Important

If you use wildcard characters or use only the folder name, verify that no unwanted files will be loaded. For example, some processes might write a log file to the output folder.

For more information, see Loading data from Amazon EMR.
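
As an illustration of the path forms described above, the following statements are a minimal sketch that reuses the cluster ID and output folder from this page's example. The role ARN is a placeholder. The part-000? pattern assumes that the output files are named part-0000 through part-0009 and that the question mark matches exactly one character.

```sql
-- Folder name only: COPY attempts to load every file in myoutput/,
-- so confirm that the folder holds nothing but data files.
copy sales
from 'emr://j-SAMPLE2B500FC/myoutput/'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole';

-- Asterisk wildcard: load only the files whose names start with "part-".
copy sales
from 'emr://j-SAMPLE2B500FC/myoutput/part-*'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole';

-- Question mark wildcard (assumed to match a single character):
-- load part-0000 through part-0009 only.
copy sales
from 'emr://j-SAMPLE2B500FC/myoutput/part-000?'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole';
```

These statements are alternatives. Running more than one of them against the same output would load the same files more than once.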

authorization

The COPY command needs authorization to access data in another AWS resource, including in Amazon S3, Amazon EMR, Amazon DynamoDB, and Amazon EC2. You can provide that authorization by referencing an AWS Identity and Access Management (IAM) role that is attached to your cluster (role-based access control) or by providing the access credentials for a user (key-based access control). For increased security and flexibility, we recommend using IAM role-based access control. For more information, see Authorization parameters.
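
The two authorization styles might look like the following sketch. IAM_ROLE, ACCESS_KEY_ID, and SECRET_ACCESS_KEY are COPY authorization parameters. The role ARN and the key values shown here are placeholders, not working credentials.

```sql
-- Role-based access control (the recommended approach).
copy sales
from 'emr://j-SAMPLE2B500FC/myoutput/part-*'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole';

-- Key-based access control, using placeholder values for a user's keys.
copy sales
from 'emr://j-SAMPLE2B500FC/myoutput/part-*'
access_key_id '<access-key-id>'
secret_access_key '<secret-access-key>';
```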

Supported parameters

You can optionally specify the following parameters with COPY from Amazon EMR:
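
For instance, an optional parameter can follow the authorization clause. The sketch below assumes that the Amazon EMR job wrote tab-delimited text and that DELIMITER, one of the COPY data format parameters, is accepted for this source. The role ARN is a placeholder.

```sql
-- Load tab-delimited output files from the EMR cluster's HDFS.
-- Assumption: the files are plain text with one tab between fields.
copy sales
from 'emr://j-SAMPLE2B500FC/myoutput/part-*'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
delimiter '\t';
```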

Unsupported parameters

You can't use the following parameters with COPY from Amazon EMR:

  • ENCRYPTED

  • MANIFEST

  • REGION

  • READRATIO

  • SSH