Loading Data from Amazon EMR
You can use the COPY command to load data in parallel from an Amazon EMR cluster that is configured to write text files to the cluster's Hadoop Distributed File System (HDFS). The files can be fixed-width files, character-delimited files, CSV files, or JSON-formatted files.
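As a minimal illustration of the character-delimited format, the following Python sketch writes a few pipe-delimited records like those an Amazon EMR job might emit as part files in HDFS. The file name and field values are hypothetical:

```python
# Write a small pipe-delimited text file resembling the part files
# an Amazon EMR job might write to HDFS (names and fields are hypothetical).
rows = [
    ("1", "widget", "9.99"),
    ("2", "gadget", "14.50"),
    ("3", "gizmo", "3.25"),
]

with open("part-00000", "w") as f:
    for row in rows:
        f.write("|".join(row) + "\n")

# Each line becomes one table row; '|' is the delimiter the COPY
# command would later be told about.
with open("part-00000") as f:
    lines = f.read().splitlines()
print(lines[0])  # 1|widget|9.99
```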
Process for Loading Data from Amazon EMR
This section walks you through the process of loading data from an Amazon EMR cluster. The following sections provide the details you need to accomplish each step.
The users who create the Amazon EMR cluster and run the Amazon Redshift COPY command must have the necessary permissions.
Configure the cluster to output text files to the Hadoop Distributed File System (HDFS). You will need the Amazon EMR cluster ID and the cluster's master public DNS (the endpoint for the Amazon EC2 instance that hosts the cluster).
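The cluster ID and master public DNS can be read from the response of the EMR DescribeCluster API (for example, via `aws emr describe-cluster` or boto3's `describe_cluster`). A sketch using a hand-built sample response in place of a live API call; the values shown are hypothetical:

```python
# Shape of an EMR DescribeCluster response (values are hypothetical);
# in practice this dict would come from
# boto3.client("emr").describe_cluster(ClusterId=...).
response = {
    "Cluster": {
        "Id": "j-1234567890ABC",
        "MasterPublicDnsName": "ec2-198-51-100-1.compute-1.amazonaws.com",
    }
}

# These two values are what the later COPY and SSH steps need.
cluster_id = response["Cluster"]["Id"]
master_dns = response["Cluster"]["MasterPublicDnsName"]
print(cluster_id, master_dns)
```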
Retrieve the Amazon Redshift cluster public key and the IP addresses of the cluster nodes. The public key enables the Amazon Redshift cluster nodes to establish SSH connections to the hosts. You will use the IP address of each cluster node to configure the host security groups to permit access from your Amazon Redshift cluster.
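The public key and node IP addresses appear in the response of the Redshift DescribeClusters API as `ClusterPublicKey` and `ClusterNodes`. A sketch that extracts both from a hand-built sample response (values are hypothetical):

```python
# Shape of a Redshift DescribeClusters response (values are hypothetical);
# in practice: boto3.client("redshift").describe_clusters(
#     ClusterIdentifier=...).
response = {
    "Clusters": [{
        "ClusterPublicKey": "ssh-rsa AAAAB3... Amazon-Redshift",
        "ClusterNodes": [
            {"NodeRole": "LEADER",
             "PublicIPAddress": "198.51.100.10",
             "PrivateIPAddress": "10.0.0.10"},
            {"NodeRole": "COMPUTE-0",
             "PublicIPAddress": "198.51.100.11",
             "PrivateIPAddress": "10.0.0.11"},
        ],
    }]
}

cluster = response["Clusters"][0]
public_key = cluster["ClusterPublicKey"]          # goes into authorized keys
node_ips = [n["PublicIPAddress"] for n in cluster["ClusterNodes"]]  # for ingress rules
print(node_ips)
```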
Add the Amazon Redshift cluster public key to each host's authorized keys file so that the host recognizes the Amazon Redshift cluster and accepts the SSH connection.
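One way to sketch this step: a small Python helper that appends the key line to an authorized keys file only if it is not already present, so repeating the step is harmless. The file path and key material are hypothetical:

```python
import os

def add_authorized_key(path, key_line):
    """Append key_line to the authorized keys file at path, skipping it
    if the exact line is already present (so the step is idempotent)."""
    existing = []
    if os.path.exists(path):
        with open(path) as f:
            existing = f.read().splitlines()
    if key_line not in existing:
        with open(path, "a") as f:
            f.write(key_line + "\n")

# Hypothetical key material; the real value comes from the cluster's
# ClusterPublicKey retrieved in the previous step.
add_authorized_key("authorized_keys", "ssh-rsa AAAAB3... Amazon-Redshift")
add_authorized_key("authorized_keys", "ssh-rsa AAAAB3... Amazon-Redshift")  # no duplicate added
```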
Modify the security groups of the Amazon EMR instances to add ingress rules that accept connections from the Amazon Redshift node IP addresses.
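The ingress rules can be expressed as the `IpPermissions` payload that the EC2 AuthorizeSecurityGroupIngress API accepts. A sketch that builds the payload allowing SSH (TCP port 22) from each Redshift node IP; the IP addresses are hypothetical:

```python
def ssh_ingress_permissions(ips):
    """Build the IpPermissions payload for
    ec2.authorize_security_group_ingress, allowing SSH (TCP 22)
    from each Amazon Redshift node IP as a /32 CIDR."""
    return [{
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "IpRanges": [{"CidrIp": f"{ip}/32"} for ip in ips],
    }]

# Hypothetical node IPs retrieved in the earlier step.
perms = ssh_ingress_permissions(["198.51.100.10", "198.51.100.11"])
print(perms[0]["IpRanges"])
```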
From an Amazon Redshift database, run the COPY command to load the data into an Amazon Redshift table.
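The COPY command references the EMR cluster through the `emr://` scheme, combining the cluster ID with an HDFS file path. A sketch that assembles such a statement; the table name, cluster ID, HDFS path, and IAM role ARN are all hypothetical placeholders:

```python
def build_copy_from_emr(table, cluster_id, hdfs_path, iam_role, delimiter="|"):
    """Assemble a Redshift COPY statement that reads character-delimited
    files from an EMR cluster's HDFS via the emr:// scheme
    (all identifiers here are hypothetical)."""
    return (
        f"COPY {table} "
        f"FROM 'emr://{cluster_id}/{hdfs_path}' "
        f"IAM_ROLE '{iam_role}' "
        f"DELIMITER '{delimiter}';"
    )

sql = build_copy_from_emr(
    "sales",
    "j-1234567890ABC",
    "myoutput/part-*",
    "arn:aws:iam::123456789012:role/MyRedshiftRole",
)
print(sql)
```

The statement would then be run from a SQL client connected to the Amazon Redshift database.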