Migrate data from an on-premises Hadoop environment to Amazon S3 using DistCp with AWS PrivateLink for Amazon S3
Created by Jason Owens (AWS), Andres Cantor (AWS), Jeff Klopfenstein (AWS), Bruno Rocha Oliveira (AWS), and Samuel Schmidt (AWS)
Environment: Production | Source: Hadoop | Target: Any |
R Type: Replatform | Workload: Open-source | Technologies: Storage & backup; Analytics |
AWS services: Amazon S3; Amazon EMR |
Summary
This pattern demonstrates how to migrate nearly any amount of data from an on-premises Apache Hadoop environment to the Amazon Web Services (AWS) Cloud by using the Apache open-source tool DistCp.
This guide provides instructions for using DistCp to migrate data to the AWS Cloud. DistCp is the most commonly used tool, but other migration tools are available. For example, you can use offline AWS tools such as AWS Snowball or AWS Snowmobile, or online AWS tools such as AWS Storage Gateway or AWS DataSync.
Prerequisites and limitations
Prerequisites
An active AWS account with a private network connection between your on-premises data center and the AWS Cloud
A Hadoop user with access to the migration data in the Hadoop Distributed File System (HDFS)
AWS Command Line Interface (AWS CLI), installed and configured
Permissions to put objects into an S3 bucket
Limitations
Virtual private cloud (VPC) limitations apply to AWS PrivateLink for Amazon S3. For more information, see Interface endpoint properties and limitations and AWS PrivateLink quotas (AWS PrivateLink documentation).
AWS PrivateLink for Amazon S3 doesn't support Federal Information Processing Standard (FIPS) endpoints, website endpoints, or legacy global endpoints.
Architecture
Source technology stack
Hadoop cluster with DistCp installed
Target technology stack
Amazon S3
Amazon VPC
Target architecture
The diagram shows how the Hadoop administrator uses DistCp to copy data from an on-premises environment through a private network connection, such as AWS Direct Connect, to Amazon S3 through an Amazon S3 interface endpoint.
Tools
AWS services
AWS Identity and Access Management (IAM) helps you securely manage access to your AWS resources by controlling who is authenticated and authorized to use them.
Amazon Simple Storage Service (Amazon S3) is a cloud-based object storage service that helps you store, protect, and retrieve any amount of data.
Amazon Virtual Private Cloud (Amazon VPC) helps you launch AWS resources into a virtual network that you’ve defined. This virtual network resembles a traditional network that you’d operate in your own data center, with the benefits of using the scalable infrastructure of AWS.
Other tools
Apache Hadoop DistCp (distributed copy) is a tool used for large inter-cluster and intra-cluster copying. DistCp uses Apache MapReduce for distribution, error handling and recovery, and reporting.
Epics
Task | Description | Skills required |
---|---|---|
Create an endpoint for AWS PrivateLink for Amazon S3. | In the Amazon VPC console, create an interface endpoint for the Amazon S3 service in the VPC that has connectivity to your on-premises network, and associate the endpoint with the appropriate subnets and security groups. | AWS administrator |
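If you prefer the AWS Command Line Interface, the following is a minimal sketch of the same step; the VPC ID, subnet IDs, security group ID, and Region are placeholders that you must replace with your own values.
```
# Create an interface endpoint for Amazon S3 (all IDs and the Region are placeholders).
aws ec2 create-vpc-endpoint \
  --vpc-endpoint-type Interface \
  --vpc-id vpc-0123456789abcdef0 \
  --service-name com.amazonaws.us-east-1.s3 \
  --subnet-ids subnet-0123456789abcdef0 subnet-0fedcba9876543210 \
  --security-group-ids sg-0123456789abcdef0
```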
Verify the endpoints and find the DNS entries. | In the Amazon VPC console, confirm that the endpoint status is Available, and note the DNS names that AWS PrivateLink created for the endpoint and the IP addresses shown on the Subnets tab. | AWS administrator |
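You can also retrieve this information with the AWS CLI; this sketch assumes a placeholder endpoint ID.
```
# Show the endpoint state and the DNS entries that AWS PrivateLink created for it.
aws ec2 describe-vpc-endpoints \
  --vpc-endpoint-ids vpce-0123456789abcdef0 \
  --query 'VpcEndpoints[].{State:State,DnsEntries:DnsEntries}'
```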
Check the firewall rules and routing configurations. | To confirm that your firewall rules are open and that your networking configuration is correctly set up, use Telnet to test the endpoint on port 443. For example:
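The following test is a sketch; substitute one of the DNS entries that you recorded for your endpoint and your own AWS Region.
```
# Test TCP connectivity to the interface endpoint on port 443.
telnet vpce-<endpoint ID>-<suffix>.s3.<your AWS Region>.vpce.amazonaws.com 443
```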
Note: If you use the Regional entry, a successful test shows that the DNS is alternating between the two IP addresses that you can see on the Subnets tab for your selected endpoint in the Amazon VPC console. | Network administrator, AWS administrator |
Configure the name resolution. | You must configure name resolution so that Hadoop can access the Amazon S3 interface endpoint. You can't use the endpoint name itself in the DistCp command. Instead, you must resolve the S3 Regional name (for example, s3.<your AWS Region>.amazonaws.com) to the private IP addresses of the interface endpoint, either on your DNS server or with static host entries on each Hadoop node, as shown in the sketch after this row. | AWS administrator |
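The following is a minimal sketch of the static host entry option for a single test node; the IP address and Region are placeholders for the values shown for your endpoint. For production clusters, configure your DNS server instead so that both endpoint IP addresses are used.
```
# Resolve the S3 Regional name to one of the interface endpoint's private IP addresses.
echo "10.0.1.10  s3.us-east-1.amazonaws.com" | sudo tee -a /etc/hosts
```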
Configure authentication for Amazon S3. | To authenticate to Amazon S3 through Hadoop, we recommend that you export temporary role credentials to the Hadoop environment. For more information, see Authenticating with S3 (Hadoop website). To use temporary credentials, add the temporary credentials to your credentials file, or run the following commands to export the credentials to your environment:
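The credential values below are placeholders taken from your temporary role credentials.
```
# Export the temporary role credentials so that the S3A connector can use them.
export AWS_ACCESS_KEY_ID=<temporary access key>
export AWS_SECRET_ACCESS_KEY=<temporary secret key>
export AWS_SESSION_TOKEN=<session token>
```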
If you have a traditional access key and secret key combination, run the following commands:
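As with the previous commands, the values are placeholders for your own keys.
```
# Export a long-term access key and secret key.
export AWS_ACCESS_KEY_ID=<access key>
export AWS_SECRET_ACCESS_KEY=<secret key>
```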
Note: If you use an access key and secret key combination, then change the credentials provider in the DistCp commands from org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider to org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider. | AWS administrator |
Transfer data by using DistCp. | To use DistCp to transfer data, run the following commands:
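The following command is a sketch that assumes the temporary credentials exported earlier, an S3 Regional name that resolves to the interface endpoint, and placeholder values for the HDFS source path, bucket name, and Region.
```
# Copy data from HDFS to Amazon S3 over the interface endpoint by using the S3A connector.
hadoop distcp \
  -Dfs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider \
  -Dfs.s3a.access.key="${AWS_ACCESS_KEY_ID}" \
  -Dfs.s3a.secret.key="${AWS_SECRET_ACCESS_KEY}" \
  -Dfs.s3a.session.token="${AWS_SESSION_TOKEN}" \
  -Dfs.s3a.path.style.access=true \
  -Dfs.s3a.connection.ssl.enabled=true \
  -Dfs.s3a.endpoint=s3.<your AWS Region>.amazonaws.com \
  hdfs:///<source path> s3a://<your S3 bucket>/<target prefix>/
```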
Note: The AWS Region of the endpoint isn't automatically discovered when you use the DistCp command with AWS PrivateLink for Amazon S3. Hadoop versions 3.3.2 and later resolve this issue by providing an option to explicitly set the AWS Region of the S3 bucket. For more information, see S3A to add option fs.s3a.endpoint.region to set AWS region (Hadoop website). For more information about additional S3A providers, see General S3A Client configuration (Hadoop website).
Note: To use the interface endpoint with S3A, you must create a DNS alias entry for the S3 Regional name (for example, s3.<your AWS Region>.amazonaws.com). If you have signature issues with Amazon S3, add an option to use Signature Version 4 (SigV4) signing:
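The following option is a sketch that assumes the S3A connector on your cluster uses the AWS SDK for Java v1, whose enableV4 system property forces SigV4 signing; append it to the DistCp command shown earlier.
```
# Force Signature Version 4 signing in the map tasks that DistCp launches.
-Dmapreduce.map.java.opts="-Dcom.amazonaws.services.s3.enableV4=true"
```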
| Migration engineer, AWS administrator |