Migrate data from an on-premises Hadoop environment to Amazon S3 using DistCp with AWS PrivateLink for Amazon S3
Created by Jason Owens (AWS), Andres Cantor (AWS), Jeff Klopfenstein (AWS), Bruno Rocha Oliveira (AWS), and Samuel Schmidt (AWS)
Environment: Production | Source: Hadoop | Target: Any |
R Type: Replatform | Workload: Open-source | Technologies: Storage & backup; Analytics |
AWS services: Amazon S3; Amazon EMR |
Summary
This pattern demonstrates how to migrate nearly any amount of data from an on-premises Apache Hadoop environment to the Amazon Web Services (AWS) Cloud by using the Apache open-source tool DistCp.
This guide provides instructions for using DistCp to migrate data to the AWS Cloud. DistCp is the most commonly used tool, but other migration tools are available. For example, you can use offline AWS tools such as AWS Snowball or AWS Snowmobile, or online AWS tools such as AWS Storage Gateway or AWS DataSync.
Prerequisites and limitations
Prerequisites
An active AWS account with a private network connection between your on-premises data center and the AWS Cloud
A Hadoop user with access to the migration data in the Hadoop Distributed File System (HDFS)
AWS Command Line Interface (AWS CLI), installed and configured
Permissions to put objects into an S3 bucket
Limitations
Virtual private cloud (VPC) limitations apply to AWS PrivateLink for Amazon S3. For more information, see Interface endpoint properties and limitations and AWS PrivateLink quotas (AWS PrivateLink documentation).
AWS PrivateLink for Amazon S3 doesn't support Federal Information Processing Standard (FIPS) endpoints, website endpoints, or legacy global endpoints.
Architecture
Source technology stack
Hadoop cluster with DistCp installed
Target technology stack
Amazon S3
Amazon VPC
Target architecture
The diagram shows how the Hadoop administrator uses DistCp to copy data from an on-premises environment through a private network connection, such as AWS Direct Connect, to Amazon S3 through an Amazon S3 interface endpoint.
Tools
AWS services
AWS Identity and Access Management (IAM) helps you securely manage access to your AWS resources by controlling who is authenticated and authorized to use them.
Amazon Simple Storage Service (Amazon S3) is a cloud-based object storage service that helps you store, protect, and retrieve any amount of data.
Amazon Virtual Private Cloud (Amazon VPC) helps you launch AWS resources into a virtual network that you’ve defined. This virtual network resembles a traditional network that you’d operate in your own data center, with the benefits of using the scalable infrastructure of AWS.
Other tools
Apache Hadoop DistCp (distributed copy) is a tool used for large inter-cluster and intra-cluster copying. DistCp uses Apache MapReduce for distribution, error handling and recovery, and reporting.
Epics
Task | Description | Skills required |
---|---|---|
Create an endpoint for AWS PrivateLink for Amazon S3. | In the Amazon VPC console, create an interface endpoint for the Amazon S3 service in the VPC that has connectivity to your on-premises network, and associate the endpoint with the appropriate subnets and security groups. | AWS administrator |
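If you prefer the AWS Command Line Interface, the following is a minimal sketch of the same step; the VPC ID, subnet IDs, security group ID, and Region are placeholders that you must replace with your own values.
```
# Create an interface endpoint for Amazon S3 (all IDs and the Region are placeholders).
aws ec2 create-vpc-endpoint \
  --vpc-endpoint-type Interface \
  --vpc-id vpc-0123456789abcdef0 \
  --service-name com.amazonaws.us-east-1.s3 \
  --subnet-ids subnet-0123456789abcdef0 subnet-0fedcba9876543210 \
  --security-group-ids sg-0123456789abcdef0
```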
Verify the endpoints and find the DNS entries. | In the Amazon VPC console, confirm that the endpoint status is Available, and note the DNS names that AWS PrivateLink created for the endpoint and the IP addresses shown on the Subnets tab. | AWS administrator |
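You can also retrieve this information with the AWS CLI; this sketch assumes a placeholder endpoint ID.
```
# Show the endpoint state and the DNS entries that AWS PrivateLink created for it.
aws ec2 describe-vpc-endpoints \
  --vpc-endpoint-ids vpce-0123456789abcdef0 \
  --query 'VpcEndpoints[].{State:State,DnsEntries:DnsEntries}'
```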
Check the firewall rules and routing configurations. | To confirm that your firewall rules are open and that your networking configuration is correctly set up, use Telnet to test the endpoint on port 443. For example:
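The following test is a sketch; substitute one of the DNS entries that you recorded for your endpoint and your own AWS Region.
```
# Test TCP connectivity to the interface endpoint on port 443.
telnet vpce-<endpoint ID>-<suffix>.s3.<your AWS Region>.vpce.amazonaws.com 443
```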
Note: If you use the Regional entry, a successful test shows that the DNS is alternating between the two IP addresses that you can see on the Subnets tab for your selected endpoint in the Amazon VPC console. | Network administrator, AWS administrator |
Configure the name resolution. | You must configure name resolution so that Hadoop can access the Amazon S3 interface endpoint. You can't use the endpoint name itself in the DistCp command. Instead, you must resolve the S3 Regional name (for example, s3.<your AWS Region>.amazonaws.com) to the private IP addresses of the interface endpoint, either on your DNS server or with static host entries on each Hadoop node, as shown in the sketch after this row. | AWS administrator |
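The following is a minimal sketch of the static host entry option for a single test node; the IP address and Region are placeholders for the values shown for your endpoint. For production clusters, configure your DNS server instead so that both endpoint IP addresses are used.
```
# Resolve the S3 Regional name to one of the interface endpoint's private IP addresses.
echo "10.0.1.10  s3.us-east-1.amazonaws.com" | sudo tee -a /etc/hosts
```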
Configure authentication for Amazon S3. | To authenticate to Amazon S3 through Hadoop, we recommend that you export temporary role credentials to the Hadoop environment. For more information, see Authenticating with S3 (Hadoop website). To use temporary credentials, add the temporary credentials to your credentials file, or run the following commands to export the credentials to your environment:
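The credential values below are placeholders taken from your temporary role credentials.
```
# Export the temporary role credentials so that the S3A connector can use them.
export AWS_ACCESS_KEY_ID=<temporary access key>
export AWS_SECRET_ACCESS_KEY=<temporary secret key>
export AWS_SESSION_TOKEN=<session token>
```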
If you have a traditional access key and secret key combination, run the following commands:
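As with the previous commands, the values are placeholders for your own keys.
```
# Export a long-term access key and secret key.
export AWS_ACCESS_KEY_ID=<access key>
export AWS_SECRET_ACCESS_KEY=<secret key>
```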
Note: If you use an access key and secret key combination, then change the credentials provider in the DistCp commands from org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider to org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider. | AWS administrator |
Transfer data by using DistCp. | To use DistCp to transfer data, run the following commands:
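The following command is a sketch that assumes the temporary credentials exported earlier, an S3 Regional name that resolves to the interface endpoint, and placeholder values for the HDFS source path, bucket name, and Region.
```
# Copy data from HDFS to Amazon S3 over the interface endpoint by using the S3A connector.
hadoop distcp \
  -Dfs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider \
  -Dfs.s3a.access.key="${AWS_ACCESS_KEY_ID}" \
  -Dfs.s3a.secret.key="${AWS_SECRET_ACCESS_KEY}" \
  -Dfs.s3a.session.token="${AWS_SESSION_TOKEN}" \
  -Dfs.s3a.path.style.access=true \
  -Dfs.s3a.connection.ssl.enabled=true \
  -Dfs.s3a.endpoint=s3.<your AWS Region>.amazonaws.com \
  hdfs:///<source path> s3a://<your S3 bucket>/<target prefix>/
```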
Note: The AWS Region of the endpoint isn't automatically discovered when you use the DistCp command with AWS PrivateLink for Amazon S3. Hadoop versions 3.3.2 and later resolve this issue by providing an option to explicitly set the AWS Region of the S3 bucket. For more information, see S3A to add option fs.s3a.endpoint.region to set AWS region (Hadoop website). For more information about additional S3A providers, see General S3A Client configuration (Hadoop website).
Note: To use the interface endpoint with S3A, you must create a DNS alias entry for the S3 Regional name (for example, s3.<your AWS Region>.amazonaws.com). If you have signature issues with Amazon S3, add an option to use Signature Version 4 (SigV4) signing:
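The following option is a sketch that assumes the S3A connector on your cluster uses the AWS SDK for Java v1, whose enableV4 system property forces SigV4 signing; append it to the DistCp command shown earlier.
```
# Force Signature Version 4 signing in the map tasks that DistCp launches.
-Dmapreduce.map.java.opts="-Dcom.amazonaws.services.s3.enableV4=true"
```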
| Migration engineer, AWS administrator |