Migrate Hadoop data to Amazon S3 by using WANdisco LiveData Migrator - AWS Prescriptive Guidance

Created by Tony Velcich

Source: On-premises Hadoop cluster

Target: Amazon S3

R Type: Rehost

Environment: Production

Technologies: Data lakes; Big data; Hybrid cloud; Migration

Workload: All other workloads

AWS services: Amazon S3

Summary

This pattern describes the process for migrating Apache Hadoop data from a Hadoop Distributed File System (HDFS) to Amazon Simple Storage Service (Amazon S3). It uses WANdisco LiveData Migrator to automate the data migration process.

Prerequisites and limitations

Prerequisites

  • Hadoop cluster edge node where LiveData Migrator will be installed. The node should meet the following requirements:

    • Minimum specification: 4 CPUs, 16 GB RAM, 100 GB storage.

    • A minimum 2 Gbps network connection.

    • Port 8081 accessible on your edge node to access the WANdisco UI.

    • Java 1.8 64-bit.

    • Hadoop client libraries installed on the edge node.

    • Ability to authenticate as the HDFS superuser (for example, "hdfs").

    • If Kerberos is enabled on your Hadoop cluster, a valid keytab that contains a suitable principal for the HDFS superuser must be available on the edge node.

    • See the release notes for a list of supported operating systems.

  • An active AWS account with access to an S3 bucket.

  • An AWS Direct Connect link established between your on-premises Hadoop cluster (specifically the edge node) and AWS.

Product versions

  • LiveData Migrator 1.8.6

  • WANdisco UI (OneUI) 5.8.0

Architecture

Source technology stack

  • On-premises Hadoop cluster

Target technology stack

  • Amazon S3

Architecture

The following diagram shows the LiveData Migrator solution architecture.

The workflow consists of four primary components for data migration from on-premises HDFS to Amazon S3.

  • LiveData Migrator – Automates the migration of data from HDFS to Amazon S3, and resides on an edge node of the Hadoop cluster.

  • HDFS – A distributed file system that provides high-throughput access to application data.

  • Amazon S3 – An object storage service that offers scalability, data availability, security, and performance.

  • AWS Direct Connect – A service that establishes a dedicated network connection from your on-premises data centers to AWS.

Automation and scale

You will typically create multiple migrations so that you can select specific content from your source file system by path or directory. You can also migrate data to multiple, independent file systems at the same time by defining multiple migration resources.

Epics

Task: Sign in to your AWS account.

Description: Sign in to the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/.

Skills required: AWS experience

Task: Create an S3 bucket.

Description: If you don't already have an S3 bucket to use as the target storage, choose Create bucket on the Amazon S3 console, and specify a bucket name, AWS Region, and bucket settings for blocking public access. AWS and WANdisco recommend that you enable the block public access options for the S3 bucket, and set up the bucket access and user permission policies to meet your organization's requirements. An AWS example is provided at https://docs.aws.amazon.com/AmazonS3/latest/dev/example-walkthroughs-managing-access-example1.html.

Skills required: AWS experience

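The recommended block public access settings can be applied in the Amazon S3 console or, as a sketch, passed to the AWS CLI `aws s3api put-public-access-block` command as the following configuration (the bucket it applies to is supplied separately with `--bucket`):

```json
{
  "BlockPublicAcls": true,
  "IgnorePublicAcls": true,
  "BlockPublicPolicy": true,
  "RestrictPublicBuckets": true
}
```

Adjust these settings only if your organization's policies require public access for other workloads sharing the bucket.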
Task: Download the LiveData Migrator installer.

Description: Download the LiveData Migrator installer and upload it to the Hadoop edge node. You can download a free trial of LiveData Migrator at https://www2.wandisco.com/ldm-trial. You can also obtain access to LiveData Migrator from AWS Marketplace at https://aws.amazon.com/marketplace/pp/B07B8SZND9.

Skills required: Hadoop administrator, Application owner

Task: Install LiveData Migrator.

Description: Use the downloaded installer to install LiveData Migrator as the HDFS superuser on an edge node in your Hadoop cluster. See the "Additional information" section for the installation commands.

Skills required: Hadoop administrator, Application owner

Task: Check the status of LiveData Migrator and other services.

Description: Check the status of LiveData Migrator, Hive migrator, and the WANdisco UI by using the commands provided in the "Additional information" section.

Skills required: Hadoop administrator, Application owner

Task: Register your LiveData Migrator account.

Description: Log in to the WANdisco UI through a web browser on port 8081 (on the Hadoop edge node) and provide your details for registration. For example, if you are running LiveData Migrator on a host named myldmhost.example.com, the URL would be http://myldmhost.example.com:8081.

Skills required: Application owner

Task: Configure your source HDFS storage.

Description: Provide the configuration details needed for your source HDFS storage. This includes the fs.defaultFS value and a user-defined storage name. If Kerberos is enabled, provide the principal and keytab location for LiveData Migrator to use. If NameNode HA is enabled on the cluster, provide a path to the core-site.xml and hdfs-site.xml files on the edge node.

Skills required: Hadoop administrator, Application owner

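For reference, the fs.defaultFS value comes from the cluster's core-site.xml on the edge node. A minimal fragment looks like the following (the nameservice name "mycluster" is a placeholder; with NameNode HA, it must match the logical nameservice defined in hdfs-site.xml):

```xml
<!-- core-site.xml: the fs.defaultFS value that LiveData Migrator needs.
     With NameNode HA, this points at the logical nameservice rather
     than a single NameNode host. -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://mycluster</value>
</property>
```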
Task: Configure your target Amazon S3 storage.

Description: Add your target storage as the S3a type. Provide a user-defined storage name and the S3 bucket name. Enter org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider for the Credentials Provider option, and provide the AWS access and secret keys for the S3 bucket. Additional S3a properties will also be needed. For details, see the "S3a Properties" section in the LiveData Migrator documentation at https://docs.wandisco.com/live-data-migrator/docs/command-reference/#filesystem-add-s3a.

Skills required: AWS, Application owner

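The Credentials Provider and key fields correspond to the standard Hadoop S3A connector properties. As a sketch (the key values below are placeholders; never store real credentials in a world-readable file):

```xml
<!-- Standard Hadoop S3A properties matching the UI fields above.
     The access and secret key values are placeholders. -->
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>AKIAEXAMPLEKEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>EXAMPLESECRETKEY</value>
</property>
```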
Task: Add exclusions (if needed).

Description: If you want to exclude specific datasets from the migration, add exclusions for the source HDFS storage. These exclusions can be based on file size, file names (matched by regular expression patterns), and modification date.

Skills required: Hadoop administrator, Application owner

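To show how a file-name exclusion behaves, the following sketch applies a hypothetical regex pattern to a list of HDFS paths; it is an illustration only, not WANdisco's implementation:

```python
import re

# Illustration only (not WANdisco's implementation): how a regex-based
# file-name exclusion selects which paths are skipped during migration.
# The pattern below is a hypothetical example that skips Hadoop working files.
exclusion_pattern = re.compile(r".*\.(tmp|_COPYING_)$")

paths = [
    "/data/sales/part-00000",
    "/data/sales/part-00001.tmp",
    "/data/sales/_SUCCESS",
]

# Keep only the paths that do NOT match the exclusion pattern.
migrated = [p for p in paths if not exclusion_pattern.search(p)]
print(migrated)  # the .tmp working file is excluded
```

In LiveData Migrator itself, you define the pattern in the exclusion rule and associate it with the source storage; the matching logic stays on the tool's side.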
Task: Create and configure the migration.

Description: Create a migration in the dashboard of the WANdisco UI. Choose your source (HDFS) and target (the S3 bucket). Apply any exclusions that you defined in the previous step. Select either the Overwrite or the Skip if Size Match option. Create the migration when all fields are complete.

Skills required: Hadoop administrator, Application owner

Task: Start the migration.

Description: On the dashboard, select the migration that you created, and choose the option to start it. You can also start a migration automatically by choosing the auto-start option when you create the migration.

Skills required: Application owner

Task: Set a network bandwidth limit between the source and target.

Description: In the Storages list on the dashboard, select your source storage, and select Bandwidth Management in the Grouping list. Clear the unlimited option, define the maximum bandwidth limit and unit, and then choose Apply.

Skills required: Application owner, Networking

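When choosing a limit, it helps to convert the link's line rate (bits per second) into the byte-based units that bandwidth limits are typically expressed in. As a sketch, reserving half of the 2 Gbps minimum link from the prerequisites (the 50% share is an example, not a recommendation):

```python
# Convert a link's line rate in gigabits per second to megabytes per second,
# then take the share reserved for migration traffic. The 2 Gbps figure is
# the minimum link from the prerequisites; the 50% share is an example.
def migration_limit_mb_per_s(link_gbps: float, share: float) -> float:
    bits_per_s = link_gbps * 1_000_000_000
    bytes_per_s = bits_per_s / 8          # 8 bits per byte
    return bytes_per_s * share / 1_000_000

limit = migration_limit_mb_per_s(2.0, 0.5)
print(limit)  # → 125.0 (MB/s)
```

Leaving headroom below the line rate keeps other traffic over the AWS Direct Connect link responsive while the migration runs.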
Task: View migration information by using the WANdisco UI.

Description: Use the WANdisco UI to view license, bandwidth, storage, and migration information. The UI also provides a notification system, so you can receive notifications about errors, warnings, or important milestones in your usage.

Skills required: Hadoop administrator, Application owner

Task: Stop, resume, and delete migrations.

Description: You can stop a migration from transferring content to its target by placing it in the STOPPED state. Stopped migrations can be resumed. Migrations in the STOPPED state can also be deleted.

Skills required: Hadoop administrator, Application owner

Additional information

Installing LiveData Migrator

You can use the following commands to install LiveData Migrator, assuming that the installer is inside your working directory:

su - hdfs
chmod +x livedata-migrator.sh && sudo ./livedata-migrator.sh

Checking the status of LiveData Migrator and other services after installation

Use the following commands to check the status of LiveData Migrator, Hive migrator, and WANdisco UI:

service livedata-migrator status
service hivemigrator status
service livedata-ui status