AWS Prescriptive Guidance
Patterns

Migrate on-premises CDH clusters to Amazon S3 using WANdisco Fusion

R Type: Replatform

Source: Data Management

Target: Data lake on AWS

Tags: amazon vpc, amazon ec2, amazon s3, apache, hadoop, hadoop distributed file system, hdfs, container, dataset, aws auto scaling, hybrid, mapr, amazon emr (optional), cdh, hortonworks data platform, hdp, isilon

Summary

This pattern provides guidance for migrating on-premises CDH (Cloudera Distribution including Apache Hadoop) clusters to a data lake on Amazon Web Services (AWS) using WANdisco Fusion. This pattern incorporates Amazon Elastic Compute Cloud (Amazon EC2) compute capacity, Amazon Simple Storage Service (Amazon S3) storage, and Amazon Virtual Private Cloud (Amazon VPC). WANdisco Fusion enables you to migrate your clusters without downtime and without blocking client or application activity.

WANdisco Fusion continuously replicates data between your on-premises Apache Hadoop environment and Amazon S3, guaranteeing strong consistency between data residing on premises and in the cloud.

You can also customize this pattern to enable disaster recovery scenarios for your on-premises Hadoop cluster by provisioning an Amazon EMR cluster that references data replicated into Amazon S3. (Note that this pattern doesn't cover the deployment of Amazon EMR.) Amazon EMR can provide an effective, low-cost disaster recovery environment if the on-premises Hadoop cluster fails. Migrating from an on-premises Hadoop cluster to Amazon EMR can also reduce the cost of maintaining physical infrastructure, including over-provisioned or idle hardware.

Assumptions and Prerequisites

Prerequisites

  • An active AWS account

  • A subscription to the Amazon Machine Image (AMI) for WANdisco Fusion in AWS Marketplace; you can choose from a no-commitment, metered option or use the Bring Your Own License (BYOL) option for a 14-day trial, after which you can purchase a license by contacting WANdisco

  • WANdisco Fusion servers located in the on-premises CDH cluster require a license and installer (contact WANdisco for details)

  • Root access on the CDH edge nodes

  • Root access on the Cloudera Manager Server

  • Administrator access to CDH

  • Default options used during the installation of WANdisco Fusion on CDH edge nodes

Architecture

Source technology stack

  • WANdisco Fusion (Hadoop)

  • CDH cluster

Source architecture

  • CDH on-premises cluster

  • Two on-premises edge nodes deployed in the CDH cluster for WANdisco Fusion

Target technology stack

  • Data lake on AWS

  • WANdisco Fusion

  • Amazon S3

Target architecture

  • A virtual private cloud (VPC) configured with an Availability Zone

  • An AWS Direct Connect (DX) setup that enables network connectivity between the on-premises CDH edge nodes and the VPC

  • An AWS Identity and Access Management (IAM) role to control access to the resources created (this role is used to control WANdisco Fusion access to Amazon S3 for data synchronization)

  • AWS Auto Scaling to establish the initial configuration and connectivity between instances

  • An S3 bucket to store the content synchronized by WANdisco Fusion
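
The IAM role in the list above scopes WANdisco Fusion access to the replication bucket. The following is a minimal sketch of what such a policy document could look like, built with Python's standard library only. The bucket name and the exact set of S3 actions are assumptions for illustration; the pattern does not list the permissions that WANdisco Fusion requires, so confirm them against the WANdisco documentation or the role created by the deployment template.

```python
import json


def build_s3_sync_policy(bucket_name: str) -> dict:
    """Return an IAM policy document limited to one replication bucket.

    The action list is an assumed, illustrative minimum, not a list
    confirmed by WANdisco.
    """
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level permissions: list and locate the bucket.
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": bucket_arn,
            },
            {
                # Object-level permissions for reading and writing replicated data.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }


if __name__ == "__main__":
    # "example-fusion-bucket" is a hypothetical bucket name.
    print(json.dumps(build_s3_sync_policy("example-fusion-bucket"), indent=2))
```

Keeping the bucket-level and object-level statements separate makes it easier to narrow the object statement to a prefix later if only part of the bucket is replicated.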

Design considerations

AWS recommends deploying workloads into private subnets for security purposes, and WANdisco Fusion on AWS is launched on EC2 instances within a VPC. Your on-premises WANdisco Fusion components need to establish connections with these VPC-resident services. For more information, see the Amazon Virtual Private Cloud Connectivity Options whitepaper.

One option, and the one this pattern uses, is AWS Direct Connect, which connects on-premises data center nodes directly to AWS.


Tools Used

WANdisco Fusion - WANdisco Fusion is a software application that replicates data between Hadoop Compatible File System (HCFS) deployments (for example, Apache Hadoop), even when the clusters run different vendor distributions or versions of Apache Hadoop. WANdisco Fusion also supports moving data between Apache Hadoop and Amazon S3. WANdisco Fusion provides:

  • A virtual file system for Apache Hadoop, compatible with all Apache Hadoop applications.

  • A single, virtual namespace that integrates storage from different types of Apache Hadoop deployments, including CDH, Hortonworks Data Platform (HDP), Dell EMC Isilon, Amazon S3, EMR File System (EMRFS), and MapR.

  • Storage that can be globally distributed.

  • WAN replication using WANdisco’s LiveData technology, which delivers single-copy consistent Hadoop Distributed File System (HDFS) data, replicated between geographically dispersed data centers. 

Epics

Prepare your AWS account

Tasks

Title | Description | Skills | Predecessor
Create an AWS account. | See https://aws.amazon.com for guidance. | General AWS |
Select an AWS Region. | Use the region selector in the navigation bar to choose the AWS Region where you want to deploy the stack on AWS. | General AWS |
Create a key pair. | Create an access/secret key pair in your preferred AWS Region. | General AWS |
If necessary, request a service limit increase. | Depending on requirements, the WANdisco Fusion server(s) may require a larger EC2 instance type. | General AWS |

Configure the network and Amazon EC2

Tasks

Title | Description | Skills | Predecessor
Configure the AWS Region, VPCs, Availability Zone, and subnets. | Configure the AWS infrastructure, including the Availability Zone, CIDR ranges, and subnets. | General AWS |
Configure the key pair. | Configure the public/private SSH key pair, which allows you to connect securely to the instance(s) after launch. | General AWS |
Configure bastion host CIDR ranges (optional). | Configure the CIDR IP range that allows external SSH access to the bastion host instances. | General AWS |
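
Before entering CIDR ranges for the VPC, subnets, and bastion host, it can help to check them locally. The sketch below uses Python's standard ipaddress module. The function names and the example ranges are illustrative assumptions, not values from this pattern.

```python
import ipaddress
from typing import List


def subnets_fit_vpc(vpc_cidr: str, subnet_cidrs: List[str]) -> bool:
    """True if every subnet lies inside the VPC range and no two subnets overlap."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = [ipaddress.ip_network(cidr) for cidr in subnet_cidrs]
    if not all(subnet.subnet_of(vpc) for subnet in subnets):
        return False
    # Compare each pair of subnets once.
    for index, first in enumerate(subnets):
        for second in subnets[index + 1:]:
            if first.overlaps(second):
                return False
    return True


def is_open_to_world(cidr: str) -> bool:
    """True if the range admits every address, which is too broad for SSH access."""
    return ipaddress.ip_network(cidr).prefixlen == 0


if __name__ == "__main__":
    # Hypothetical ranges for illustration only.
    print(subnets_fit_vpc("10.0.0.0/16", ["10.0.1.0/24", "10.0.2.0/24"]))
    print(is_open_to_world("203.0.113.0/24"))
```

Note that ipaddress.ip_network raises ValueError for a range with host bits set (for example, 10.0.1.5/24), which catches a common typing mistake early.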

Configure the VPC and DX

Tasks

Title | Description | Skills | Predecessor
Configure the Availability Zone. | This is the Availability Zone to use for the subnets in the VPC. | General AWS |
Configure AWS Direct Connect (DX). | Set up DX between the on-premises CDH cluster WANdisco Fusion nodes and the VPC. | General AWS |

Download the WANdisco Fusion installer for CDH

Tasks

Title | Description | Skills | Predecessor
Download the installer file for CDH distributions. | Download the installer file and place it on the two edge nodes designated for WANdisco Fusion on the CDH cluster. | System Admin |

Complete the initial configuration of WANdisco Fusion for CDH

Tasks

Title | Description | Skills | Predecessor
Specify the CDH version of the on-premises cluster. | During the CLI portion of the WANdisco Fusion installation, select which version of CDH is being used. You can leave all other options at their default values. | System Admin |

Configure the WANdisco Fusion application for CDH

Tasks

Title | Description | Skills | Predecessor
Access the web URL of the WANdisco Fusion node. | To proceed with the UI portion of the WANdisco Fusion installation, use the fully qualified domain name (FQDN) of the WANdisco Fusion node to access the web URL on port 8083. | System Admin |
Upload the WANdisco Fusion license. | Use the local desktop path to the WANdisco Fusion license key to be used with the on-premises installation of WANdisco Fusion. | System Admin |
Configure the FQDN of the Fusion node network interface. | Provide the hostname of the Fusion server for installation that must be accessible to and from the EC2 instances. | System Admin |
Provide the zone name and node name. | Provide the name that identifies the operating zone for the WANdisco Fusion server, and the name that was given to the local node. | System Admin |
Confirm the URI selection. | Use the default setting HDFS URI with HDFS for live replication. | System Admin |
Provide the Cloudera Manager configuration details. | Provide the Cloudera Manager hostname, port, user name, and password. Check whether Secure Sockets Layer (SSL) is enabled on the CDH UI, and adjust the port accordingly. | System Admin |
Provide Kerberos details, if required. | Provide the configuration file path of the Key Distribution Center (KDC), the keytab file path, and the principal name for the WANdisco Fusion system user on the WANdisco Fusion node. | System Admin; Security Admin |
(Optional) Enable HTTP authentication and API authorization. | If Kerberos is enabled, you can enable HTTP authentication by providing the keytab file path for the HTTP principal. You can also enable API authorization, if desired. | System Admin; Security Admin |
(Optional) Provide a WANdisco Fusion administrator user name. | Provide a different user name, if desired. Note the generated password, or generate a new one. | System Admin |
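
Two of the tasks above depend on port numbers: the WANdisco Fusion UI on port 8083 and the Cloudera Manager port, which changes when SSL is enabled. The sketch below shows one way to check both from an edge node before starting the UI installation. Port 8083 comes from this pattern. The Cloudera Manager ports 7180 and 7183 are commonly cited defaults and are an assumption here, so confirm them against your own cluster. The host name is hypothetical.

```python
import socket

FUSION_UI_PORT = 8083  # stated in the task description above
CM_PORT_PLAIN = 7180   # assumed Cloudera Manager default without SSL
CM_PORT_TLS = 7183     # assumed Cloudera Manager default with SSL


def cloudera_manager_port(ssl_enabled: bool) -> int:
    """Pick the Cloudera Manager port based on whether SSL is enabled."""
    return CM_PORT_TLS if ssl_enabled else CM_PORT_PLAIN


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Attempt a TCP connection; True if the port accepts it."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


if __name__ == "__main__":
    # "fusion-node.example.internal" is a hypothetical FQDN.
    host = "fusion-node.example.internal"
    print("Fusion UI reachable:", port_is_reachable(host, FUSION_UI_PORT))
    print("Cloudera Manager port with SSL:", cloudera_manager_port(True))
```

A TCP check only confirms that something is listening on the port. It does not verify that the service behind it is WANdisco Fusion or that authentication will succeed.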

Install the WANdisco Fusion client on the CDH cluster

Tasks

Title | Description | Skills | Predecessor
Provide the location for parcel distribution on the Cloudera Manager Server. | Specify the file system location for the WANdisco Fusion client parcel on the Cloudera Manager Server. | System Admin |

Launch the AWS CloudFormation template

Tasks

Title | Description | Skills | Predecessor
Provide the Amazon S3 URL for the AWS CloudFormation template for WANdisco Fusion. | The URL for the WANdisco Fusion template is automatically generated when you choose to launch the WANdisco Fusion - BYOL software via AWS CloudFormation in AWS Marketplace. | General AWS |

Configure the WANdisco Fusion EC2 instances

Tasks

Title | Description | Skills | Predecessor
Provide the stack name for the WANdisco Fusion deployment. | Provide the stack name for identification. | General AWS |
Specify the EC2 instance type. | Specify the EC2 instance type for the WANdisco Fusion nodes. | General AWS; System Admin |
Provide the IDs for the VPC, the security group, and the VPC subnets. | Specify the VPC, security group, and VPC subnet IDs to be associated with WANdisco Fusion in AWS. | General AWS; System Admin |
Specify the S3 bucket. | Specify the name of the existing S3 bucket to use to replicate HDFS data from the CDH cluster. | General AWS; System Admin |
Allocate persistent storage for WANdisco Fusion server instances. | Specify the Amazon Elastic Block Store (Amazon EBS) storage to allocate for each block device, in GB, with four devices per node. | General AWS; System Admin |
Specify the Amazon EC2 key pair name. | Specify the public/private key pair, which allows you to connect securely to your instances after they launch. | General AWS; System Admin |
Provide the cluster name. | Specify the name of the WANdisco Fusion cluster. | General AWS |
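
If you prefer to record the stack settings above in a file instead of typing them into the console, they can be expressed in the ParameterKey/ParameterValue list format that AWS CloudFormation accepts. The sketch below is illustrative only: every parameter key and value is a hypothetical placeholder, because the real key names are defined by the template that AWS Marketplace generates. It also works out the total Amazon EBS capacity implied by the "four devices per node" rule.

```python
import json

DEVICES_PER_NODE = 4  # stated in the task description above


def total_ebs_gb(gb_per_device: int, node_count: int) -> int:
    """Total EBS capacity provisioned across all WANdisco Fusion nodes."""
    return gb_per_device * DEVICES_PER_NODE * node_count


def to_cfn_parameters(values: dict) -> list:
    """Convert a plain dict to the ParameterKey/ParameterValue list format."""
    return [
        {"ParameterKey": key, "ParameterValue": str(value)}
        for key, value in values.items()
    ]


if __name__ == "__main__":
    # Every key and value below is a hypothetical placeholder; use the
    # parameter names defined in the template that you launch.
    settings = {
        "ClusterName": "example-fusion-cluster",
        "InstanceType": "m5.xlarge",
        "S3Bucket": "example-fusion-bucket",
        "KeyName": "example-key-pair",
        "EbsVolumeSizeGb": 100,
    }
    print(json.dumps(to_cfn_parameters(settings), indent=2))
    print("Total EBS (GB) for 2 nodes:", total_ebs_gb(100, 2))
```

For example, 100 GB per device on a two-node cluster comes to 100 x 4 x 2 = 800 GB of EBS storage, which is worth checking against your service limits before launch.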

Configure WANdisco Fusion

Tasks

Title | Description | Skills | Predecessor
Set the number of WANdisco Fusion nodes. | The number of WANdisco Fusion servers (EC2 instances for the AWS WANdisco Fusion cluster) must be set to at least 2 to enable high availability. | General AWS |
Provide the zone name and the WANdisco Fusion administrator credentials. | Specify the name used to identify the zone in which the WANdisco Fusion server operates, and the name and password of the administrator user for WANdisco Fusion. | General AWS |
(Optional) Provide the ARN of the topic to publish messages to. | Provide the Amazon Resource Name (ARN) of the topic for emailing status notifications. | General AWS |
Identify the Amazon EMR version. | Specify the version of Amazon EMR if you decide to attach an Amazon EMR cluster to replicate data back to your on-premises server after deployment. | General AWS |
(Optional) Specify the URL for the WANdisco Fusion license. | Specify the path to the S3 bucket (in the format s3://bucketname/path) or the URL of the license key for WANdisco Fusion. | General AWS |
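
Two of the settings above have simple rules that can be checked before launching the stack: the node count must be at least 2 for high availability, and the license location is either an s3://bucketname/path value or a URL. The following sketch checks both with Python's standard library. The function names and the sample values are assumptions for illustration, and treating "URL" as http or https only is also an assumption.

```python
from urllib.parse import urlparse

MIN_NODES_FOR_HA = 2  # stated in the task description above


def check_node_count(count: int) -> int:
    """Return the count unchanged, or raise if it is too low for high availability."""
    if count < MIN_NODES_FOR_HA:
        raise ValueError(f"need at least {MIN_NODES_FOR_HA} nodes, got {count}")
    return count


def classify_license_location(location: str) -> tuple:
    """Return ("s3", bucket, key) for an S3 path or ("url", location, None) for a URL."""
    parsed = urlparse(location)
    if parsed.scheme == "s3":
        if not parsed.netloc:
            raise ValueError("S3 path is missing a bucket name")
        return ("s3", parsed.netloc, parsed.path.lstrip("/"))
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return ("url", location, None)
    raise ValueError(f"unrecognized license location: {location!r}")


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    print(check_node_count(2))
    print(classify_license_location("s3://bucketname/path/license.key"))
```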

Configure Amazon S3 security for WANdisco Fusion

Tasks

Title | Description | Skills | Predecessor
(Optional) Provide the AWS KMS key. | Provide the ARN for the AWS Key Management Service (AWS KMS) encryption key ID. Leave this field blank to disable AWS KMS encryption. | General AWS |
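
Because a blank field disables AWS KMS encryption, a mistyped or accidentally empty value silently changes the security posture of the bucket. The sketch below is a rough shape check for a KMS key ARN, written under the assumption that the field expects a key ARN (not an alias ARN). The regular expression is a loose, illustrative pattern and does not confirm that the key exists or that WANdisco Fusion can use it.

```python
import re
from typing import Optional

# Loose shape check: arn:<partition>:kms:<region>:<12-digit account>:key/<key id>
_KMS_KEY_ARN = re.compile(
    r"^arn:aws[a-z-]*:kms:[a-z0-9-]+:\d{12}:key/[A-Za-z0-9-]+$"
)


def normalize_kms_key_arn(value: str) -> Optional[str]:
    """Return None for a blank field (encryption disabled), else the checked ARN."""
    value = value.strip()
    if not value:
        return None
    if not _KMS_KEY_ARN.match(value):
        raise ValueError(f"not a KMS key ARN: {value!r}")
    return value


if __name__ == "__main__":
    # The account ID and key ID below are placeholder values.
    sample = "arn:aws:kms:us-east-1:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab"
    print(normalize_kms_key_arn(sample))
    print(normalize_kms_key_arn("   "))
```

Returning None for a blank value makes the "encryption disabled" case explicit, so a deployment script can log or reject it instead of passing it through unnoticed.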

Start WANdisco Fusion induction (CDH to Amazon S3)

Tasks

Title | Description | Skills | Predecessor
Specify the FQDN of the WANdisco Fusion node in the CDH cluster. | Provide the fully qualified domain name of the WANdisco Fusion node in the on-premises CDH cluster. | System Admin |

References and Help

References

AWS Marketplace links

Contact and help

Migration Pattern Library Support: aws-mpl@amazon.com