AWS Prescriptive Guidance
Patterns

Replicate Amazon EMR data to Amazon S3 in another AWS Region using WANdisco Fusion

R Type: Replatform

Source: Data Management

Target: Data lake on AWS

Tags: amazon vpc, amazon ec2, amazon s3, apache, hadoop, hadoop distributed file system, hdfs, container, dataset, aws auto scaling, hybrid, mapr, amazon emr (optional), cdh, hortonworks data platform, hdp, isilon

Summary

This pattern provides guidance for migrating Amazon EMR data to Amazon Simple Storage Service (Amazon S3) in another AWS Region using WANdisco Fusion. The pattern incorporates Amazon Elastic Compute Cloud (Amazon EC2) compute capacity, Amazon S3 storage, and Amazon Virtual Private Cloud (Amazon VPC).

Migrating to Amazon EMR can significantly reduce the cost of maintaining physical infrastructure, such as over-provisioned or idle hardware. Amazon EMR is a managed service with a pay-as-you-go pricing model.

WANdisco Fusion can continuously replicate data between AWS Regions, whether the data is in an EMR cluster or in Amazon S3 in another AWS Region, and it guarantees strong consistency with the data residing in either Region.

The replication of Amazon EMR data to another AWS Region can provide low-cost disaster recovery storage. It can also provide the ability for burst-out processing if another EMR cluster is required for additional compute resources or location suitability.

Assumptions and Prerequisites

Prerequisites

Architecture

Source technology stack

  • Data lake on AWS with Amazon S3 and Amazon EMR

  • WANdisco Fusion 

  • WANdisco Fusion EMR client

Source architecture

  • An EMR cluster with the WANdisco EMR client installed

  • A virtual private cloud (VPC) configured with an Availability Zone

  • AWS inter-region VPC peering that enables network connectivity between the source and target AWS Regions

  • An AWS Identity and Access Management (IAM) role to control access to the resources created (this role is used to control WANdisco Fusion access to Amazon S3 for data synchronization)

  • AWS Auto Scaling to establish the initial configuration and connectivity between instances

  • An S3 bucket to store the content that is being synchronized by WANdisco Fusion

Target technology stack

  • Data lake on AWS with Amazon S3

  • WANdisco Fusion 

Target architecture

  • A VPC configured with an Availability Zone

  • AWS inter-region VPC peering that enables network connectivity between the source and target AWS Regions

  • An IAM role to control access to the resources created (this role is used to control WANdisco Fusion access to Amazon S3 for data synchronization)

  • AWS Auto Scaling to establish the initial configuration and connectivity between instances

  • An S3 bucket to store the content that is being synchronized by WANdisco Fusion
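
The IAM role in both architectures scopes WANdisco Fusion's access to the S3 bucket used for synchronization. As a minimal sketch, a policy document for such a role could be generated as shown here. The action list and the bucket name `example-fusion-bucket` are assumptions for illustration; this pattern does not list the permissions that WANdisco Fusion actually requires, so confirm them against the vendor documentation.

```python
import json


def fusion_bucket_policy(bucket_name: str) -> dict:
    """Build an IAM policy document limited to one replication bucket.

    The action list is an assumption, not a documented WANdisco requirement.
    """
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level actions apply to the bucket ARN itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": bucket_arn,
            },
            {
                # Object-level actions apply to the keys inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }


if __name__ == "__main__":
    # "example-fusion-bucket" is a hypothetical bucket name.
    print(json.dumps(fusion_bucket_policy("example-fusion-bucket"), indent=2))
```

Generating one policy per Region from the same function keeps the source and target roles symmetrical, which matches the mirrored source and target architectures.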

Source and target architecture

AWS recommends deploying workloads into private subnets for security purposes, and WANdisco Fusion on AWS is launched on EC2 instances within a VPC.

Tools Used

WANdisco Fusion - WANdisco Fusion is a software application that replicates data between Hadoop Compatible File System (HCFS) deployments, even when the clusters run different vendor distributions or versions of Apache Hadoop. It also supports moving data between Amazon EMR and Amazon S3. WANdisco Fusion provides:

  • A virtual file system for Apache Hadoop, compatible with all Apache Hadoop applications.

  • A single, virtual namespace that integrates storage from different types of Apache Hadoop deployments, including Cloudera Distribution including Apache Hadoop (CDH), Hortonworks Data Platform (HDP), Dell EMC Isilon, Amazon S3, EMR File System (EMRFS), and MapR.

  • Storage that can be globally distributed.

  • WAN replication using WANdisco’s LiveData technology, which delivers single-copy consistent Hadoop Distributed File System (HDFS) data, replicated between geographically dispersed data centers. 

Epics

Prepare your AWS account

Tasks

| Title | Description | Skills | Predecessor |
| --- | --- | --- | --- |
| Create an AWS account. | See https://aws.amazon.com for guidance. | General AWS | |
| Select AWS Regions. | Use the region selector in the navigation bar to choose the AWS Region where you want to deploy on AWS. Complete this task for both the source and destination AWS Regions. | General AWS | |
| Create key pairs. | Create an access/secret key pair in your preferred AWS Region, for both the source and destination AWS Regions. | General AWS | |
| If necessary, request a service limit increase. | Depending on requirements, the WANdisco Fusion server(s) may require a larger EC2 instance type. | General AWS | |

Configure the network and Amazon EC2

Tasks

| Title | Description | Skills | Predecessor |
| --- | --- | --- | --- |
| Configure the AWS Region, VPCs, Availability Zone, and subnets. | Configure the AWS infrastructure, including the Availability Zone, CIDR ranges, and subnets. Complete this task for both the source and destination AWS Regions. | General AWS | |
| Configure the key pair. | Configure the public/private SSH key pair, which allows you to connect securely to the instance(s) after they launch. Complete this task for both the source and destination AWS Regions. | General AWS | |
| Configure bastion host CIDR ranges (optional). | Configure the CIDR IP range that allows external SSH access to the bastion host instances. Complete this task for both the source and destination AWS Regions. | General AWS | |
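
The bastion host CIDR range determines which external addresses can reach the bastion hosts over SSH. The following sketch validates a candidate range and returns a rule in the shape of an Amazon EC2 security group `IpPermissions` entry. The function name, the refusal to accept `0.0.0.0/0`, and the example range are illustrative assumptions, not requirements stated in this pattern.

```python
import ipaddress


def bastion_ssh_ingress(cidr: str) -> dict:
    """Return an SSH ingress rule shaped like an EC2 IpPermissions entry."""
    # IPv4Network raises ValueError for malformed input or set host bits.
    network = ipaddress.IPv4Network(cidr, strict=True)
    if network.prefixlen == 0:
        # Assumption: opening SSH to every address is never intended here.
        raise ValueError("refusing to open SSH to 0.0.0.0/0")
    return {
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "IpRanges": [
            {"CidrIp": str(network), "Description": "Bastion SSH access"}
        ],
    }


if __name__ == "__main__":
    # 203.0.113.0/24 is a documentation-only range used as a placeholder.
    print(bastion_ssh_ingress("203.0.113.0/24"))
```

Because the task applies to both Regions, the same check can be run once for the source bastion range and once for the destination range.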

Configure the VPC

Tasks

| Title | Description | Skills | Predecessor |
| --- | --- | --- | --- |
| Determine the Availability Zone. | Determine the Availability Zone to use for the subnets in the Amazon VPC. Complete this task for both the source and destination AWS Regions. | General AWS | |
| Update inter-region VPC peering. | Update the route tables for a VPC peering connection between the VPCs in each AWS Region. | General AWS | |
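
VPC peering requires that the two VPCs have non-overlapping CIDR blocks, and each side needs a route that sends traffic for the other VPC's range through the peering connection. The following sketch checks for overlap and builds one route per side, using the key names that the Amazon EC2 `CreateRoute` API expects. The CIDR ranges and resource IDs are hypothetical placeholders.

```python
import ipaddress


def peering_routes(
    source_cidr: str,
    target_cidr: str,
    source_route_table_id: str,
    target_route_table_id: str,
    peering_connection_id: str,
) -> list:
    """Build the two route definitions needed for a VPC peering connection."""
    source = ipaddress.IPv4Network(source_cidr)
    target = ipaddress.IPv4Network(target_cidr)
    if source.overlaps(target):
        raise ValueError(f"{source} overlaps {target}; peering is not possible")
    return [
        {
            # Source Region: send traffic for the target VPC to the peering link.
            "RouteTableId": source_route_table_id,
            "DestinationCidrBlock": str(target),
            "VpcPeeringConnectionId": peering_connection_id,
        },
        {
            # Target Region: send traffic for the source VPC to the peering link.
            "RouteTableId": target_route_table_id,
            "DestinationCidrBlock": str(source),
            "VpcPeeringConnectionId": peering_connection_id,
        },
    ]


if __name__ == "__main__":
    # All CIDR ranges and IDs below are hypothetical.
    for route in peering_routes(
        "10.0.0.0/16", "10.1.0.0/16", "rtb-source", "rtb-target", "pcx-example"
    ):
        print(route)
```

Each dictionary is applied in its own Region, because a route table belongs to the Region of its VPC.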

Launch the AWS CloudFormation template

Tasks

| Title | Description | Skills | Predecessor |
| --- | --- | --- | --- |
| Access the Amazon S3 URL for the AWS CloudFormation template for WANdisco Fusion. | The URL for the WANdisco Fusion template is automatically generated when you choose to launch the WANdisco Fusion - BYOL software via AWS CloudFormation in AWS Marketplace. | General AWS | |

Configure the WANdisco Fusion EC2 instances

Tasks

| Title | Description | Skills | Predecessor |
| --- | --- | --- | --- |
| Provide the stack name for this deployment. | Provide the stack name to identify this stack. | General AWS | |
| Specify the EC2 instance type. | Specify the EC2 instance type for the WANdisco Fusion nodes. | General AWS; System Admin | |
| Provide the IDs for the VPC, the security group, and the VPC subnets. | Specify the VPC, security group, and VPC subnet IDs to be associated with WANdisco Fusion in AWS. | General AWS; System Admin | |
| Specify the S3 bucket. | Specify the name of the existing S3 bucket to use to replicate data. | General AWS; System Admin | |
| Allocate persistent storage for the WANdisco Fusion server instances. | Specify the Amazon Elastic Block Store (Amazon EBS) storage to allocate for each block device, in GB, with four devices per node. | General AWS; System Admin | |
| Specify the Amazon EC2 key pair name. | Specify the public/private key pair, which allows you to connect securely to your instances after they launch. | General AWS; System Admin | |
| Configure the cluster name. | Provide the name of the WANdisco Fusion cluster. | General AWS | |
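
The values collected in these tasks are supplied as AWS CloudFormation stack parameters. The sketch below converts a plain dictionary into the `ParameterKey`/`ParameterValue` list that CloudFormation APIs accept, and works out the total EBS capacity implied by the per-device size. Every parameter key and value shown is a hypothetical placeholder; the real key names are defined by the WANdisco Fusion template and are not given in this pattern.

```python
DEVICES_PER_NODE = 4  # the storage task allocates four block devices per node


def total_ebs_gb(gb_per_device: int, node_count: int) -> int:
    """Total EBS capacity allocated across all WANdisco Fusion nodes."""
    if gb_per_device <= 0 or node_count <= 0:
        raise ValueError("device size and node count must be positive")
    return gb_per_device * DEVICES_PER_NODE * node_count


def to_cfn_parameters(values: dict) -> list:
    """Convert a dict into the Parameters list shape used by CloudFormation."""
    return [
        {"ParameterKey": key, "ParameterValue": str(value)}
        for key, value in values.items()
    ]


if __name__ == "__main__":
    # Hypothetical keys and values; check the template for the real names.
    stack_inputs = {
        "InstanceType": "m4.xlarge",
        "S3Bucket": "example-fusion-bucket",
        "PersistentStorageGB": 100,
        "KeyName": "example-key-pair",
        "ClusterName": "example-fusion-cluster",
    }
    for parameter in to_cfn_parameters(stack_inputs):
        print(parameter)
    print("Total EBS (GB) for 2 nodes:", total_ebs_gb(100, 2))
```

Computing the total up front helps when deciding whether a service limit increase is needed before the stack is launched.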

Configure AWS WANdisco Fusion

Tasks

| Title | Description | Skills | Predecessor |
| --- | --- | --- | --- |
| Configure the number of WANdisco Fusion nodes. | The number of WANdisco Fusion servers (EC2 instances for the AWS WANdisco Fusion cluster) must be set to at least 2 to enable high availability. | General AWS | |
| Specify the zone name and the WANdisco Fusion administrator credentials. | Specify the name used to identify the zone in which the WANdisco Fusion server operates, and the name and password of the administrator user for WANdisco Fusion. | General AWS | |
| (Optional) Specify the topic ARN for publishing messages. | Provide the Amazon Resource Name (ARN) of the topic for emailing status notifications. | General AWS | |
| Specify the Amazon EMR version. | Specify the version of the existing Amazon EMR cluster in the local AWS Region. | General AWS | |
| (Optional) Specify the URL for the WANdisco Fusion license. | Specify the path to the S3 bucket (in the format s3://bucketname/path) or the URL of the license key for WANdisco Fusion. | General AWS | |
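
Two of these settings have simple rules that can be checked before launching the stack: the node count must be at least 2 for high availability, and the license location is either an `s3://bucketname/path` path or a URL. The following sketch validates both. The function names and the returned dictionary layout are illustrative assumptions.

```python
from urllib.parse import urlparse

MIN_NODES_FOR_HA = 2  # stated minimum for high availability


def check_node_count(count: int) -> int:
    """Return the count unchanged if it satisfies the high availability minimum."""
    if count < MIN_NODES_FOR_HA:
        raise ValueError(f"need at least {MIN_NODES_FOR_HA} nodes, got {count}")
    return count


def parse_license_location(location: str) -> dict:
    """Classify a license location as an S3 path or an HTTP(S) URL."""
    parsed = urlparse(location)
    if parsed.scheme == "s3":
        key = parsed.path.lstrip("/")
        if not parsed.netloc or not key:
            raise ValueError("expected the form s3://bucketname/path")
        return {"kind": "s3", "bucket": parsed.netloc, "key": key}
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return {"kind": "url", "url": location}
    raise ValueError(f"unsupported license location: {location!r}")


if __name__ == "__main__":
    print(check_node_count(2))
    print(parse_license_location("s3://bucketname/path"))
```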

Configure Amazon S3 security for WANdisco Fusion

Tasks

| Title | Description | Skills | Predecessor |
| --- | --- | --- | --- |
| (Optional) Determine the AWS KMS key. | Provide the ARN for the AWS Key Management Service (AWS KMS) encryption key ID. Leave this field blank to disable AWS KMS encryption. | General AWS | |
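
Because a blank value disables AWS KMS encryption, it helps to distinguish an intentionally empty field from a mistyped ARN. The sketch below treats blank input as "encryption disabled" and otherwise checks the value against a simplified key ARN pattern. The pattern is an approximation for illustration, and the account ID and key ID in the example are placeholders.

```python
import re
from typing import Optional

# Simplified pattern: arn:<partition>:kms:<region>:<12-digit account>:key/<key id>
KMS_KEY_ARN = re.compile(
    r"^arn:(aws|aws-cn|aws-us-gov):kms:[a-z0-9-]+:\d{12}:key/[A-Za-z0-9-]+$"
)


def kms_setting(value: str) -> Optional[str]:
    """Return None when encryption is disabled, or the validated key ARN."""
    value = value.strip()
    if not value:
        return None  # blank field: AWS KMS encryption is disabled
    if not KMS_KEY_ARN.match(value):
        raise ValueError(f"not a KMS key ARN: {value!r}")
    return value


if __name__ == "__main__":
    # Placeholder account ID and key ID.
    example = "arn:aws:kms:us-east-1:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab"
    print(kms_setting(example))
    print(kms_setting(""))
```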

Create an EMR cluster with the WANdisco Fusion EMR client in the source Region

Tasks

| Title | Description | Skills | Predecessor |
| --- | --- | --- | --- |
| Deploy the Amazon EMR client files. | Under the WANdisco Fusion settings for the Amazon EMR client, choose "Place files" so that the required files are deployed into the S3 bucket. | General AWS; System Admin | |
| Create the EMR cluster using advanced options. | Use the Amazon EMR (version 5.14.0) release under the "Create Cluster, Step 1" section with advanced options. | General AWS | |
| Edit the software settings. | Select to load JSON from Amazon S3, and enter the path to the "Amazon EMR_config_JSON" file in the S3 bucket. | General AWS | |
| Add the bootstrap action. | Select "Custom action" and enter the path for the emrFusionClientScript.sh file in the S3 bucket. | General AWS | |
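
These tasks use the Amazon EMR console, but the release and bootstrap action can also be expressed as data. The following sketch builds the `ReleaseLabel` and `BootstrapActions` fields in the shape used by the Amazon EMR `RunJobFlow` API. The S3 prefix and the bootstrap action name are hypothetical, and the software settings JSON from the previous task is deliberately left out because it is loaded separately from Amazon S3.

```python
EMR_RELEASE_LABEL = "emr-5.14.0"  # release named in the cluster creation task
CLIENT_SCRIPT_NAME = "emrFusionClientScript.sh"  # script named in the bootstrap task


def emr_fusion_client_settings(client_files_prefix: str) -> dict:
    """Build the release and bootstrap fields for an EMR cluster request.

    client_files_prefix is the S3 location where "Place files" put the
    client files; the caller supplies it because the pattern doesn't fix it.
    """
    prefix = client_files_prefix.rstrip("/")
    if not prefix.startswith("s3://"):
        raise ValueError("client files prefix must be an s3:// path")
    return {
        "ReleaseLabel": EMR_RELEASE_LABEL,
        "BootstrapActions": [
            {
                "Name": "Install WANdisco Fusion EMR client",  # hypothetical name
                "ScriptBootstrapAction": {
                    "Path": f"{prefix}/{CLIENT_SCRIPT_NAME}",
                    "Args": [],
                },
            }
        ],
    }


if __name__ == "__main__":
    # "s3://example-fusion-bucket/emr-client" is a hypothetical prefix.
    print(emr_fusion_client_settings("s3://example-fusion-bucket/emr-client"))
```

The returned dictionary is only a fragment; a complete cluster request also needs instance, role, and application settings that this pattern does not specify.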

Start WANdisco Fusion induction (from Amazon EMR in source Region to Amazon S3 in destination Region)

Tasks

| Title | Description | Skills | Predecessor |
| --- | --- | --- | --- |
| Configure the FQDN of the WANdisco Fusion node in the destination AWS Region. | Provide the fully qualified domain name of the WANdisco Fusion node in the destination AWS Region. | System Admin | |
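
Induction depends on the FQDN being entered correctly, so a quick syntax check before submitting it can save a failed attempt. The sketch below applies the usual hostname rules (labels of 1 to 63 letters, digits, or hyphens, no leading or trailing hyphen, and at least two labels). It checks syntax only, not whether the name resolves across the peering connection, and the example hostname is a hypothetical placeholder.

```python
import re

# One DNS label: 1-63 letters, digits, or hyphens, not starting or ending with "-".
_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")


def is_fqdn(name: str) -> bool:
    """Return True if name is syntactically a fully qualified domain name."""
    if name.endswith("."):
        name = name[:-1]  # a single trailing dot is allowed
    if not name or len(name) > 253:
        return False
    labels = name.split(".")
    return len(labels) >= 2 and all(_LABEL.match(label) for label in labels)


if __name__ == "__main__":
    # Hypothetical hostname for the destination WANdisco Fusion node.
    print(is_fqdn("ip-10-1-0-12.eu-west-1.compute.internal"))
```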

References and Help

References

AWS Marketplace links

Contact and help

Migration Pattern Library Support: aws-mpl@amazon.com