AWS Prescriptive Guidance
Patterns

Deploy a hybrid data lake on AWS with WANdisco Fusion, Amazon S3, and Amazon Athena

R Type: Replatform

Source: Data: Hub-Lake-Warehouse

Target: Hadoop - Hybrid Data Lake on AWS with WANdisco

Tags: amazon vpc, amazon ec2, amazon s3, amazon athena, apache, hadoop, hadoop distributed file system, hdfs, docker, container, dataset, aws auto scaling, hybrid, mapr, amazon emr (optional), cdh, hortonworks data platform, hdp, isilon

Summary

This pattern provides guidance on integrating on-premises Apache Hadoop clusters with a data lake on the Amazon Web Services (AWS) Cloud by using WANdisco Fusion, Amazon Simple Storage Service (Amazon S3), and Amazon Athena. This hybrid data lake architecture combines on-premises components with AWS components and supports burst-out processing in the cloud. The data lake on AWS uses Amazon Elastic Compute Cloud (Amazon EC2) compute capacity and Amazon S3 storage, deployed in an AWS Auto Scaling group within a virtual private cloud (VPC).

You can deploy a Docker container or use your own on-premises Hadoop cluster. WANdisco Fusion continuously replicates data from the Docker container (or your on-premises Hadoop environment) to Amazon S3, ensuring strong consistency between the data residing on premises and the data in the cloud. You can then use Amazon Athena to analyze and view the replicated data, with capacity for burst-out processing scenarios.
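
For example, after WANdisco Fusion has replicated data into Amazon S3, you can query it with Athena. The following Python (Boto3) sketch runs a query and prints the results; the database, table, bucket, and Region names are hypothetical placeholders for the values in your own deployment.

    import time

    import boto3

    # Hypothetical names: replace the database, table, bucket, and Region
    # with the values from your own deployment.
    athena = boto3.client("athena", region_name="us-east-1")

    query = athena.start_query_execution(
        QueryString="SELECT * FROM replicated_logs LIMIT 10",
        QueryExecutionContext={"Database": "hybrid_data_lake"},
        ResultConfiguration={"OutputLocation": "s3://my-fusion-bucket/athena-results/"},
    )
    query_id = query["QueryExecutionId"]

    # Poll until the query finishes.
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    if state == "SUCCEEDED":
        rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
        for row in rows:
            print([col.get("VarCharValue") for col in row["Data"]])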

You can customize this pattern to enable a disaster recovery scenario for your on-premises Hadoop cluster. After deployment, provision an Amazon EMR cluster that references the data replicated into Amazon S3. This EMR cluster can serve as a low-cost disaster recovery environment if your on-premises Hadoop cluster fails. (Note that this pattern doesn't deploy Amazon EMR.)
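
As a sketch of that disaster recovery approach, the following Boto3 call provisions a minimal EMR cluster whose applications can read the replicated data directly from Amazon S3 through EMRFS. The instance types, counts, subnet ID, and bucket names are illustrative assumptions, not values this pattern prescribes.

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    # Instance types, counts, subnet, and bucket names are illustrative only.
    cluster = emr.run_job_flow(
        Name="hadoop-dr-cluster",
        ReleaseLabel="emr-5.29.0",
        Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}],
        Instances={
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,
            "Ec2SubnetId": "subnet-0123456789abcdef0",  # hypothetical subnet
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
        LogUri="s3://my-fusion-bucket/emr-logs/",  # hypothetical bucket
    )
    print("Cluster ID:", cluster["JobFlowId"])

    # Jobs on this cluster can read the replicated data directly from
    # s3://my-fusion-bucket/... through EMRFS, with no HDFS restore step.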

This pattern uses the AWS Quick Start developed by Sturdy in collaboration with WANdisco and AWS. Sturdy and WANdisco are APN Partners.

 

WANdisco Fusion on AWS

WANdisco Fusion is a software application that allows Apache Hadoop deployments to replicate Hadoop Distributed File System (HDFS) data between Hadoop clusters that are running different, even incompatible, versions of Hadoop. It is also possible to replicate data between different vendor distributions and versions of Hadoop, or between Hadoop and Amazon S3. WANdisco Fusion provides:

  • A virtual file system for Hadoop, compatible with all Hadoop applications.  

  • A single, virtual namespace that integrates storage from different types of Hadoop deployments, including Cloudera Distribution including Apache Hadoop (CDH), Hortonworks Data Platform (HDP), Dell EMC Isilon, Amazon S3, EMR File System (EMRFS), and MapR.

  • Storage that can be globally distributed.  

  • WAN replication using WANdisco’s LiveData technology, which delivers single-copy consistent HDFS data, replicated between geographically dispersed data centers.  

Assumptions and Prerequisites

Prerequisites

  • An active AWS account 

  • A subscription to the Amazon Machine Image (AMI) for WANdisco Fusion in AWS Marketplace. The WANdisco Fusion software is provided with the Bring Your Own License (BYOL) model. To continue using WANdisco Fusion beyond the 14-day trial period, you must purchase a license by contacting WANdisco.  

Architecture

Source technology stack

  • WANdisco Fusion (Apache Hadoop) 

Target technology stack

  • A virtual private cloud (VPC) configured with public subnets that span multiple Availability Zones for high availability.

  • An internet gateway to provide access to the internet.

  • An AWS Identity and Access Management (IAM) role to control access to the created resources. This role grants Athena access to Amazon S3 for data analysis, and grants WANdisco Fusion access to Amazon S3 for data synchronization. 

  • In the public subnets, WANdisco Fusion server instances in an AWS Auto Scaling group, functioning as a single clustered service. This pattern uses AWS Auto Scaling to establish the initial configuration and connectivity between instances in different Availability Zones. When you replicate data using WANdisco Fusion, moving the Fusion server to a new VM will require manual reconfiguration of some network settings.  

  • (Optional) An on-premises WANdisco server deployed in a Docker container, to demonstrate the synchronization from HDFS to the S3 bucket in the cloud.  

  • (Optional) Amazon Athena to query and analyze the data from the local WANdisco Fusion server, which is synchronized with Amazon S3.  

  • (Optional) An S3 bucket to store the content that is synchronized by WANdisco Fusion and the analysis output produced by Athena. (A validation sketch follows this list.)  
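
To confirm that Fusion is synchronizing content into the bucket, you can list the replicated objects. The following Boto3 sketch does this; the bucket name and prefix are hypothetical placeholders for the replication target you configured in WANdisco Fusion.

    import boto3

    s3 = boto3.client("s3")

    # Hypothetical bucket and prefix: use the replication target you
    # configured in WANdisco Fusion.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket="my-fusion-bucket", Prefix="replicated/"):
        for obj in page.get("Contents", []):
            print(obj["Key"], obj["Size"], obj["LastModified"])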

Target architecture

Design considerations

Deploying WANdisco Fusion into public subnets makes it easier for the Docker container to communicate with a peer Fusion server in the public network.

AWS recommends deploying workloads into private subnets for security purposes. Deploying into private subnets requires a good understanding of Amazon VPC and of how each service communicates to replicate the data. If you choose to deploy WANdisco Fusion in private subnets, deploy into an existing VPC so that you can specify the existing private network for the WANdisco Fusion cluster. 
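
If you are evaluating an existing VPC for a private-subnet deployment, one way to distinguish public from private subnets is to check whether a subnet's route table routes to an internet gateway. The following Boto3 sketch illustrates this; the VPC ID is a hypothetical placeholder, and subnets associated only with the VPC's main route table need a separate check of that table.

    import boto3

    ec2 = boto3.client("ec2")
    vpc_id = "vpc-0123456789abcdef0"  # hypothetical existing VPC

    # Collect the subnets whose route tables include a route to an
    # internet gateway; those subnets are effectively public.
    public_subnets = set()
    route_tables = ec2.describe_route_tables(
        Filters=[{"Name": "vpc-id", "Values": [vpc_id]}]
    )["RouteTables"]
    for table in route_tables:
        has_igw = any(
            route.get("GatewayId", "").startswith("igw-")
            for route in table.get("Routes", [])
        )
        if has_igw:
            for assoc in table.get("Associations", []):
                if assoc.get("SubnetId"):
                    public_subnets.add(assoc["SubnetId"])

    # Subnets with no explicit route table association inherit the VPC's
    # main route table, so check that table separately if it routes to an IGW.
    subnets = ec2.describe_subnets(
        Filters=[{"Name": "vpc-id", "Values": [vpc_id]}]
    )["Subnets"]
    for subnet in subnets:
        kind = "public" if subnet["SubnetId"] in public_subnets else "private"
        print(subnet["SubnetId"], subnet["AvailabilityZone"], kind)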

Tools Used

Quick Start: Hybrid Data Lake on AWS  

Epics

Assess and deploy the Quick Start

Tasks

Task: Launch the Quick Start, if it meets your needs.
Description: See the Quick Start deployment guide (see the References and Help section) for any pre-deployment instructions, and then launch the Quick Start from the link provided.

Task: Customize and launch the Quick Start, if you have additional requirements.
Description: Download the AWS CloudFormation templates from the GitHub repository (see the References and Help section), modify them to meet your needs, and launch the customized templates.

Task: Validate the deployment.
Description: See the Quick Start deployment guide for any post-deployment and testing instructions.
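
If you customize the templates, you can also launch them programmatically instead of through the console. The following Boto3 sketch creates a stack from a customized template; the stack name, template URL, and parameter keys are hypothetical and must match the template you actually modified.

    import boto3

    cfn = boto3.client("cloudformation", region_name="us-east-1")

    # The stack name, template URL, and parameter keys below are hypothetical;
    # use the values from the templates you downloaded and customized.
    stack = cfn.create_stack(
        StackName="hybrid-data-lake",
        TemplateURL="https://s3.amazonaws.com/my-bucket/hybrid-data-lake.template",
        Parameters=[
            {"ParameterKey": "KeyPairName", "ParameterValue": "my-key-pair"},
            {"ParameterKey": "AvailabilityZones",
             "ParameterValue": "us-east-1a,us-east-1b"},
        ],
        Capabilities=["CAPABILITY_IAM"],  # the templates create IAM resources
    )
    print("Stack ID:", stack["StackId"])

    # Block until creation completes, then run the post-deployment checks.
    cfn.get_waiter("stack_create_complete").wait(StackName="hybrid-data-lake")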

References and Help

References

WANdisco Fusion

AWS Quick Starts

Contact and help

Migration Pattern Library Support: aws-mpl@amazon.com