Architecture Overview - Data Lake Solution

Deploying this solution builds the following environment in the AWS Cloud.

Figure 1: Data lake solution architecture on AWS

The solution uses AWS CloudFormation to deploy the infrastructure components that support this data lake reference implementation. At its core, the solution implements a data lake API, which uses Amazon API Gateway to provide access to data lake microservices (AWS Lambda functions). These microservices provide the business logic to create data packages, upload data, search for existing packages, add data of interest to a cart, generate data manifests, and perform administrative functions. They interact with Amazon S3, AWS Glue, Amazon Athena, Amazon DynamoDB, Amazon Elasticsearch Service (Amazon ES), and Amazon CloudWatch Logs to provide data storage, management, and audit functions.
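
As an illustration of the API Gateway to Lambda pattern described above, the following minimal sketch shows how a microservice might route a proxied request to business logic. It is not the solution's actual code: the `/packages` route, the handler names, and the in-memory `_PACKAGES` store (standing in for a table such as one in Amazon DynamoDB) are all hypothetical. Only the event fields (`httpMethod`, `path`, `body`) and the response shape (`statusCode`, `headers`, `body`) follow the API Gateway Lambda proxy integration format.

```python
import json

# Hypothetical in-memory store standing in for a persistent table.
# The real microservices persist data in AWS services; this is an assumption
# made only to keep the sketch self-contained.
_PACKAGES = {}


def create_package(body):
    """Create a data package record from the request body (hypothetical logic)."""
    package_id = str(len(_PACKAGES) + 1)
    _PACKAGES[package_id] = {"package_id": package_id, "name": body.get("name", "")}
    return 201, _PACKAGES[package_id]


def list_packages(_body):
    """Return all known packages (hypothetical logic)."""
    return 200, {"packages": list(_PACKAGES.values())}


# Hypothetical route table: (HTTP method, path) -> business-logic function.
ROUTES = {
    ("POST", "/packages"): create_package,
    ("GET", "/packages"): list_packages,
}


def handler(event, context=None):
    """Lambda entry point for an API Gateway proxy-integration event."""
    route = ROUTES.get((event.get("httpMethod"), event.get("path")))
    if route is None:
        status, payload = 404, {"message": "Not found"}
    else:
        # The proxy integration delivers the body as a JSON string or None.
        body = json.loads(event.get("body") or "{}")
        status, payload = route(body)
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }
```

Keeping the route table separate from the handler lets each operation be tested locally by passing a plain dictionary as the event, without deploying anything.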

The solution creates a data lake console, deploys it into an Amazon S3 bucket configured for static website hosting, and configures an Amazon CloudFront distribution as the console’s entry point. During initial configuration, the solution also creates a default administrator role and sends an access invitation to a customer-specified email address. If you deploy a federated stack, you must manually create the user and admin groups. For information on Active Directory, see Appendix A. For information on Okta, see Appendix B.

The solution uses an Amazon Cognito user pool to manage user access to the console and the data lake API. See Appendix C for detailed information on each of the solution's components.
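
This section does not describe how the data lake API checks a caller's identity, so the following is only a sketch of one common approach, not the solution's implementation. It assumes an API Gateway Cognito user pool authorizer with Lambda proxy integration, in which case the token claims are available under `requestContext.authorizer.claims` and group membership appears in the `cognito:groups` claim. The group name `admin` and both function names are hypothetical.

```python
def groups_from_event(event):
    """Extract Cognito group names from an API Gateway proxy event.

    Assumes a Cognito user pool authorizer populated the claims. The
    'cognito:groups' claim may arrive as a list or as a comma-separated
    string, so both forms are handled.
    """
    context = event.get("requestContext") or {}
    authorizer = context.get("authorizer") or {}
    claims = authorizer.get("claims") or {}
    raw = claims.get("cognito:groups", [])
    if isinstance(raw, str):
        return {name.strip() for name in raw.split(",") if name.strip()}
    return set(raw)


def is_admin(event, admin_group="admin"):
    """Return True if the caller belongs to the (hypothetical) admin group."""
    return admin_group in groups_from_event(event)
```

A microservice could call `is_admin(event)` before running an administrative function and return a 403 response when it is false.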