Architecture Overview - Data Lake on AWS

Deploying this solution builds the following environment in the AWS Cloud.

Figure 1: Data Lake on AWS architecture

The solution uses AWS CloudFormation to deploy the infrastructure components supporting this data lake reference implementation. At its core, the solution implements a data lake API, which uses Amazon API Gateway to provide access to data lake microservices (AWS Lambda functions). These microservices provide the business logic to create data packages, upload data, search for existing packages, add data of interest to a cart, generate data manifests, and perform administrative functions. The microservices interact with Amazon S3, AWS Glue, Amazon Athena, Amazon DynamoDB, Amazon OpenSearch Service (successor to Amazon Elasticsearch Service), and Amazon CloudWatch Logs to provide data storage, management, and audit functions.
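The request flow described above (API Gateway forwarding a call to a Lambda microservice) can be illustrated with a minimal sketch. It assumes the Lambda proxy integration event shape (`httpMethod`, `path`, `body`) and uses in-memory dictionaries in place of the real data stores. The route paths, function names, and package ID format are hypothetical and are not taken from the solution's source code.

```python
import json

# Hypothetical in-memory stand-ins for the package store and the cart.
# The deployed solution persists this state in AWS services instead.
PACKAGES = {}
CART = []


def create_package(body):
    """Register a new data package and return its generated ID."""
    package_id = f"pkg-{len(PACKAGES) + 1}"
    PACKAGES[package_id] = {
        "name": body["name"],
        "description": body.get("description", ""),
    }
    return 201, {"package_id": package_id}


def search_packages(body):
    """Return IDs of packages whose name contains the search term."""
    term = body.get("term", "").lower()
    hits = [pid for pid, pkg in PACKAGES.items() if term in pkg["name"].lower()]
    return 200, {"results": hits}


def add_to_cart(body):
    """Add an existing package to the cart."""
    package_id = body["package_id"]
    if package_id not in PACKAGES:
        return 404, {"error": "unknown package"}
    CART.append(package_id)
    return 200, {"cart": list(CART)}


# Hypothetical route table: (HTTP method, path) -> microservice function.
ROUTES = {
    ("POST", "/packages"): create_package,
    ("POST", "/search"): search_packages,
    ("POST", "/cart"): add_to_cart,
}


def handler(event, context=None):
    """Lambda-style entry point that dispatches a proxy event to a route."""
    route = ROUTES.get((event.get("httpMethod"), event.get("path")))
    if route is None:
        status, payload = 404, {"error": "no such route"}
    else:
        body = json.loads(event.get("body") or "{}")
        status, payload = route(body)
    return {"statusCode": status, "body": json.dumps(payload)}
```

In this sketch a single handler dispatches every route to keep the example short. The solution itself describes separate microservices, so a closer analogue would be one handler per function behind its own API Gateway resource.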

The solution creates a data lake console, deploys it into an Amazon S3 bucket configured for static website hosting, and configures an Amazon CloudFront distribution as the console's entry point. During initial configuration, the solution also creates a default administrator role and sends an access invite to a customer-specified email address. If you deploy a federated stack, you must manually create user and admin groups. For information on Active Directory, refer to Appendix A. For information on Okta, refer to Appendix B.

The solution uses an Amazon Cognito user pool to manage user access to the console and the data lake API. Refer to Appendix C for detailed information on each of the solution's components.