Data Lake Solution

Appendix B: Solution Components

AWS KMS Key

The data lake AWS KMS key (alias: datalake) is created to provide encryption of all dataset objects that the solution owns and stores in Amazon S3. Additionally, the AWS KMS key is used to encrypt the secret access key in each user’s Amazon Cognito user pool record for API access to the data lake.
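
For illustration, the following sketch shows how an object can be written to Amazon S3 with server-side encryption under a KMS key referenced by alias, which is the general mechanism the solution relies on. The bucket and object names are hypothetical; only the datalake alias comes from the solution, and a key ID or ARN also works in place of the alias.

    import boto3

    s3 = boto3.client("s3")

    # Hypothetical bucket and object names for illustration; the solution
    # provisions its own default bucket at deployment time.
    s3.put_object(
        Bucket="my-data-lake-bucket",
        Key="packages/example-package/dataset.csv",
        Body=b"example,data\n",
        ServerSideEncryption="aws:kms",
        SSEKMSKeyId="alias/datalake",  # the data lake KMS key alias
    )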

Amazon CloudFront

The solution configures an Amazon CloudFront distribution to serve HTTPS requests for the data lake console.

Amazon S3

The solution uses a default Amazon S3 bucket to store datasets and manifest files associated with packages that users upload to the data lake. Additionally, the bucket stores the manifest files generated for a user when they check out their cart, which is a collection of packages. All access to this bucket (get and put actions from the package and manifest microservices) is controlled via signed URLs. All objects stored in this bucket are encrypted using the data lake AWS KMS key.
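
As a rough illustration of the signed-URL access pattern, the following sketch generates a time-limited presigned URL for a dataset object. The bucket and key names are hypothetical; in the solution, the package and manifest microservices generate these URLs on the caller's behalf.

    import boto3

    s3 = boto3.client("s3")

    # Hypothetical names; in practice the microservices sign URLs for you.
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "my-data-lake-bucket",
                "Key": "packages/example-package/dataset.csv"},
        ExpiresIn=3600,  # URL is valid for one hour
    )
    print(url)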

A second Amazon S3 bucket hosts the data lake console. This console is a static website that uses Amazon Cognito for user authentication. End users do not have direct access to the S3 endpoint. All access should be done via the Amazon CloudFront distribution.

AWS Glue and Amazon Athena

This solution automatically configures an AWS Glue crawler within each data package and schedules a daily scan to keep track of changes. The crawler scans your datasets, inspects portions of them to infer a data schema, and persists the output as one or more metadata tables defined in your AWS Glue Data Catalog.
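
The sketch below shows, in outline, how such a crawler can be defined with a daily schedule using the AWS Glue API. The crawler name, IAM role, database name, and S3 path are assumptions for illustration; the solution provisions its own equivalents for each package.

    import boto3

    glue = boto3.client("glue")

    # All names below are hypothetical stand-ins for what the solution creates.
    glue.create_crawler(
        Name="example-package-crawler",
        Role="AWSGlueServiceRole-datalake",  # assumed IAM role name
        DatabaseName="datalake",             # target Data Catalog database
        Targets={"S3Targets": [
            {"Path": "s3://my-data-lake-bucket/packages/example-package/"}
        ]},
        Schedule="cron(0 0 * * ? *)",        # run once a day at 00:00 UTC
    )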

Once created, this catalog provides a unified metadata repository across a variety of data sources and formats. It integrates with Amazon Athena and Amazon Redshift Spectrum, so you can interactively query and analyze data directly in your data lake, and with Amazon EMR, AWS Glue extract, transform, and load (ETL) jobs, and any application compatible with the Apache Hive metastore, so you can categorize, clean, enrich, and move your data.
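
As a minimal illustration of querying crawler-generated tables through the Data Catalog, the following sketch starts an Amazon Athena query. The database, table, and results bucket names are hypothetical.

    import boto3

    athena = boto3.client("athena")

    # Hypothetical database and table names inferred by the crawler.
    response = athena.start_query_execution(
        QueryString="SELECT * FROM example_package_table LIMIT 10",
        QueryExecutionContext={"Database": "datalake"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
    )
    print(response["QueryExecutionId"])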

Note

If you previously created databases and tables using Athena or Amazon Redshift Spectrum, you must upgrade Athena to use the AWS Glue Data Catalog. If you are new to Athena, you don't need to make any changes. For more information, see Upgrading to the AWS Glue Data Catalog.

Amazon Cognito User Pool

The data lake console is secured for user access with Amazon Cognito and provides an administrative interface for managing data lake users through integration with Amazon Cognito user pools. Only Administrators can create users and groups. Once users are created, the solution automatically sends each user an invitation to join the data lake. Note that if you use the federated template, all administrative tasks should be done on the AD server. When an Administrator creates a new user, they assign the user one of the following roles, with the associated permissions (a sketch of the underlying Amazon Cognito calls follows the list):

  • Member: The member role can perform non-administrative actions within the data lake. These actions include the following:

• View and search packages that they own or that are visible to a group they belong to

    • View and search all packages in the data lake

    • Add, remove, and generate manifests for packages in their cart

    • Create, update, and delete packages they created

    • Create and update metadata on the packages they created

    • Add and remove datasets from the packages they created

    • View their data lake profile and API access information

    • Generate a secret access key if an Administrator has granted them API access

  • Admin: The admin role has full access to the data lake. The admin role can perform the following actions in addition to the member role actions:

    • Create user invitations and assign users to one or more groups

• Create, update, and delete groups

    • Update, disable, enable, and delete data lake users

    • Assign, delete, and reassign users to groups

    • Create, revoke, enable, and disable a user's API access

    • Update data lake settings

    • Create, update, and delete governance settings
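
The sketch below outlines the kind of Amazon Cognito calls that underpin these administrative actions: creating a user, which triggers the emailed invitation, and adding the user to a group. The user pool ID, username, and group name are hypothetical; in the solution, these operations are performed through the data lake console rather than directly.

    import boto3

    cognito = boto3.client("cognito-idp")

    # Hypothetical pool ID and user details for illustration.
    cognito.admin_create_user(
        UserPoolId="us-east-1_EXAMPLE",
        Username="jdoe@example.com",
        UserAttributes=[
            {"Name": "email", "Value": "jdoe@example.com"},
            {"Name": "email_verified", "Value": "true"},
        ],
        DesiredDeliveryMediums=["EMAIL"],  # send the invitation by email
    )

    # Assign the new user to a group (groups govern package visibility).
    cognito.admin_add_user_to_group(
        UserPoolId="us-east-1_EXAMPLE",
        Username="jdoe@example.com",
        GroupName="analysts",  # hypothetical group name
    )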

Data Lake API and Microservices

The data lake API receives requests via HTTPS. When an API request is made, Amazon API Gateway uses a custom authorizer (an AWS Lambda function) to ensure that all requests are authorized.
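
The following is a minimal sketch of the general shape of an API Gateway custom (token) authorizer, assuming a hypothetical validate_token helper; the solution's actual authorizer applies its own request validation logic before returning an IAM policy.

    # A minimal sketch of an API Gateway custom (TOKEN) authorizer. The
    # validate_token helper is a hypothetical stand-in for the solution's
    # own request validation.
    def validate_token(token):
        # Placeholder check; the real authorizer verifies the caller's
        # data lake credentials.
        return bool(token)

    def handler(event, context):
        token = event.get("authorizationToken", "")
        effect = "Allow" if validate_token(token) else "Deny"
        # Return an IAM policy that allows or denies invoking the API.
        return {
            "principalId": "user",
            "policyDocument": {
                "Version": "2012-10-17",
                "Statement": [{
                    "Action": "execute-api:Invoke",
                    "Effect": effect,
                    "Resource": event["methodArn"],
                }],
            },
        }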

The data lake microservices are a series of AWS Lambda functions that provide the business logic and data access layer for all data lake operations. Each AWS Lambda function assumes an AWS IAM role with least privilege access (minimum permissions necessary) to perform its designated functions. The following sections outline each data lake microservice.

Admin Microservice

The data-lake-admin-service is an AWS Lambda function that processes data lake API requests sent to the /admin/* endpoints. The admin microservice handles all administrative services including user and group management, general settings, governance settings, API keys, and role authorization for all operations within the data lake.

Cart Microservice

The data-lake-cart-service is an AWS Lambda function that processes data lake API requests sent to the /cart/* endpoints. The cart microservice handles all cart operations including item lists, adding items, removing items, and generating manifests for user carts.

Manifest Microservice

The data-lake-manifest-service is an AWS Lambda function that manages import and export of manifest files. The manifest microservice uploads import manifest files, which allows existing Amazon S3 content to be bulk imported into a package. It also generates export manifest files for each package in a user's cart at checkout.

Package Microservice

The data-lake-package-service is an AWS Lambda function that processes data lake API requests sent to /packages/* endpoints. The package microservice handles all package operations including list, add package, remove package, update package, list metadata, add metadata, update metadata, list datasets, add dataset, remove dataset, process manifest, run AWS Glue on-demand crawler, list and access AWS Glue tables, and view dataset on Amazon Athena.

Search Microservice

The data-lake-search-service is an AWS Lambda function that processes data lake API requests sent to /search/* endpoints. The search microservice handles all search operations including query, index document, and remove indexed document.

Profile Microservice

The data-lake-profile-service is an AWS Lambda function that processes data lake API requests sent to /profile/* endpoints. The profile microservice handles all profile operations for data lake users, including get and generate secret access key.

Logging Microservice

The data-lake-logging-service is an AWS Lambda function that interfaces between the data lake microservices and Amazon CloudWatch Logs. Each microservice sends operations and access events to the logging service, which records the events in Amazon CloudWatch Logs. You can access this log (datalake/audit-log) in the CloudWatch console.
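
The audit log can also be read programmatically, as in the sketch below; the log group name comes from the solution, while the event limit is illustrative.

    import boto3

    logs = boto3.client("logs")

    # Fetch recent entries from the solution's audit log group.
    events = logs.filter_log_events(
        logGroupName="datalake/audit-log",
        limit=20,  # illustrative limit
    )
    for e in events["events"]:
        print(e["timestamp"], e["message"])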

Amazon DynamoDB Tables

The data lake solution uses Amazon DynamoDB tables to persist metadata for the data packages, settings, and user cart items. The following tables are provisioned during deployment and are accessed only via the data lake microservices (a read sketch follows the list):

  • data-lake-packages: persistent store for data package title and description, and a list of groups that can access the package

  • data-lake-metadata: persistent store for metadata tag values associated with packages

  • data-lake-datasets: persistent store for dataset pointers to Amazon S3 objects

  • data-lake-cart: persistent store for user cart items

  • data-lake-keys: persistent store for user access key ID references

  • data-lake-settings: persistent store for data lake configuration and governance settings
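
As a rough sketch of reading one of these tables directly (for example, while troubleshooting a deployment), the following looks up a package item. The key attribute name and value are hypothetical, since the solution defines the actual key schema at deployment; in normal operation only the microservices read and write these tables.

    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("data-lake-packages")

    # Hypothetical key attribute name; the real key schema is set by the
    # solution's deployment template.
    item = table.get_item(Key={"package_id": "example-package-id"})
    print(item.get("Item"))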

Amazon Elasticsearch Service Cluster

The solution uses an Amazon Elasticsearch Service cluster to index data lake package data for searching. The cluster is accessible only by the search microservice and an IP address that the customer designates during initial deployment.