Data Lake Solution

Appendix A: Architectural Components


AWS KMS Key

The data lake AWS KMS key (alias: datalake) is created to provide encryption of all dataset objects that the solution owns and stores in Amazon S3. Additionally, the AWS KMS key is used to encrypt the secret access key in each user’s Amazon Cognito user pool record for API access to the data lake.

Amazon S3

The solution uses a default Amazon S3 bucket to store datasets and manifest files associated with packages that users upload to the data lake. The bucket also stores the manifest files generated for a user when they check out their cart (a collection of packages). All access to this bucket (get and put actions from the package and manifest microservices) is controlled via signed URLs. All objects stored in this bucket are encrypted using the data lake AWS KMS key.

A second Amazon S3 bucket hosts the data lake console. This console is a static website that uses Amazon Cognito for user authentication.

Amazon Cognito User Pool

The data lake console is secured for user access with Amazon Cognito and provides an administrative interface for managing data lake users through integration with Amazon Cognito user pools. Only Administrators can create user accounts and send invitations to those users. When an Administrator creates a new user, they assign the user one of the following roles, each with its associated permissions:

  • Member: The member role can perform non-administrative actions within the data lake. These actions include the following:

    • View and search all packages in the data lake

    • Add, remove, and generate manifests for packages in their cart

    • Create, update, and delete packages they created

    • Create and update metadata on the packages they created

    • Add and remove datasets from the packages they created

    • View their data lake profile and API access information

    • Generate a secret access key if an Administrator has granted them API access

  • Admin: The admin role has full access to the data lake. The admin role can perform the following actions in addition to the member role actions:

    • Create user invitations

    • Update, disable, enable, and delete data lake users

    • Create, revoke, enable, and disable a user's API access

    • Update data lake settings

    • Create, update, and delete governance settings
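The two roles above can be read as a base permission set (Member) plus an additional set (Admin). The following is a minimal sketch of that relationship, not the solution's actual implementation. The action names (for example `package:search`) and the `is_allowed` helper are hypothetical labels invented for illustration, and ownership rules such as "packages they created" are not modeled here.

```python
# Hypothetical action names; the solution's real identifiers may differ.
MEMBER_ACTIONS = frozenset({
    "package:view",
    "package:search",
    "cart:add",
    "cart:remove",
    "cart:generate_manifest",
    "package:manage_own",      # create, update, delete own packages
    "metadata:manage_own",     # create and update metadata on own packages
    "dataset:manage_own",      # add and remove datasets on own packages
    "profile:view",
    "profile:generate_secret_key",
})

ADMIN_ONLY_ACTIONS = frozenset({
    "user:invite",
    "user:manage",             # update, disable, enable, delete users
    "user:manage_api_access",  # create, revoke, enable, disable API access
    "settings:update",
    "governance:manage",       # create, update, delete governance settings
})

# The admin role can do everything a member can, plus the admin-only actions.
ROLE_ACTIONS = {
    "member": MEMBER_ACTIONS,
    "admin": MEMBER_ACTIONS | ADMIN_ONLY_ACTIONS,
}


def is_allowed(role: str, action: str) -> bool:
    """Return True if the given role includes the given action."""
    return action in ROLE_ACTIONS.get(role.lower(), frozenset())
```

For example, `is_allowed("member", "user:invite")` is false, while the same call with `"admin"` is true. An unknown role is granted nothing.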

Data Lake API and Microservices

The data lake API receives requests via HTTPS. When an API request is made, Amazon API Gateway leverages a custom authorizer (AWS Lambda function) to ensure that all requests are authorized.
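As an illustration of the custom authorizer pattern described above, the sketch below shows the general shape of a token-based Lambda authorizer: it receives the caller's token and the ARN of the requested method, and returns an IAM policy that allows or denies `execute-api:Invoke`. How the data lake solution actually validates credentials is not described here, so the `KNOWN_TOKENS` lookup is a hypothetical placeholder and the function names are assumptions.

```python
from typing import Any, Dict, Optional

# Placeholder credential store, for illustration only. A real authorizer
# would validate the token against an identity provider or key store.
KNOWN_TOKENS = {"demo-token": "demo-user"}


def lookup_user(token: str) -> Optional[str]:
    """Return the user ID for a token, or None if the token is unknown."""
    return KNOWN_TOKENS.get(token)


def build_policy(principal_id: str, effect: str, resource: str) -> Dict[str, Any]:
    """Build the response document API Gateway expects from an authorizer."""
    return {
        "principalId": principal_id,
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Action": "execute-api:Invoke",
                    "Effect": effect,
                    "Resource": resource,
                }
            ],
        },
    }


def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Allow the request if the token maps to a known user, otherwise deny."""
    user = lookup_user(event.get("authorizationToken", ""))
    effect = "Allow" if user else "Deny"
    return build_policy(user or "anonymous", effect, event["methodArn"])
```

Returning an explicit Deny policy for unknown tokens keeps the decision in one place. API Gateway then rejects the request before any microservice is invoked.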

The data lake microservices are a series of AWS Lambda functions that provide the business logic and data access layer for all data lake operations. Each AWS Lambda function assumes an AWS IAM role with least-privilege access (the minimum permissions necessary) to perform its designated functions. The following sections outline each data lake microservice.

Admin Microservice

The data-lake-admin-service is an AWS Lambda function that processes data lake API requests sent to the /admin/* endpoints. The admin microservice handles all administrative services including user management, general settings, governance settings, API keys, and role authorization for all operations within the data lake.

Cart Microservice

The data-lake-cart-service is an AWS Lambda function that processes data lake API requests sent to the /cart/* endpoints. The cart microservice handles all cart operations including item lists, adding items, removing items, and generating manifests for user carts.

Manifest Microservice

The data-lake-manifest-service is an AWS Lambda function that manages import and export of manifest files. The manifest microservice uploads import manifest files, which allows existing Amazon S3 content to be bulk imported into a package. It also generates export manifest files for each package in a user's cart at checkout.

Package Microservice

The data-lake-package-service is an AWS Lambda function that processes data lake API requests sent to /packages/* endpoints. The package microservice handles all package operations including list, add package, remove package, update package, list metadata, add metadata, update metadata, list datasets, add dataset, remove dataset, and process manifest.

Search Microservice

The data-lake-search-service is an AWS Lambda function that processes data lake API requests sent to /search/* endpoints. The search microservice handles all search operations including query, index document, and remove indexed document.

Profile Microservice

The data-lake-profile-service is an AWS Lambda function that processes data lake API requests sent to /profile/* endpoints. The profile microservice handles all profile operations for data lake users, including retrieving profile information and generating a secret access key.

Logging Microservice

The data-lake-logging-service is an AWS Lambda function that interfaces between the data lake microservices and Amazon CloudWatch Logs. Each microservice sends operations and access events to the logging service, which records the events in Amazon CloudWatch Logs. You can access this log (datalake/audit-log) in the CloudWatch console.
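Taken together, the endpoint prefixes named in the microservice sections above amount to a routing table from API path to Lambda function. In the deployed solution this mapping is handled by Amazon API Gateway rather than application code, so the sketch below is only a compact summary of it. The `ROUTES` table and the `resolve_service` helper are hypothetical. The manifest and logging microservices are omitted because no API endpoint is listed for them.

```python
from typing import Optional

# Endpoint prefix -> Lambda function name, as listed in the sections above.
ROUTES = {
    "/admin/": "data-lake-admin-service",
    "/cart/": "data-lake-cart-service",
    "/packages/": "data-lake-package-service",
    "/search/": "data-lake-search-service",
    "/profile/": "data-lake-profile-service",
}


def resolve_service(path: str) -> Optional[str]:
    """Return the microservice that handles a request path, or None."""
    for prefix, function_name in ROUTES.items():
        if path.startswith(prefix):
            return function_name
    return None
```

For example, `resolve_service("/cart/checkout")` returns `"data-lake-cart-service"`, while a path outside the listed prefixes returns `None`. The path `/cart/checkout` is itself an illustrative value, not a documented endpoint.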

Amazon DynamoDB Tables

The data lake solution uses Amazon DynamoDB tables to persist metadata for the data packages, settings, and user cart items. The following tables are provisioned during deployment and only accessed via the data lake microservices:

  • data-lake-packages: persistent store for data package title and description

  • data-lake-metadata: persistent store for metadata tag values associated with packages

  • data-lake-datasets: persistent store for dataset pointers to Amazon S3 objects

  • data-lake-cart: persistent store for user cart items

  • data-lake-keys: persistent store for user access key ID references

  • data-lake-settings: persistent store for data lake configuration and governance settings

Amazon Elasticsearch Service Cluster

The solution uses an Amazon Elasticsearch Service cluster to index data lake package data for searching. The cluster is accessible only by the search microservice and an IP address that the customer designates during initial deployment.