Getting started with the lakehouse architecture of Amazon SageMaker

This guide helps you accomplish common tasks like finding relevant datasets, running SQL queries against your data warehouse and data lake simultaneously, collaborating with team members through data publishing, and maintaining data governance standards. Your administrator will provide the necessary access permissions and project roles to get started.

Prerequisites

Create a project

You can create a project from a project profile, which defines a template for projects in your domain. To use the lakehouse architecture, you must create your project with either the Data analytics and AI-ML model development project profile or the SQL analytics project profile. For more information about creating a project, see Create a project in the lakehouse architecture User Guide.

When using lakehouse architecture, you can create the following resources in the lakehouse:

  1. Databases in AWS Glue Data Catalog

    The lakehouse architecture is implemented on AWS Glue and AWS Lake Formation in your AWS account.

  2. A catalog to store data in Redshift Managed Storage (RMS) format

    You will create a catalog in RMS format. To view the catalog, open the AWS Lake Formation console at https://console.aws.amazon.com/lakeformation/. The catalog appears in the Catalogs list.

  3. Provisioning permissions

    You will create an IAM role when you create a project. Each project has a dedicated IAM role, which has permissions on the resources that are created from the project. The Amazon Resource Name (ARN) of this IAM role is shown in the Project details section of the Project overview page.

Browse data

You can browse data in lakehouse architecture by completing the following steps.

To browse data
  1. Choose a project to view the data.

  2. On the project page, from the left navigation, choose Data. This opens the Data explorer in the middle of the page.

    The Data explorer includes Lakehouse, Redshift, and S3.

  3. Expand Lakehouse to view catalogs, databases, and tables.
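
The same catalog, database, and table hierarchy can also be walked programmatically. The following is a minimal sketch, assuming the databases shown under Lakehouse live in your account's default AWS Glue Data Catalog and that your credentials are allowed to read them. The function name browse_catalog is hypothetical; the Glue operations it uses (the get_databases and get_tables paginators) are standard boto3 calls.

```python
from typing import Any, Dict, List


def browse_catalog(glue_client: Any) -> Dict[str, List[str]]:
    """Map each database name to the sorted names of its tables.

    glue_client is expected to be a boto3 Glue client. No CatalogId is
    passed, so Glue uses the default catalog of the calling account
    (an assumption of this sketch).
    """
    tree: Dict[str, List[str]] = {}

    # First level: every database in the catalog.
    for page in glue_client.get_paginator("get_databases").paginate():
        for database in page.get("DatabaseList", []):
            tree[database["Name"]] = []

    # Second level: the tables inside each database.
    table_paginator = glue_client.get_paginator("get_tables")
    for database_name in tree:
        for page in table_paginator.paginate(DatabaseName=database_name):
            tree[database_name].extend(
                table["Name"] for table in page.get("TableList", [])
            )
        tree[database_name].sort()

    return tree


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   for database, tables in browse_catalog(boto3.client("glue")).items():
#       print(database, tables)
```

Passing the client in as an argument, rather than creating it inside the function, keeps the sketch independent of any particular Region or credential setup.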

Upload data

You can upload data in CSV or JSON format to a catalog. To upload data, follow the instructions in Uploading data.

After the upload is complete, the table is listed within the database under AwsDataCatalog.
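
Before uploading a CSV file, it can help to check it locally for structural problems. The following is a minimal sketch using only the Python standard library. The checks it performs (non-empty and unique column names, a consistent field count per row) are assumptions about what makes a file safe to load, not documented requirements of the upload feature, and the function name check_csv is hypothetical.

```python
import csv
import io
from typing import List


def check_csv(text: str) -> List[str]:
    """Return a list of problems found in CSV text; an empty list means none."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]

    problems: List[str] = []
    header = rows[0]
    if any(not name.strip() for name in header):
        problems.append("header contains an empty column name")
    if len(set(header)) != len(header):
        problems.append("header contains duplicate column names")

    # Row numbers count parsed records, with the header as row 1.
    for row_number, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"row {row_number} has {len(row)} fields, expected {len(header)}"
            )
    return problems


# Usage with a hypothetical local file:
#   with open("orders.csv", newline="") as handle:
#       print(check_csv(handle.read()))
```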

Query data

You can query data using a supported query editor.

To query data
  1. Under Lakehouse, choose AwsDataCatalog at the top. Expand the catalog to view the list of databases. Choose a database.

  2. From the selected database, choose a table. Then choose the three-dot menu to the right of the table to view the supported query tools.

  3. To query with Athena, choose Query with Athena. This opens the Data explorer page where you can run SQL queries. For syntax details, see SQL reference for Athena.

  4. To query with Amazon Redshift instead, choose Query with Amazon Redshift. This opens the Data explorer page where you can run SQL queries. For more information, see Querying a database using the query editor v2.
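
The Athena path above can also be approximated outside the console. The following is a minimal sketch using the boto3 Athena client, assuming you have a workgroup with a query result location already configured and permission to query the table. The function name run_athena_query and its parameters are hypothetical; start_query_execution, get_query_execution, and get_query_results are standard Athena API operations.

```python
import time
from typing import Any, List

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def run_athena_query(
    athena_client: Any,
    sql: str,
    database: str,
    workgroup: str,
    catalog: str = "AwsDataCatalog",
    poll_seconds: float = 1.0,
) -> List[List[str]]:
    """Run one SQL statement and return the first page of result rows."""
    started = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Catalog": catalog, "Database": database},
        WorkGroup=workgroup,  # assumed to define the result output location
    )
    query_id = started["QueryExecutionId"]

    # Poll until Athena reports a terminal state.
    while True:
        execution = athena_client.get_query_execution(QueryExecutionId=query_id)
        status = execution["QueryExecution"]["Status"]
        if status["State"] in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)

    if status["State"] != "SUCCEEDED":
        reason = status.get("StateChangeReason", "no reason given")
        raise RuntimeError(f"Query {query_id} ended in {status['State']}: {reason}")

    results = athena_client.get_query_results(QueryExecutionId=query_id)
    # Each cell is a dict such as {"VarCharValue": "abc"}; NULL cells omit the key.
    return [
        [cell.get("VarCharValue", "") for cell in row["Data"]]
        for row in results["ResultSet"]["Rows"]
    ]


# Usage (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   rows = run_athena_query(
#       boto3.client("athena"), "SELECT * FROM my_table LIMIT 10",
#       database="my_database", workgroup="my_workgroup",
#   )
```

For a SELECT statement, the first returned row holds the column names. Only the first page of results is fetched; a larger result set would need the get_query_results paginator.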

To subscribe to an asset, see Request subscription to assets in Amazon SageMaker Unified Studio.

To publish data to the catalog from the lakehouse inventory, see Publishing data in lakehouse architecture.