Prepare data with SQL in Studio - Amazon SageMaker

Prepare data with SQL in Studio

Amazon SageMaker Studio provides a built-in SQL extension. This extension allows data scientists to perform tasks such as sampling, exploratory analysis, and feature engineering directly within their JupyterLab notebooks. It leverages AWS Glue connections to maintain a centralized data source catalog. The catalog stores metadata about various data sources. Through this SQL environment, data scientists can browse data catalogs, explore their data, author complex SQL queries, and further process the results in Python.

This section walks through configuring the SQL extension in Studio. It describes the capabilities enabled by this SQL integration and provides instructions for running SQL queries in JupyterLab notebooks.

To enable SQL data analysis, administrators first need to configure AWS Glue connections to select data sources. These connections allow data scientists to seamlessly access authorized datasets from within JupyterLab. With access set up, JupyterLab users can:

  • View and browse pre-configured data sources.

  • Search, filter, and inspect database information elements such as tables, schemas, and columns.

  • Auto-generate the connection parameters to a data source.

  • Create complex SQL queries using the syntax-highlighting, auto-completion, and SQL formatting features of the extension's SQL editor.

  • Run SQL statements from JupyterLab notebook cells.

  • Retrieve the results of SQL queries as pandas DataFrames for further processing, visualization, and other machine learning tasks.

You can access the extension by choosing the SQL extension icon. icon in the left navigation pane of your JupyterLab application in Studio. Hovering over the icon displays its Data Discovery tool tip.

Important
  • The JupyterLab image in SageMaker Studio contains the SQL extension by default, starting with SageMaker Distribution 1.6. The extension works with Python and SparkMagic kernels only.

  • The extension's user interface for exploring connections and data is only available in JupyterLab within Studio. It is compatible with Amazon Redshift, Amazon Athena, and Snowflake.

  • If you are an administrator looking to configure connections to data sources for the SQL extension, follow these steps:

  • If you are a data scientist looking to browse and query your data sources using the SQL extension, ensure that your administrator has configured the connections to your data sources, then follow these steps:

    • Create a private space to launch your JupyterLab application in Studio using the SageMaker distribution image version 1.6 or higher.

    • If you are a user of the SageMaker distribution image version 1.6, load the SQL extension in a JupyterLab notebook by running %load_ext amazon_sagemaker_sql_magic in a notebook cell.

      For users of SageMaker distribution image versions 1.7 and later, no action is needed, the SQL extension loads automatically.

    • Familiarize yourself with the capabilities of the SQL extension in SQL extension features and usage.