Getting started with notebooks in AWS Glue Studio

When you start a notebook through AWS Glue Studio, all the configuration steps are done for you so that you can explore your data and start developing your job script after only a few seconds.

The following sections describe how to use AWS Glue Studio to create notebooks for ETL jobs.

Granting permissions for the IAM role

Setting up AWS Glue Studio is a prerequisite for using notebooks. For more information on setting up roles for AWS Glue Studio, see Review IAM permissions needed for the AWS Glue Studio user.

The role that you use with notebooks requires three things:

  • A trust relationship with AWS Glue for the sts:AssumeRole action and, if you want to tag sessions, the sts:TagSession action.

  • An IAM policy containing all the API operations for notebooks, AWS Glue, and interactive sessions.

  • An IAM policy for a pass role since the role needs to be able to pass itself from the notebook to interactive sessions.

Actions needed for a trust relationship with AWS Glue

Before you start a notebook session, you must add the sts:AssumeRole action to the trust relationship of the role that is passed to the notebook. If your session includes tags, you must also add the sts:TagSession action. Without these actions, the notebook session cannot start.
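The trust relationship described above can be sketched as a small Python helper that builds the policy document. This is a minimal illustration, not an official template: the helper name build_trust_policy is hypothetical, and the block assumes that glue.amazonaws.com is the service principal for AWS Glue.

```python
import json


def build_trust_policy(include_tag_session: bool = False) -> dict:
    """Return a trust policy document that lets AWS Glue assume the role."""
    actions = ["sts:AssumeRole"]
    if include_tag_session:
        # Only needed when the notebook session includes tags.
        actions.append("sts:TagSession")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Assumption: glue.amazonaws.com is the AWS Glue service principal.
                "Principal": {"Service": "glue.amazonaws.com"},
                "Action": actions,
            }
        ],
    }


# Print the document so it can be pasted into the role's trust relationship.
print(json.dumps(build_trust_policy(include_tag_session=True), indent=2))
```

Calling the helper without arguments produces a policy that contains only sts:AssumeRole, which is sufficient when sessions are not tagged.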

Policies containing the API operations for notebooks

The following sample policy describes the required AWS IAM permissions for notebooks.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:StartNotebook",
        "glue:TerminateNotebook",
        "glue:GlueNotebookRefreshCredentials",
        "glue:DeregisterDataPreview",
        "glue:GetNotebookInstanceStatus",
        "glue:GlueNotebookAuthorize"
      ],
      "Resource": "*"
    }
  ]
}

You can use the following IAM policies to allow access to specific resources:

  • AwsGlueSessionUserRestrictedNotebookServiceRole: Provides full access to all AWS Glue resources except for sessions. Allows users to create and use only the notebook sessions that are associated with the user. This policy also includes other permissions needed by AWS Glue to manage AWS Glue resources in other AWS services.

  • AwsGlueSessionUserRestrictedNotebookPolicy: Provides permissions that allow users to create and use only the notebook sessions that are associated with the user. This policy also includes permissions to explicitly allow users to pass a restricted AWS Glue session role.

IAM policy for a pass role

When you create a notebook with a role, that role is then passed to interactive sessions so that the same role can be used in both places. As such, the iam:PassRole permission needs to be part of the role's policy.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::090000000210:role/<role_name>"
    }
  ]
}
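To fill in the account ID and role name of the pass-role policy programmatically, a helper along the following lines could be used. This is a sketch only: the function name build_pass_role_policy and the role name my-notebook-role are hypothetical, and the account ID is the sample value from the policy above.

```python
import json


def build_pass_role_policy(account_id: str, role_name: str) -> dict:
    """Return a policy that lets a role pass itself to interactive sessions."""
    # Keep the account ID as a string so that leading zeros are preserved.
    if not (account_id.isascii() and account_id.isdigit() and len(account_id) == 12):
        raise ValueError("account_id must be a 12-digit string")
    role_arn = f"arn:aws:iam::{account_id}:role/{role_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "iam:PassRole",
                "Resource": role_arn,
            }
        ],
    }


# Hypothetical role name, for illustration only.
print(json.dumps(build_pass_role_policy("090000000210", "my-notebook-role"), indent=2))
```

Passing the account ID as a string rather than an integer matters here because AWS account IDs can start with zero.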

Creating an ETL job using notebooks in AWS Glue Studio

To start using notebooks in the AWS Glue Studio console

  1. Attach AWS Identity and Access Management policies to the AWS Glue Studio user and create an IAM role for your ETL job and notebook, as instructed in Set up IAM permissions for AWS Glue Studio.

  2. Configure additional IAM security for notebooks, as described in Granting permissions for the IAM role.

  3. Open the AWS Glue Studio console at https://console.aws.amazon.com/gluestudio/.

    Note

    Check that your browser does not block third-party cookies. Any browser that blocks third-party cookies, either by default or as a user-enabled setting, will prevent notebooks from launching. For more information on managing cookies, see your browser's documentation.

  4. Choose the Jobs link in the left-side navigation menu.

  5. Choose Jupyter notebook and then choose Create to start a new notebook session.

  6. On the Create job in Jupyter notebook page, provide the job name, and choose the IAM role to use. Choose Create job.

    After a short time period, the notebook editor appears.

  7. After you add code to a cell, you must execute the cell to initiate a session. There are multiple ways to execute the cell:

    • Press the play button.

    • Use a keyboard shortcut:

      • On macOS, press Command + Enter to run the cell.

      • On Windows, press Shift + Enter to run the cell.

    For information about writing code using a Jupyter notebook interface, see The Jupyter Notebook User Documentation.

  8. To test your script, run the entire script, or individual cells. Any command output will be displayed in the area beneath the cell.

  9. After you have finished developing your notebook, you can save the job and then run it. You can find the script in the Script tab. Any magics that you added to the notebook are stripped out and are not saved as part of the script of the generated AWS Glue job. AWS Glue Studio automatically adds a job.commit() call to the end of the script generated from the notebook contents.

    For more information about running jobs, see Start a job run.

Notebook editor components

The notebook editor interface has the following main sections.

  • Notebook interface (main panel) and toolbar

  • Job editing tabs

The notebook editor

The AWS Glue Studio notebook editor is based on the Jupyter Notebook Application. The AWS Glue Studio notebook interface is similar to that provided by Jupyter Notebooks, which is described in the section Notebook user interface. The notebook used by interactive sessions is a Jupyter Notebook.

Although the AWS Glue Studio notebook is similar to Jupyter Notebooks, it differs in a few key ways:

  • Currently, the AWS Glue Studio notebook cannot install extensions.

  • You cannot use multiple tabs; there is a 1:1 relationship between a job and a notebook.

  • The AWS Glue Studio notebook does not have the same top file menu that exists in Jupyter Notebooks.

  • Currently, the AWS Glue Studio notebook runs only with the AWS Glue kernel. Note that you cannot update the kernel on your own.

AWS Glue Studio job editing tabs

The tabs that you use to interact with the ETL job are at the top of the notebook page. They are similar to tabs that appear in the visual job editor of AWS Glue Studio, and they perform the same actions.

  • Notebook – Use this tab to view the job script using the notebook interface.

  • Job details – Configure the environment and properties for the job runs.

  • Runs – View information about previous runs of this job.

  • Schedules – Configure a schedule for running your job at specific times.

Saving your notebook and job script

You can save your notebook and the job script you are creating at any time. Simply choose the Save button in the upper right corner, the same as if you were using the visual or script editor.

When you choose Save, the job script and the notebook file are saved in their default locations:

  • By default, the job script is saved to the Amazon S3 location specified by the Script path property, which is on the Job details tab under Advanced properties. Job scripts are saved in a subfolder named Scripts.

  • By default, the notebook file (.ipynb) is saved to the Amazon S3 location specified by the Script path property, which is on the Job details tab under Advanced properties. Notebook files are saved in a subfolder named Notebooks.

Note

When you save the job, the job script contains only the code cells from the notebook. The Markdown cells and magics aren't included in the job script. However, the .ipynb file still contains any Markdown cells and magics.
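The effect described in this note can be approximated outside AWS Glue Studio with a short Python sketch. It is not the conversion that AWS Glue Studio itself performs; it only assumes the standard .ipynb JSON layout (a top-level cells list whose entries have cell_type and source fields) and treats any line that starts with % as a magic. The function name notebook_to_script is hypothetical.

```python
import json


def notebook_to_script(ipynb_text: str) -> str:
    """Keep the code cells of a notebook and drop Markdown cells and magics."""
    notebook = json.loads(ipynb_text)
    script_lines = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip Markdown and raw cells
        source = cell.get("source", [])
        # The source field is stored as a list of lines or as a single string.
        text = "".join(source) if isinstance(source, list) else source
        for line in text.splitlines():
            # Simplification: any line that starts with % is treated as a magic.
            if line.lstrip().startswith("%"):
                continue
            script_lines.append(line)
    return "\n".join(script_lines)


# A tiny in-memory notebook with one Markdown cell and two code cells.
example = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# My ETL job\n"]},
        {"cell_type": "code", "source": ["%idle_timeout 15\n", "%worker_type G.1X\n"]},
        {"cell_type": "code", "source": ["total = 1 + 2\n", "print(total)\n"]},
    ]
})
print(notebook_to_script(example))
```

For the sample notebook, only the two Python lines from the last cell remain. A Python continuation line that happens to begin with % would also be dropped by this simplified filter.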

After you save the job, you can then run the job using the script that you created in the notebook.

Managing notebook sessions

Notebooks in AWS Glue Studio are based on the interactive sessions feature of AWS Glue. There is a cost for using interactive sessions. To help manage your costs, you can monitor the sessions created for your account, and configure the default settings for all sessions.
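One way to monitor sessions from outside the console is the AWS Glue ListSessions API. The following sketch assumes that you use the AWS SDK for Python (Boto3) and that your credentials allow glue:ListSessions. The helper name list_active_sessions is hypothetical, and the set of statuses treated as inactive is a choice made for this example.

```python
def list_active_sessions(glue_client) -> list:
    """Return (session ID, status) pairs for sessions that may still incur cost."""
    # Assumption: sessions in these states are no longer running.
    inactive = {"STOPPED", "FAILED", "TIMEOUT"}
    active = []
    kwargs = {}
    while True:
        response = glue_client.list_sessions(**kwargs)
        for session in response.get("Sessions", []):
            status = session.get("Status")
            if status not in inactive:
                active.append((session["Id"], status))
        # Follow the pagination token until no more pages remain.
        token = response.get("NextToken")
        if not token:
            return active
        kwargs["NextToken"] = token


# Example usage (requires boto3 and AWS credentials):
# import boto3
# for session_id, status in list_active_sessions(boto3.client("glue")):
#     print(session_id, status)
```

The function takes the client as a parameter so that it can be exercised with a stub object when AWS access is not available.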

Change the default timeout for all notebook sessions

By default, the provisioned AWS Glue Studio notebook times out after 12 hours if the notebook was launched and no cells have been executed. There is no cost associated with the provisioned notebook, and this timeout is not configurable.

Executing a cell starts an interactive session. This session has a default timeout of 48 hours. You can configure this timeout by running the %idle_timeout magic before you execute a code cell.

To modify the default session timeout for notebooks in AWS Glue Studio

  1. In the notebook, enter the %idle_timeout magic in a cell and specify the timeout value in minutes.

  2. For example, %idle_timeout 15 changes the default timeout to 15 minutes. If the session is not used for 15 minutes, it is stopped automatically.

Installing additional Python modules

To install additional modules in your session using pip, use the %additional_python_modules magic:

%additional_python_modules awswrangler, s3://mybucket/mymodule.whl

All arguments to %additional_python_modules are passed to pip3 install -m <>.

To view a list of available Python modules, see Using Python Libraries with AWS Glue.

Changing AWS Glue Configuration

You can use magics to control AWS Glue job configuration values. To change a job configuration value, use the corresponding magic in the notebook.

AWS Glue supports various worker types. You can set the worker type with %worker_type. For example: %worker_type G.2X. The default is G.1X.

You can also specify the number of workers with %number_of_workers. For example, to specify 40 workers: %number_of_workers 40.

For more information, see Defining Job Properties.
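The magics described on this page are typically placed together in the first cell of a notebook. The following Python sketch assembles such a configuration cell as text so that the same settings can be reused across notebooks. The helper build_config_cell is hypothetical and is not part of AWS Glue; it only formats the magics shown above and does not check the values against what AWS Glue accepts.

```python
def build_config_cell(idle_timeout_minutes=None, worker_type=None,
                      number_of_workers=None, additional_python_modules=None) -> str:
    """Return the text of a notebook cell that contains configuration magics."""
    lines = []
    if idle_timeout_minutes is not None:
        lines.append(f"%idle_timeout {idle_timeout_minutes}")
    if worker_type is not None:
        lines.append(f"%worker_type {worker_type}")
    if number_of_workers is not None:
        lines.append(f"%number_of_workers {number_of_workers}")
    if additional_python_modules:
        # Modules are comma-separated, as in the example on this page.
        lines.append("%additional_python_modules " + ", ".join(additional_python_modules))
    return "\n".join(lines)


# Values taken from the examples on this page.
cell_text = build_config_cell(
    idle_timeout_minutes=15,
    worker_type="G.2X",
    number_of_workers=40,
    additional_python_modules=["awswrangler", "s3://mybucket/mymodule.whl"],
)
print(cell_text)
```

Paste the printed text into a cell and run it before you execute any code cell, because the %idle_timeout magic must be set before the session starts.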

Stop a notebook session

To stop a notebook session, use the magic %stop_session.

If you navigate away from the notebook in the AWS console, you will receive a warning message where you can choose to stop the session.