Configuring AWS Glue interactive sessions for Jupyter and AWS Glue Studio notebooks
Introduction to Jupyter Magics
Jupyter Magics are commands that can be run at the beginning of a cell or as a whole cell body. Magics start with % for line-magics and %% for cell-magics. Line-magics such as %region and %connections can be run with multiple magics in a cell, or with code included in the cell body like the following example.
%region us-east-2
%connections my_rds_connection
dy_f = glue_context.create_dynamic_frame.from_catalog(database='rds_tables', table_name='sales_table')
Cell magics must use the entire cell and can have the command span multiple lines. An example of %%sql is below.
%%sql
select * from rds_tables.sales_table
Magics supported by AWS Glue interactive sessions for Jupyter
The following are magics that you can use with AWS Glue interactive sessions for Jupyter notebooks.
Sessions magics
Name | Type | Description |
n/a | Return a list of descriptions and input types for all magic commands. |
%profile | String | Specify a profile in your AWS configuration to use as the credentials provider. |
%region | String | Specify the AWS Region in which to initialize a session. Default from Example: |
%idle_timeout | Int | The number of minutes of inactivity after which a session times out after a cell has been executed. The default idle timeout for Spark ETL sessions is 2880 minutes (48 hours). For other session types, consult the documentation for that session type. Example: |
%session_id | n/a | Return the session ID for the running session. |
%session_id_prefix | String | Define a string that will precede all session IDs in the format [session_id_prefix]-[session_id]. If a session ID is not provided, a random UUID will be generated. This magic is not supported when you run a Jupyter Notebook in AWS Glue Studio. Example: |
%status | | Return the status of the current AWS Glue session, including its duration, configuration, and executing user or role. |
| | Stop the current session. |
%list_sessions | | List all currently running sessions by name and ID. |
%session_type | String | Set the session type to one of Streaming, ETL, or Ray. Example: |
%glue_version | String | The version of AWS Glue to be used by this session. Example: |
Magics for selecting job types
Name | Type | Description |
%streaming | String | Changes the session type to AWS Glue Streaming. |
%etl | String | Changes the session type to AWS Glue ETL. |
%glue_ray | String | Changes the session type to AWS Glue for Ray. See Magics supported by AWS Glue Ray interactive sessions. |
AWS Glue for Spark config magics
The %%configure magic is a JSON-formatted dictionary consisting of all configuration parameters for a session. Each parameter can be specified here or through individual magics.
Name | Type | Description |
| Dictionary | Specify a JSON-formatted dictionary consisting of all configuration parameters for a session. Each parameter can be specified here or through individual magics. For a list of parameters and examples on how to use |
%iam_role | String | Specify an IAM role ARN to execute your session with. Default from ~/.aws/config. Example: |
%number_of_workers | Int | The number of workers of a defined worker_type that are allocated when a job runs. Example: |
%additional_python_modules | List | Comma-separated list of additional Python modules to include in your cluster (can be from PyPI or S3). Example: |
%%tags | String | Adds tags to a session. Specify the tags within curly brackets { }. Each tag name pair is enclosed in quotation marks (" ") and separated by a comma (,). Use the |
%%assume_role | Dictionary | Specify a JSON-formatted dictionary or an IAM role ARN string to create a session for cross-account access. Example with ARN: Example with credentials: |
%%configure cell magic arguments
The %%configure magic is a JSON-formatted dictionary consisting of all configuration parameters for a session. Each parameter can be specified here or through individual magics. See below for examples of arguments supported by the %%configure cell magic. Use the -- prefix for run arguments specified for the job. Example:
%%configure
{
    "--user-jars-first": "true",
    "--enable-glue-datacatalog": "false"
}
For more information on job parameters, see Job parameters.
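The dictionary for a %%configure cell can also be assembled in Python rather than typed by hand. The following is a minimal sketch, not part of AWS Glue: `build_configure_cell` is a hypothetical helper, and it assumes that run arguments keep their `--` prefix and that Boolean values are written as the strings "true" and "false", as in the example above.

```python
import json


def build_configure_cell(run_args: dict) -> str:
    """Return the text of a %%configure cell for the given settings.

    Hypothetical helper for illustration; AWS Glue does not provide it.
    """
    # Booleans become the lowercase strings "true"/"false" to match the
    # example above; every other value is passed through unchanged.
    body = {
        key: str(value).lower() if isinstance(value, bool) else value
        for key, value in run_args.items()
    }
    return "%%configure\n" + json.dumps(body, indent=4)


cell = build_configure_cell(
    {"--user-jars-first": True, "--enable-glue-datacatalog": False}
)
print(cell)
```

The printed text can be pasted into a notebook cell as-is. Building the body with `json.dumps` guards against the quoting and trailing-comma mistakes that are easy to make when editing JSON by hand.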
Session Configuration
Parameter | Type | Description |
max_retries | Int | The maximum number of times to retry this job if it fails. |
max_concurrent_runs | Int | The maximum number of concurrent runs allowed for a job. |
Session parameters
Parameter | Type | Description |
--enable-spark-ui | Boolean | Enable Spark UI to monitor and debug AWS Glue ETL jobs. |
--spark-event-logs-path | String | Specifies an Amazon S3 path to use with the Spark UI monitoring feature. |
--script_location | String | Specifies the S3 path to a script that executes a job. |
| String | The name of an AWS Glue security configuration. Example: |
--job-language | String | The script programming language. Accepts a value of 'scala' or 'python'. Default is 'python'. |
--class | String | The Scala class that serves as the entry point for your Scala script. Default is null. |
--user-jars-first | Boolean | Prioritizes the customer's extra JAR files in the classpath. Default is null. |
--use-postgres-driver | Boolean | Prioritizes the Postgres JDBC driver in the classpath to avoid a conflict with the Amazon Redshift JDBC driver. Default is null. |
--extra-files | List(string) | The Amazon S3 paths to additional files, such as configuration files, that AWS Glue copies to the working directory of your script before executing it. |
--job-bookmark-option | String | Controls the behavior of a job bookmark. Accepts a value of 'job-bookmark-enable', 'job-bookmark-disable', or 'job-bookmark-pause'. Default is 'job-bookmark-disable'. |
--TempDir | String | Specifies an Amazon S3 path to a bucket that can be used as a temporary directory for the job. Default is null. |
--enable-s3-parquet-optimized-committer | Boolean | Enables the EMRFS Amazon S3-optimized committer for writing Parquet data into Amazon S3. Default is 'true'. |
--enable-rename-algorithm-v2 | Boolean | Sets the EMRFS rename algorithm version to version 2. Default is 'true'. |
--enable-glue-datacatalog | Boolean | Enables you to use the AWS Glue Data Catalog as an Apache Spark Hive metastore. |
--enable-metrics | Boolean | Enables the collection of metrics for job profiling for the job run. Default is 'false'. |
--enable-continuous-cloudwatch-log | Boolean | Enables real-time continuous logging for AWS Glue jobs. Default is 'false'. |
--enable-continuous-log-filter | Boolean | Specifies a standard filter or no filter when you create or edit a job enabled for continuous logging. Default is 'true'. |
--continuous-log-stream-prefix | String | Specifies a custom Amazon CloudWatch log stream prefix for a job enabled for continuous logging. Default is null. |
--continuous-log-conversionPattern | String | Specifies a custom conversion log pattern for a job enabled for continuous logging. Default is null. |
--conf | String | Controls Spark config parameters. It is for advanced use cases. Use --conf before each parameter. |
timeout | Int | Determines the maximum amount of time that the Spark session should wait for a statement to complete before terminating it. |
auto-scaling | Boolean | Determines whether or not to use auto-scaling. |
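The --conf row says to use --conf before each parameter. The sketch below shows one way several Spark settings might be folded into a single value. It rests on an assumption that the text above does not state: the first setting follows the "--conf" key directly, and each later setting is preceded by a literal `--conf`. `join_spark_conf` is a hypothetical helper, and the two Spark properties are only illustrative.

```python
def join_spark_conf(settings: dict) -> str:
    """Join Spark settings into one string for the "--conf" key.

    Assumption: every setting after the first is prefixed with "--conf".
    """
    pairs = [f"{key}={value}" for key, value in settings.items()]
    return " --conf ".join(pairs)


conf_value = join_spark_conf(
    {
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.sql.shuffle.partitions": 10,
    }
)

# The resulting string would be the value of the "--conf" entry
# in a %%configure dictionary.
configure_body = {"--conf": conf_value}
```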
Spark jobs (ETL & streaming) magics
Name | Type | Description |
%worker_type | String | Standard, G.1X, or G.2X. number_of_workers must be set too. The default worker_type is G.1X. |
%connections | List | Specify a comma-separated list of connections to use in the session. Example: |
%extra_py_files | List | Comma-separated list of additional Python files from Amazon S3. |
%extra_jars | List | Comma-separated list of additional jars to include in the cluster. |
%spark_conf | String | Specify custom Spark configurations for your session. For example, %spark_conf spark.serializer=org.apache.spark.serializer.KryoSerializer. |
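Several of the magics in this table are typically set together in one cell. The sketch below assembles such a cell as text and enforces the rule from the table that number_of_workers must be set whenever worker_type is set. `spark_session_magics` is a hypothetical helper, not an AWS Glue API, and the connection name is the one used in the earlier example.

```python
from typing import Optional, Sequence


def spark_session_magics(
    worker_type: Optional[str] = None,
    number_of_workers: Optional[int] = None,
    connections: Sequence[str] = (),
) -> str:
    """Return line magics for a Spark session, one magic per line."""
    allowed = {"Standard", "G.1X", "G.2X"}  # values listed in the table
    lines = []
    if worker_type is not None:
        if worker_type not in allowed:
            raise ValueError(f"worker_type must be one of {sorted(allowed)}")
        if number_of_workers is None:
            raise ValueError("number_of_workers must be set with worker_type")
        lines.append(f"%worker_type {worker_type}")
    if number_of_workers is not None:
        lines.append(f"%number_of_workers {number_of_workers}")
    if connections:
        # %connections takes a comma-separated list.
        lines.append("%connections " + ",".join(connections))
    return "\n".join(lines)


cell_text = spark_session_magics("G.1X", 5, ["my_rds_connection"])
print(cell_text)
```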
Magics for Ray jobs
Name | Type | Description |
%min_workers | Int | The minimum number of workers that are allocated to a Ray job. Default: 1. Example: |
%object_memory_head | Int | The percentage of free memory on the instance head node after a warm start. Minimum: 0. Maximum: 100. Example: |
%object_memory_worker | Int | The percentage of free memory on the instance worker nodes after a warm start. Minimum: 0. Maximum: 100. Example: |
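Both object memory magics take a percentage between 0 and 100. As a small illustration, the value can be range-checked before the magic line is written. `object_memory_magic` is a hypothetical helper and is not part of AWS Glue.

```python
def object_memory_magic(node: str, percent: int) -> str:
    """Return an %object_memory_head or %object_memory_worker line."""
    if node not in ("head", "worker"):
        raise ValueError("node must be 'head' or 'worker'")
    # The table gives a minimum of 0 and a maximum of 100.
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return f"%object_memory_{node} {percent}"


head_line = object_memory_magic("head", 50)
worker_line = object_memory_magic("worker", 100)
```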
Action magics
Name | Type | Description |
%%sql | String | Run SQL code. All lines after the initial Example: |
%matplot | Matplotlib figure | Visualize your data using the matplotlib library. Example: |
%plotly | Plotly figure | Visualize your data using the plotly library. Example: |
Naming sessions
AWS Glue interactive sessions are AWS resources and require a name. Names should be unique for each session and may be restricted by your IAM administrators. For more information, see Interactive sessions with IAM. The Jupyter kernel automatically generates unique session names for you. However, sessions can be named manually in two ways:
Using the AWS Command Line Interface config file located at . See Setting Up AWS Config with the AWS Command Line Interface.
Using the magics. See Magics supported by AWS Glue interactive sessions for Jupyter.
A session name is generated as follows:
When the prefix and session_id are provided: the session name will be {prefix}-{UUID}.
When nothing is provided: the session name will be {UUID}.
Prefixing session names allows you to recognize your session when listing it in the AWS CLI or console.
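The naming rules above can be sketched in a few lines of Python. This only mirrors the rules as stated here and is not the kernel's actual code; `build_session_name` is a hypothetical function.

```python
import uuid
from typing import Optional


def build_session_name(
    prefix: Optional[str] = None, session_id: Optional[str] = None
) -> str:
    """Compose a session name in the form [prefix]-[session_id]."""
    # A random UUID stands in when no session ID is supplied.
    session_id = session_id or str(uuid.uuid4())
    # Without a prefix, the session ID alone is the name.
    return f"{prefix}-{session_id}" if prefix else session_id


named = build_session_name("myprefix", "nightly-load")
generated = build_session_name()
```

Here `named` is "myprefix-nightly-load", while `generated` is a bare random UUID string that differs on every call.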
Specifying an IAM role for interactive sessions
You must specify an AWS Identity and Access Management (IAM) role to use with AWS Glue ETL code that you run with interactive sessions.
The role requires the same IAM permissions as those required to run AWS Glue jobs. See Create an IAM role for AWS Glue for more information on creating a role for AWS Glue jobs and interactive sessions.
IAM roles can be specified in two ways:
Using the AWS Command Line Interface config file located at (Recommended). For more information, see Configuring sessions with ~/.aws/config.
Note
When the magic is used, the configuration for glue_iam_role of that profile is honored.
Using the %iam_role magic. For more information, see Magics supported by AWS Glue interactive sessions for Jupyter.
Configuring sessions with named profiles
AWS Glue interactive sessions use the same credentials as the AWS Command Line Interface or boto3, and interactive sessions honor and work with named profiles like the AWS CLI, found in ~/.aws/config (Linux and macOS) or %USERPROFILE%\.aws\config (Windows). For more information, see Using named profiles.
Interactive sessions take advantage of named profiles by allowing the AWS Glue Service Role and Session ID Prefix to be specified in a profile. To configure a profile role, add a line for the key and/or session_id_prefix to your named profile as shown below. The session_id_prefix does not require quotes. For example, to add a prefix, enter session_id_prefix=myprefix.
[default]
region=us-east-1
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
glue_iam_role=arn:aws:iam::<AccountID>:role/<GlueServiceRole>
session_id_prefix=<prefix_for_session_names>

[user1]
region=eu-west-1
aws_access_key_id=AKIAI44QH8DHBEXAMPLE
aws_secret_access_key=je7MtGbClwBF/2Zp9Utk/h3yCo8nvbEXAMPLEKEY
glue_iam_role=arn:aws:iam::<AccountID>:role/<GlueServiceRoleUser1>
session_id_prefix=<prefix_for_session_names_for_user1>
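To check what a profile like the one above provides, the file can be read with Python's standard `configparser` module. This is an illustrative sketch only: `read_glue_profile` is a hypothetical helper, the account ID and role name are placeholders, and the real kernel may resolve these values differently.

```python
import configparser

# Placeholder profile text; the account ID and role name are examples only.
SAMPLE = """
[default]
region=us-east-1
glue_iam_role=arn:aws:iam::111122223333:role/ExampleGlueRole
session_id_prefix=myprefix
"""


def read_glue_profile(config_text: str, profile: str = "default") -> dict:
    """Return the session-related keys found in one named profile."""
    # Interpolation is disabled so a literal "%" in a value is left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    section = parser[profile]
    return {
        "region": section.get("region"),
        "glue_iam_role": section.get("glue_iam_role"),
        "session_id_prefix": section.get("session_id_prefix"),
    }


settings = read_glue_profile(SAMPLE)
```

Keys that are absent from the profile come back as `None`, which makes it easy to see what still has to be supplied some other way.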
If you have a custom method of generating credentials, you can also configure your profile to use the credential_process parameter in your file. For example:
[profile developer]
region=us-east-1
credential_process = "/Users/Dave/" --username helen
You can find more details about sourcing credentials through the credential_process parameter here: Sourcing credentials with an external process.
If a region or iam_role is not set in the profile that you are using, you must specify it using the %region or %iam_role magic in the first cell that you run.
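The fallback rule above can be expressed as a small check. The sketch assumes the profile has already been loaded into a dictionary and that the role is stored under the glue_iam_role key, as in the sample profile earlier. `magics_required_in_first_cell` is a hypothetical name.

```python
from typing import List, Mapping


def magics_required_in_first_cell(profile: Mapping[str, str]) -> List[str]:
    """List the magics that must be run because the profile lacks a value."""
    required = []
    if not profile.get("region"):
        required.append("%region")
    # Assumption: the profile stores the role under glue_iam_role.
    if not profile.get("glue_iam_role"):
        required.append("%iam_role")
    return required


missing = magics_required_in_first_cell({"region": "us-east-1"})
```

With only a region in the profile, `missing` contains just "%iam_role", so that magic would need to appear in the first cell.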