Connecting to a Spark Connect session via API
Spark Connect provides a thin client interface that you can use to connect to AWS Glue interactive sessions from any environment that supports PySpark. Unlike Livy-based sessions, Spark Connect does not require a Jupyter kernel installation. You can connect directly from a Python script, a notebook, or an IDE such as VS Code.
This topic describes how to create a Spark Connect session, retrieve the connection endpoint, and connect from your local environment.
Connecting with Boto3 SDK
Use the AWS SDK for Python (Boto3) to create a session, get the Spark Connect endpoint, and connect using PySpark.
-
Install the required libraries.

    pip install boto3 "pyspark==3.5.6" pandas pyarrow grpcio grpcio-status

-
Create an AWS Glue Spark Connect session.

    import time
    import urllib.parse

    import boto3
    from pyspark.sql import SparkSession

    glue = boto3.client("glue", region_name="us-east-1")

    glue.create_session(
        Id="my-spark-connect-session",
        Role="arn:aws:iam::123456789012:role/GlueRole",
        Command={"Name": "glueetl"},
        GlueVersion="5.1",
        SessionType="SPARK_CONNECT",
        IdleTimeout=60,
        Timeout=60,
        DefaultArguments={"--language": "python"},
    )

-
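Wait for the session to reach READY state. This typically takes 20–30 seconds.

    session_id = "my-spark-connect-session"

    while True:
        status = glue.get_session(Id=session_id)["Session"]["Status"]
        if status == "READY":
            break
        # Stop polling if the session ended in a terminal state instead of READY.
        if status in ("FAILED", "TIMEOUT", "STOPPED"):
            raise RuntimeError(f"Session {session_id} entered state {status}")
        time.sleep(10)

-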
Retrieve the Spark Connect endpoint and build the authenticated remote URL.

    def get_remote_url(glue, session_id):
        """Get endpoint and construct the authenticated remote URL."""
        resp = glue.get_session_endpoint(SessionId=session_id)
        sc = resp["SparkConnect"]
        token = urllib.parse.quote(sc["AuthToken"], safe="")
        return f"{sc['Url']}:443/;use_ssl=true;x-aws-proxy-auth={token}", sc["AuthTokenExpirationTime"]

    remote, token_expiry = get_remote_url(glue, session_id)

The get_session_endpoint response includes:

  - Url – The Spark Connect endpoint URL.
  - AuthToken – A temporary authentication token for the session.
  - AuthTokenExpirationTime – The time at which the token expires, represented as a Unix epoch timestamp.
-
Start the Spark session.

    spark = SparkSession.builder.remote(remote).getOrCreate()
    # Print the version so the check also produces output when run as a script.
    print(spark.version)

-
(Optional) Set up automatic token refresh. The authentication token expires at the time indicated by AuthTokenExpirationTime. Use a background thread to refresh the token before it expires.

    import threading

    def schedule_refresh(token_expiry):
        """Schedule a reconnect 60 seconds before expiry, waiting at least 10 seconds."""
        expires_in = token_expiry - time.time()
        delay = max(expires_in - 60, 10)
        timer = threading.Timer(delay, reconnect)
        timer.daemon = True
        timer.start()

    def reconnect():
        global spark
        remote, token_expiry = get_remote_url(glue, session_id)
        spark = SparkSession.builder.remote(remote).getOrCreate()
        # Schedule the next refresh using the new token's expiry
        schedule_refresh(token_expiry)

    # Schedule the first refresh
    schedule_refresh(token_expiry)

-
Read data from the AWS Glue Data Catalog.

    df = spark.read.table("my_database.my_table")
    df.show()
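
The waiting, URL-building, and refresh-timing logic from the steps above can also be factored into small helpers that are easy to test without an AWS account. The following is a minimal sketch, not part of the AWS SDK: the function names, the 300-second default timeout, and the set of terminal states treated as failures are illustrative assumptions. The remote string follows the same format as the endpoint step above.

```python
import time
import urllib.parse

# Assumption: these statuses mean the session will never become READY.
TERMINAL_STATES = {"FAILED", "TIMEOUT", "STOPPED"}


def wait_for_ready(glue, session_id, timeout=300, interval=10, sleep=time.sleep):
    """Poll get_session until the session is READY, fails, or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        status = glue.get_session(Id=session_id)["Session"]["Status"]
        if status == "READY":
            return status
        if status in TERMINAL_STATES:
            raise RuntimeError(f"Session {session_id} ended in state {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Session {session_id} not READY after {timeout}s")
        sleep(interval)


def build_remote_url(url, auth_token):
    """Build the remote string: URL-encode the token and append the connection options."""
    token = urllib.parse.quote(auth_token, safe="")
    return f"{url}:443/;use_ssl=true;x-aws-proxy-auth={token}"


def refresh_delay(expiry_epoch, now, lead=60, floor=10):
    """Seconds to wait before refreshing: `lead` seconds before expiry, never below `floor`."""
    return max(expiry_epoch - now - lead, floor)
```

With a real client, `wait_for_ready(glue, session_id)` could stand in for the polling loop, and `refresh_delay(token_expiry, time.time())` for the delay calculation in the token refresh step. Passing `sleep` as a parameter is a design choice that lets tests skip the real waits.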
Connecting with Spark utilities
If you use Notebooks in Amazon SageMaker Unified Studio, the sagemaker-studio Python library provides Spark utilities that simplify connecting to AWS Glue Spark Connect sessions.
-
Install the sagemaker-studio library.

    pip install sagemaker-studio

-
Initialize a Spark session using your AWS Glue Spark Connect connection.

    from sagemaker_studio import sparkutils

    spark = sparkutils.init(connection_name="my-glue-spark-connection")

-
Read data from the AWS Glue Data Catalog.

    df = spark.read.table("my_database.my_table")
    df.show()
For more information about the Spark utilities module, see Spark Utilities in the sagemaker-studio library documentation.