Connect to an Amazon EMR cluster from SageMaker Studio or Studio Classic - Amazon SageMaker

Users of Studio can connect to their running Amazon EMR clusters from a JupyterLab notebook using their default SageMaker Distribution Images. Users of Studio Classic can connect to their clusters from a Studio Classic notebook using any of the supported kernels.

Connect to an Amazon EMR cluster using the Studio UI

To connect to your cluster using the Studio or Studio Classic UI, you can initiate a connection either from the list of clusters described in List Amazon EMR clusters from Studio or Studio Classic, or from a notebook in SageMaker Studio or Studio Classic.

To connect to a particular cluster from your list of clusters
  1. Choose the name of the cluster in your list. This activates the Attach to new notebook button.

  2. Choose Attach to new notebook. This opens up the images and kernels selection box.

  3. Select your image and kernel, then choose Select. For a list of supported images, see Supported images and kernels to connect to an Amazon EMR cluster from Studio or Studio Classic or refer to Bring your own image.

  4. If the cluster you select does not use Kerberos, LDAP, or runtime role authentication, Studio or Studio Classic prompts you to select the credential type. Choose HTTP basic authentication or No credentials, then enter your credentials, if applicable. A connection command populates the first cell of your notebook and initiates the connection with the Amazon EMR cluster.

    Once the connection succeeds, a message confirms the connection and the start of the Spark application.

Alternatively, you can connect to a cluster from a notebook.
  1. Choose Cluster at the top of your notebook.

    Cluster is only visible when you use a kernel from Supported images and kernels to connect to an Amazon EMR cluster from Studio or Studio Classic or from Bring your own image. If you cannot see Cluster at the top of your notebook, ensure that your administrator has configured the discoverability of your clusters and switch to a supported kernel.

    This opens up a list of available clusters in a Running state.

  2. Select the cluster to which you want to connect, then choose Connect.

  3. If you configured your Amazon EMR clusters to support runtime IAM roles and your administrator preloaded your roles in an execution role configuration JSON, you can select your Amazon EMR access role from the Amazon EMR execution role dropdown menu. If your roles are not preloaded, Studio or Studio Classic uses your Studio or Studio Classic execution role by default. For information about using runtime roles with Amazon EMR, see Connect to an Amazon EMR cluster from Studio Classic using runtime IAM roles. When you connect to a cluster, Studio or Studio Classic adds a code block to an active cell to establish the connection.

    Otherwise, if the cluster you choose does not use Kerberos, LDAP, or runtime role authentication, Studio or Studio Classic prompts you to select the credential type. You can choose HTTP basic authentication or No credential.

  4. An active cell populates and runs. This cell contains the connection command to connect to your Amazon EMR cluster.

    Once the connection succeeds, a message confirms the connection and the start of the Spark application.

Connect to an Amazon EMR cluster using a connection command

To establish a connection to an Amazon EMR cluster, you can execute connection commands within a notebook cell.

When establishing the connection, you can authenticate using Kerberos, Lightweight Directory Access Protocol (LDAP), or runtime IAM role authentication. The authentication method you choose depends on your cluster configuration.

To set up an Amazon EMR cluster that uses Kerberos authentication, see the example Access Apache Livy using a Network Load Balancer on a Kerberos-enabled Amazon EMR cluster. Alternatively, you can explore the CloudFormation example templates that use Kerberos or LDAP authentication in the aws-samples/sagemaker-studio-emr GitHub repository.

If your administrator has enabled cross-account access, you can connect to your Amazon EMR cluster from a Studio Classic notebook, regardless of whether your Studio Classic application and cluster reside in the same AWS account or different accounts.

For each of the following authentication types, use the specified command to connect to your cluster from your Studio or Studio Classic notebook.

  • Kerberos

    Append the --assumable-role-arn argument if you need cross-account Amazon EMR access. Append the --verify-certificate argument if you connect to your cluster with HTTPS.

    %load_ext sagemaker_studio_analytics_extension.magics
    %sm_analytics emr connect --cluster-id cluster_id \
    --auth-type Kerberos --language python
    [--assumable-role-arn EMR_access_role_ARN]
    [--verify-certificate /home/user/certificateKey.pem]

  • LDAP

    Append the --assumable-role-arn argument if you need cross-account Amazon EMR access. Append the --verify-certificate argument if you connect to your cluster with HTTPS.

    %load_ext sagemaker_studio_analytics_extension.magics
    %sm_analytics emr connect --cluster-id cluster_id \
    --auth-type Basic_Access --language python
    [--assumable-role-arn EMR_access_role_ARN]
    [--verify-certificate /home/user/certificateKey.pem]

  • NoAuth

    Append the --assumable-role-arn argument if you need cross-account Amazon EMR access. Append the --verify-certificate argument if you connect to your cluster with HTTPS.

    %load_ext sagemaker_studio_analytics_extension.magics
    %sm_analytics emr connect --cluster-id cluster_id \
    --auth-type None --language python
    [--assumable-role-arn EMR_access_role_ARN]
    [--verify-certificate /home/user/certificateKey.pem]

  • Runtime IAM roles

    Append the --assumable-role-arn argument if you need cross-account Amazon EMR access. Append the --verify-certificate argument if you connect to your cluster with HTTPS.

    For more information on connecting to an Amazon EMR cluster using runtime IAM roles, see Connect to an Amazon EMR cluster from Studio Classic using runtime IAM roles.

    %load_ext sagemaker_studio_analytics_extension.magics
    %sm_analytics emr connect --cluster-id cluster_id \
    --auth-type Basic_Access \
    --emr-execution-role-arn arn:aws:iam::studio_account_id:role/emr-execution-role-name
    [--assumable-role-arn EMR_access_role_ARN]
    [--verify-certificate /home/user/certificateKey.pem]
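
The four commands above differ only in the --auth-type value and in which optional arguments are appended. As a minimal sketch of that pattern, the following Python helper assembles the command text from the flags shown above. The names build_connect_command and AUTH_TYPES, the method keys, and the example cluster ID and role ARN are hypothetical and are not part of the sagemaker_studio_analytics_extension package. The helper only builds a string and does not contact Amazon EMR.

```python
# --auth-type value for each authentication method, as used in the commands above.
# The dictionary keys are labels chosen for this sketch only.
AUTH_TYPES = {
    "kerberos": "Kerberos",
    "ldap": "Basic_Access",
    "noauth": "None",
    "runtime_role": "Basic_Access",
}


def build_connect_command(cluster_id, method, emr_execution_role_arn=None,
                          assumable_role_arn=None, verify_certificate=None):
    """Return the %sm_analytics line for a cluster and authentication method.

    Hypothetical helper: it concatenates the documented flags and nothing more.
    """
    if method not in AUTH_TYPES:
        raise ValueError(f"unknown authentication method: {method!r}")

    args = ["--cluster-id", cluster_id, "--auth-type", AUTH_TYPES[method]]

    if method == "runtime_role":
        # The runtime role command passes the execution role instead of --language.
        if not emr_execution_role_arn:
            raise ValueError("runtime roles require emr_execution_role_arn")
        args += ["--emr-execution-role-arn", emr_execution_role_arn]
    else:
        args += ["--language", "python"]

    if assumable_role_arn:   # only needed for cross-account Amazon EMR access
        args += ["--assumable-role-arn", assumable_role_arn]
    if verify_certificate:   # only needed when connecting over HTTPS
        args += ["--verify-certificate", verify_certificate]

    return "%sm_analytics emr connect " + " ".join(args)


# Example with placeholder values; paste the printed line into a notebook cell
# after loading the extension with %load_ext.
print(build_connect_command("j-EXAMPLE", "kerberos"))
```

Keeping the method-to-flag mapping in one place makes it harder to, for example, pass --language python with a runtime role, which the runtime role command above does not do.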

Connect to an Amazon EMR cluster over HTTPS

If your Amazon EMR cluster has in-transit encryption enabled and its Apache Livy server is configured for HTTPS, and you want Studio or Studio Classic to communicate with Amazon EMR over HTTPS, you must configure Studio or Studio Classic to access your certificate key.

For self-signed or local Certificate Authority (CA) signed certificates, you can do this in two steps:

  1. Download the PEM file of your certificate to your local file system using one of the following options:

  2. Enable the validation of the certificate by providing the path to your certificate in the --verify-certificate argument of your connection command.

    %sm_analytics emr connect --cluster-id cluster_id \
    --verify-certificate /home/user/certificateKey.pem ...

For certificates issued by a public CA, enable certificate validation by setting the --verify-certificate parameter to true.

Alternatively, you can disable certificate validation by setting the --verify-certificate parameter to false.
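
The three cases above (a PEM path for self-signed or local CA certificates, true for public CA certificates, false to skip validation) can be summarized in a short Python sketch. The function name verify_certificate_value and the labels "private", "public", and "skip" are hypothetical and exist only for this illustration. The returned strings follow the values named in the text above.

```python
def verify_certificate_value(certificate_kind, pem_path=None):
    """Return the value to pass to --verify-certificate.

    certificate_kind is a label used only in this sketch:
      "private" - self-signed or local CA signed certificate (needs pem_path)
      "public"  - certificate issued by a public CA
      "skip"    - disable certificate validation
    """
    if certificate_kind == "private":
        if not pem_path:
            raise ValueError("a self-signed or local CA certificate needs a PEM file path")
        return pem_path
    if certificate_kind == "public":
        return "true"
    if certificate_kind == "skip":
        return "false"
    raise ValueError(f"unknown certificate kind: {certificate_kind!r}")


# Example: build the argument pair for a locally signed certificate.
value = verify_certificate_value("private", "/home/user/certificateKey.pem")
print("--verify-certificate " + value)
```

Disabling validation removes the check that the notebook is talking to the intended Livy endpoint, so the "skip" case is best kept for testing.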

You can find the list of available connection commands to an Amazon EMR cluster in Connect to an Amazon EMR cluster using a connection command.