Connect to an Amazon EMR Cluster from Studio - Amazon SageMaker

This guide explains how you can connect to an Amazon EMR cluster from SageMaker Studio with the PySpark kernel selected.

To connect to an Amazon EMR cluster with the PySpark kernel selected

  1. After you connect to Studio, if you have an existing Studio notebook instance, open it. Otherwise, to create a new notebook instance, select File, and then select New.

  2. After you have an open Studio notebook instance, choose a kernel and instance.

    Note

    Only a subset of kernels can connect to an Amazon EMR cluster. The supported images are Data Science and SparkMagic. The supported kernels are PySpark from the SparkMagic image and Python3 (IPython) from the Data Science image. Studio supports both PySpark and Scala kernels.

    To switch your kernel, select the currently selected kernel at the top right of the UI. A pop-up window appears. Then choose a kernel from the kernel drop-down menu. Finally, choose Select to apply your change.

  3. After you have selected your kernel of choice, select Cluster.

  4. A Connect to cluster UI screen will appear. Choose a cluster and select Connect. Not all Amazon EMR clusters can be connected to Studio. For more information, see Perform interactive data processing using Spark in Studio Notebooks.

    1. When you connect to a cluster, Studio adds a code block to the active cell to establish the connection.

  5. If the cluster that you're connecting to does not use Kerberos or Lightweight Directory Access Protocol (LDAP) authentication, you will be prompted to select the credential type. You can choose HTTP basic authentication or No credential.

  6. The active cell is populated with the connection information that you need to connect to the Amazon EMR cluster that you selected.

    1. When the authentication type is Kerberos or HTTP Basic Auth, a widget is created in the active cell for you to provide your Username and Password. The following screenshot shows a successful connection after entering these credentials.

    2. If the cluster that you are connecting to does not use Kerberos or LDAP, and you selected No credential, you will automatically connect to the Amazon EMR cluster that you selected. The following screenshot shows the UI after credentials have been successfully entered.

    • This step is optional. If you want to change the Amazon EMR cluster that the Studio notebook is connected to, select Cluster at the top-right of your notebook. After selecting Cluster, browse the list of clusters and select a different cluster.
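
The credential handling described in steps 5 and 6 can be summarized as a small decision function. The following is a minimal sketch in Python, not Studio's implementation: the names `ClusterAuth`, `CredentialChoice`, and `connection_flow` are hypothetical and exist only for illustration. The guide does not state what happens for an LDAP cluster, so the sketch assumes that it asks for a user name and password in the same way as Kerberos.

```python
from enum import Enum
from typing import Optional


class ClusterAuth(Enum):
    """Authentication configured on the cluster (hypothetical names)."""
    KERBEROS = "kerberos"
    LDAP = "ldap"
    NONE = "none"


class CredentialChoice(Enum):
    """Options offered in step 5 when the cluster uses neither Kerberos nor LDAP."""
    HTTP_BASIC = "HTTP basic authentication"
    NO_CREDENTIAL = "No credential"


def connection_flow(cluster_auth: ClusterAuth,
                    choice: Optional[CredentialChoice] = None) -> str:
    """Return what the notebook does next: 'credentials_widget' or 'auto_connect'."""
    if cluster_auth is ClusterAuth.KERBEROS:
        # Step 6.1: a widget asks for Username and Password.
        return "credentials_widget"
    if cluster_auth is ClusterAuth.LDAP:
        # Assumption: LDAP is treated like Kerberos; the guide does not say.
        return "credentials_widget"
    # The cluster uses neither Kerberos nor LDAP, so step 5 requires a choice.
    if choice is None:
        raise ValueError("a credential type must be selected for this cluster")
    if choice is CredentialChoice.HTTP_BASIC:
        # Step 6.1: HTTP Basic Auth also uses the Username/Password widget.
        return "credentials_widget"
    # Step 6.2: with No credential, the connection is made automatically.
    return "auto_connect"
```

For example, `connection_flow(ClusterAuth.NONE, CredentialChoice.NO_CREDENTIAL)` returns `"auto_connect"`, while a Kerberos cluster always leads to the credentials widget.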

For more information on required permissions, see Required Permissions.

Connect Amazon EMR Clusters Across Accounts

If you have set up cross-account discoverability and connectivity, selecting Cluster lists the clusters from both the Studio account and the remote accounts. After you select Connect, Studio initiates and establishes a connection to the Amazon EMR cluster in the remote account. The following screenshot shows this connection.
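
To illustrate what the combined cluster list represents, the following sketch gathers clusters from the Studio account and one remote account with boto3. It is an illustration under assumptions, not a description of how Studio discovers clusters: the function names, the session name, the `role_arn` and `region` arguments, and the choice of cluster states are all hypothetical, and the remote role is assumed to already allow `sts:AssumeRole` from the role that runs this code.

```python
from typing import Any, Dict, List

# Assumption: only clusters in these states are worth listing.
ACTIVE_STATES = ["RUNNING", "WAITING"]


def list_active_clusters(emr_client: Any, account_label: str) -> List[Dict[str, str]]:
    """List active clusters visible to one EMR client, tagged with an account label."""
    paginator = emr_client.get_paginator("list_clusters")
    clusters = []
    for page in paginator.paginate(ClusterStates=ACTIVE_STATES):
        for cluster in page.get("Clusters", []):
            clusters.append({
                "account": account_label,
                "id": cluster["Id"],
                "name": cluster["Name"],
            })
    return clusters


def remote_emr_client(role_arn: str, region: str) -> Any:
    """Build an EMR client for a remote account by assuming a role there."""
    import boto3  # imported lazily so the listing helpers work with any client object

    credentials = boto3.client("sts").assume_role(
        RoleArn=role_arn,
        RoleSessionName="studio-emr-discovery",  # hypothetical session name
    )["Credentials"]
    return boto3.client(
        "emr",
        region_name=region,
        aws_access_key_id=credentials["AccessKeyId"],
        aws_secret_access_key=credentials["SecretAccessKey"],
        aws_session_token=credentials["SessionToken"],
    )


def merged_cluster_list(local_client: Any, remote_client: Any) -> List[Dict[str, str]]:
    """Combine clusters from the Studio account and a remote account into one list."""
    return (list_active_clusters(local_client, "studio")
            + list_active_clusters(remote_client, "remote"))
```

With real credentials, the local client would come from `boto3.client("emr")` and the remote client from `remote_emr_client(...)` with a role ARN that you supply. Passing the clients in as arguments keeps the listing logic independent of how each account is reached.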