
Set Up a Connection to an Amazon EMR Cluster

Amazon EMR is a big data platform for processing vast amounts of data. The central component of Amazon EMR is the cluster. A cluster is a collection of Amazon EC2 instances. Apache Spark is a distributed processing framework that runs on Amazon EMR. For more information, see What Is Amazon EMR? and Apache Spark.

Amazon SageMaker Studio comes with a SageMaker SparkMagic image that contains a PySpark kernel. The SparkMagic image also contains an AWS CLI utility, sm-sparkmagic, that you can use to create the configuration files required for the PySpark kernel to connect to the Amazon EMR cluster. After creating the configuration files, the utility displays the steps required to finish the setup.

For added security, you can specify that the connection to the EMR cluster uses Kerberos authentication. For more information, see Use Kerberos Authentication.

Prerequisites

  • Access to SageMaker Studio that's set up to use Amazon VPC mode. For more information, see Choose a VPC.

  • An Amazon EMR cluster in the same VPC as Studio or in a VPC that's connected to the same VPC as Studio.

  • If you use the sm-sparkmagic utility, the IAM execution role associated with your Studio user profile must contain the following extra permissions. To find the execution role, choose your user name in the SageMaker Studio Control Panel.

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "elasticmapreduce:DescribeCluster",
                    "elasticmapreduce:DescribeSecurityConfiguration",
                    "elasticmapreduce:ListInstances"
                ],
                "Resource": [
                    "arn:aws:elasticmapreduce:*:*:cluster/*"
                ]
            }
        ]
    }
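As a minimal sketch of how the permissions above could be added to the execution role programmatically, the following Python builds the same policy document and attaches it as an inline policy. It assumes `iam_client` behaves like `boto3.client("iam")`, whose `put_role_policy` call takes `RoleName`, `PolicyName`, and `PolicyDocument`. The function name `attach_emr_policy`, the default policy name, and the role name in the usage note are hypothetical and chosen only for illustration.

```python
import json

# Policy statement copied from the prerequisite above.
EMR_CONNECT_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "elasticmapreduce:DescribeCluster",
                "elasticmapreduce:DescribeSecurityConfiguration",
                "elasticmapreduce:ListInstances",
            ],
            "Resource": ["arn:aws:elasticmapreduce:*:*:cluster/*"],
        }
    ],
}


def attach_emr_policy(iam_client, role_name, policy_name="StudioEmrConnect"):
    """Attach the policy inline to the given execution role.

    iam_client is assumed to behave like boto3.client("iam").
    role_name and policy_name are placeholders for this sketch.
    """
    iam_client.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(EMR_CONNECT_POLICY),
    )
```

With boto3 installed and credentials that are allowed to modify the role, a call might look like `attach_emr_policy(boto3.client("iam"), "example-studio-execution-role")`, where the role name is a placeholder for the execution role shown in the Studio Control Panel.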

To set up a connection to an EMR cluster

  1. Open SageMaker Studio.

  2. In the upper-left corner of Studio, choose Amazon SageMaker Studio to open Studio Launcher.

  3. On the Launcher page, choose Notebooks and compute resources.

  4. For Select a SageMaker image, choose the SparkMagic image.

  5. Choose Notebook to create a Studio notebook in the SparkMagic image.

  6. Run the following code in a notebook cell to create the configuration files used to connect to the EMR cluster. %%local ensures that the code runs in the local image instead of on Spark.

    • If the EMR cluster is not configured for Kerberos authentication, run the following command:

      %%local
      ! sm-sparkmagic connect --cluster-id "cluster-id"

      The output should be similar to the following:

      Successfully read emr cluster(cluster-id) details
      SparkMagic config file location: /etc/sparkmagic/config.json

    • If the EMR cluster is configured for Kerberos authentication, run the following command:

      %%local
      ! sm-sparkmagic connect --cluster-id "cluster-id" --user-name "user-name"

      The output should be similar to the following:

      Successfully read emr cluster(cluster-id) details
      SparkMagic config file location: /etc/sparkmagic/config.json
      Kerberos configuration file location: /etc/krb5.conf

  7. To complete the setup, do one of the following:

    • For EMR clusters that are not configured for Kerberos authentication, go to step 8.

    • For EMR clusters that are configured for Kerberos authentication, do the following:

      1. In the notebook toolbar, choose the Launch terminal icon to open a terminal in the same SparkMagic image as the notebook.

      2. Run the following command in the terminal to get the Kerberos ticket:

        kinit user-name

      3. Enter your password when prompted.

  8. In the notebook toolbar, choose the Restart kernel icon to complete the setup.

  9. To verify that the connection was set up correctly, run the following command in a notebook cell:

    %%info

    The output should be similar to the following:

    Current session configs: {'driverMemory': '1000M', 'executorCores': 2, 'kind': 'pyspark'}
    No active sessions.
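The two variants of the command in step 6 differ only in the `--user-name` option, so they can be sketched as one helper. The Python below is an illustration, not part of the sm-sparkmagic utility: `build_connect_command` and `config_sections` are hypothetical names, and `config_sections` assumes that the generated `/etc/sparkmagic/config.json` is a JSON object, without assuming any particular keys inside it.

```python
import json
import shlex


def build_connect_command(cluster_id, user_name=None):
    """Return the sm-sparkmagic command line from step 6 as a string.

    Pass user_name only for clusters configured for Kerberos authentication.
    """
    args = ["sm-sparkmagic", "connect", "--cluster-id", cluster_id]
    if user_name is not None:
        args += ["--user-name", user_name]
    # Quote each argument so the result is safe to paste into a shell.
    return " ".join(shlex.quote(arg) for arg in args)


def config_sections(path="/etc/sparkmagic/config.json"):
    """Return the sorted top-level keys of the generated SparkMagic config.

    Assumes the file is a JSON object; raises if it is missing or invalid.
    """
    with open(path, encoding="utf-8") as handle:
        return sorted(json.load(handle))
```

In a notebook cell that starts with `%%local`, printing `build_connect_command("cluster-id")` shows the non-Kerberos form of the command, and calling `config_sections()` after step 6 gives a quick check that the configuration file exists and parses.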