Prepare Data at Scale with Studio Notebooks

Studio gives data scientists, machine learning (ML) engineers, and general practitioners the tools to perform data analytics and data preparation at scale. From within a Studio notebook, you can visually browse, discover, and connect to Amazon EMR clusters. After you’re connected, you can interactively explore, visualize, and prepare petabyte-scale data for ML using Apache Spark, Hive, and Presto.

Analyzing, transforming, and preparing large amounts of data is a foundational step of any data science and ML workflow. Running interactive analytics and data preparation on Amazon EMR and SageMaker Studio notebooks can serve as a unified environment for complete data science and data engineering workflows.

Studio also supports a tool to share your notebook with colleagues for collaboration through the UI. With this capability, you can now build ML workflows directly from Studio notebooks. Connecting to an Amazon EMR cluster using SageMaker Studio can also help improve team efficiency by streamlining the setup for ML workflows.

The supported images and kernels for connecting to an Amazon EMR cluster are as follows:

  • Images: Data Science, SparkMagic, PyTorch 1.8, TensorFlow 2.8

  • Kernel: PySpark and Spark kernels for the SparkMagic image under running apps, and Python 3 (IPython) for the Data Science image.

For guided instructions on how to connect to an Amazon EMR cluster from Studio, see Perform interactive data engineering and data science workflows from SageMaker Studio notebooks.

For detailed information about required permissions, see Required Permissions.

Prerequisites

  • You will need access to SageMaker Studio that is set up to use Amazon Virtual Private Cloud (Amazon VPC) mode.

  • All subnets used by SageMaker Studio must be private subnets.

  • If you use the sm-analytics utility to configure the SparkMagic kernel, meet one of the following two prerequisites:

    • Make sure that the Amazon VPC interface endpoint is attached to all of the subnets used by SageMaker Studio.

    • Ensure that all of the subnets used by SageMaker Studio are routed to use a NAT gateway. For more information, see NAT gateways.

  • If either of the following applies to you, your Amazon EMR cluster must have Spark and Livy installed.

    • Your Amazon EMR cluster is in the same Amazon VPC as Studio.

    • Your cluster is in an Amazon VPC that's connected to the Amazon VPC used by Studio.

  • The security groups for both Amazon SageMaker Studio and Amazon EMR must allow access to and from each other.

  • Your Amazon EMR security group must open port 8998, so that Amazon SageMaker Studio can communicate with the Spark cluster through Livy. For more information about setting up the security group, see Build SageMaker notebooks backed by Spark in Amazon EMR.

  • To connect to an Amazon EMR cluster from Studio, you must first access SageMaker Studio. If you have not set up SageMaker Studio, follow the Get Started guide.

  • If you created a new domain during Studio setup, then discovering an Amazon EMR cluster from Studio should be available to you.
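As a quick sanity check of the networking prerequisites above, the following minimal sketch tests whether a TCP connection can be opened to the Livy port (8998) on a cluster host. It only confirms that the port is reachable from where the code runs; it does not validate Livy itself or your security group configuration. The function name `can_reach_livy` and its parameters are illustrative assumptions, not part of any SageMaker or Amazon EMR API.

```python
import socket

# Port that the prerequisites say the Amazon EMR security group must open for Livy.
LIVY_PORT = 8998


def can_reach_livy(host: str, port: int = LIVY_PORT, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    `host` is whatever address your notebook would use to reach the cluster's
    primary node (a hypothetical input here; supply your own value).
    """
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

From a Studio notebook cell, you might call `can_reach_livy("<your-cluster-primary-node-dns>")`. A result of `False` suggests checking the security group rules and subnet routing described in the list above.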

Bring your own image

If you want to bring your own image, first install the following dependencies in your kernel. The following pip commands install the required libraries.

pip install sparkmagic
pip install sagemaker-studio-sparkmagic-lib
pip install sagemaker-studio-analytics-extension

If any of these libraries is not at its latest version, you can update it manually.

If you want to connect to Amazon EMR with Kerberos authentication, you must install the kinit client. The command to install the kinit client varies by operating system. For an Ubuntu (Debian-based) image, use the apt-get install -y -qq krb5-user command.
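To confirm that a custom image has the dependencies described in this section, a small check along the following lines could be run inside the kernel. This is a sketch that uses only the Python standard library: it reports the installed version of each of the three pip packages named above and whether a kinit executable is on the PATH. The helper names `installed_versions` and `has_kinit` are hypothetical and introduced here for illustration only.

```python
import shutil
from importlib import metadata

# The three libraries listed in the pip commands in this section.
REQUIRED_PACKAGES = (
    "sparkmagic",
    "sagemaker-studio-sparkmagic-lib",
    "sagemaker-studio-analytics-extension",
)


def installed_versions(packages=REQUIRED_PACKAGES):
    """Map each package name to its installed version string, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def has_kinit():
    """Return True if a kinit executable is found on PATH (needed for Kerberos)."""
    return shutil.which("kinit") is not None
```

A `None` value in the result of `installed_versions()` means the corresponding package still needs to be installed with pip. `has_kinit()` only matters if you plan to use Kerberos authentication.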