Prepare data using Amazon EMR - Amazon SageMaker

Prepare data using Amazon EMR

Amazon SageMaker Studio Classic has built-in integration with Amazon EMR, which lets data scientists and data engineers perform petabyte-scale interactive data preparation and machine learning (ML) directly from a Studio Classic notebook. Within a notebook, they can discover and connect to existing Amazon EMR clusters, then interactively explore, visualize, and prepare large-scale data for machine learning using Apache Spark, Apache Hive, or Presto. Users can also open the Spark UI with a single click to monitor their Spark jobs from the same notebook.
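As a rough illustration of what "discovering existing clusters" amounts to, the sketch below queries the Amazon EMR `ListClusters` API for clusters in an active state. It is not the mechanism Studio Classic uses internally, which this page does not describe. The helper name `list_active_clusters` and the choice of the `RUNNING` and `WAITING` states are assumptions made for the example; the client is expected to behave like one created with `boto3.client("emr")`.

```python
from typing import Any, Dict, List

# Assumption: only clusters in these states are worth connecting to.
ACTIVE_STATES = ["RUNNING", "WAITING"]


def list_active_clusters(emr_client: Any) -> List[Dict[str, str]]:
    """Return id, name, and state for every active cluster.

    `emr_client` is any object exposing the EMR `list_clusters` call,
    for example the client returned by boto3.client("emr").
    """
    clusters: List[Dict[str, str]] = []
    kwargs: Dict[str, Any] = {"ClusterStates": ACTIVE_STATES}
    while True:
        page = emr_client.list_clusters(**kwargs)
        for cluster in page.get("Clusters", []):
            clusters.append(
                {
                    "id": cluster["Id"],
                    "name": cluster["Name"],
                    "state": cluster["Status"]["State"],
                }
            )
        # ListClusters paginates with a Marker token; stop when none is returned.
        marker = page.get("Marker")
        if not marker:
            return clusters
        kwargs["Marker"] = marker
```

With AWS credentials and permission to call `elasticmapreduce:ListClusters`, calling `list_active_clusters(boto3.client("emr"))` would return the clusters visible to the caller in the configured Region.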

Administrators can use AWS Service Catalog to define AWS CloudFormation templates for the Amazon EMR clusters that Studio Classic users are allowed to create. Data scientists can then choose a predefined template to self-provision an Amazon EMR cluster directly from an Amazon SageMaker Studio Classic notebook. Administrators can further parameterize the templates so that users can adjust aspects of the cluster, within predefined values, to match their workloads. For example, a data scientist or data engineer may want to specify the number of core nodes in the cluster, up to a predetermined maximum, or select a node's instance type from a dropdown menu.
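The sketch below shows one way such constraints might be expressed, by building the `Parameters` section of a CloudFormation template as a Python dictionary. `Type`, `Default`, `MinValue`, `MaxValue`, and `AllowedValues` are standard CloudFormation parameter properties; the parameter names (`CoreInstanceCount`, `CoreInstanceType`), the limit of 10 nodes, the instance types listed, and both helper functions are hypothetical values chosen for illustration.

```python
import json
from typing import Any, Dict, List

# Hypothetical limits an administrator might choose.
MAX_CORE_NODES = 10
ALLOWED_INSTANCE_TYPES = ["m5.xlarge", "m5.2xlarge", "r5.xlarge"]


def build_parameters(max_core_nodes: int, instance_types: List[str]) -> Dict[str, Any]:
    """Build a CloudFormation Parameters section with bounded user choices."""
    return {
        # Numeric parameter: the user picks a count up to the maximum.
        "CoreInstanceCount": {
            "Type": "Number",
            "Default": min(2, max_core_nodes),
            "MinValue": 1,
            "MaxValue": max_core_nodes,
            "Description": "Number of core nodes in the cluster",
        },
        # Enumerated parameter: a fixed list of permitted values.
        "CoreInstanceType": {
            "Type": "String",
            "Default": instance_types[0],
            "AllowedValues": list(instance_types),
            "Description": "Instance type for core nodes",
        },
    }


def is_valid_choice(parameters: Dict[str, Any], name: str, value: Any) -> bool:
    """Local check that mirrors the constraints declared for one parameter."""
    spec = parameters[name]
    if "AllowedValues" in spec:
        return value in spec["AllowedValues"]
    low = spec.get("MinValue", float("-inf"))
    high = spec.get("MaxValue", float("inf"))
    return low <= value <= high


PARAMETERS = build_parameters(MAX_CORE_NODES, ALLOWED_INSTANCE_TYPES)
TEMPLATE_FRAGMENT = json.dumps({"Parameters": PARAMETERS}, indent=2)
```

In a full template, the `AWS::EMR::Cluster` resource would reference these parameters (for example with `Ref`) for its core instance group. CloudFormation rejects values outside the declared bounds, so `is_valid_choice` only serves here to make the constraint logic visible.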