Getting started - Amazon EMR

Getting started

This topic helps you get started using Amazon EMR on EKS by deploying a Spark Python application on a virtual cluster.

Before you begin, make sure that you have completed the steps in Setting up.

Note

When you create an EKS cluster with a managed node group, use m5.xlarge or an instance type with an equivalent or greater number of vCPUs so that the sample application runs successfully.

You will need the following information from the setup steps:

  • Virtual cluster ID for the Amazon EKS cluster and Kubernetes namespace registered with Amazon EMR

  • Name of the IAM role used for job execution

  • Release label for the Amazon EMR release (for example, emr-6.2.0-latest)

  • Optionally, destination targets for logging and monitoring:

    • Amazon CloudWatch Log group name and log stream prefix

    • Amazon S3 location to store event and container logs

Important

If you use Amazon CloudWatch or Amazon S3 for monitoring and logging, the IAM policy associated with the IAM role for job execution must have the required permissions to access the target resources. If the IAM policy doesn't have the required permissions, you must follow the steps outlined in Update the trust policy of the job execution role before running this sample job.
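As a rough illustration of what such permissions might look like, the sketch below writes a permissions policy document to a local file and defines a helper that attaches it to the job execution role as an inline policy. The file name job-execution-policy.json, the policy name emr-on-eks-logging-access, the bucket name amzn-s3-demo-bucket, and the helper name attach_logging_policy are all hypothetical. The action list and resource scoping are assumptions rather than the official policy, so check them against the setup steps before using them.

```shell
# Assumed example policy: lets the job write logs to one S3 bucket and to
# CloudWatch Logs. Bucket name and resource scoping are placeholders.
cat > job-execution-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject"],
      "Resource": "arn:aws:s3:::amzn-s3-demo-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::amzn-s3-demo-bucket"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "logs:DescribeLogGroups",
        "logs:DescribeLogStreams"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    }
  ]
}
EOF

# Hypothetical helper: attach the document above as an inline policy.
# $1 is the name of the IAM role used for job execution.
attach_logging_policy() {
  aws iam put-role-policy \
    --role-name "$1" \
    --policy-name emr-on-eks-logging-access \
    --policy-document file://job-execution-policy.json
}
```

Calling attach_logging_policy with the name of your job execution role applies the policy. Narrowing the Resource entries to your own log group and bucket is usually preferable to the broad patterns shown here.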

Run a Spark Python application

In this tutorial, you will run a simple pi.py Spark Python application on Amazon EMR on EKS. The application is bundled with Amazon EMR releases.

The logs from the Spark driver and executors are pushed to Amazon CloudWatch Logs, which you can use to monitor your jobs. This tutorial assumes that the IAM role supplied as the execution-role-arn has permissions to write to the log_stream_prefix under log_group_name.

You can initiate the sample application using the following command.

aws emr-containers start-job-run \
  --virtual-cluster-id cluster_id \
  --name sample-job-name \
  --execution-role-arn execution-role-arn \
  --release-label emr-6.2.0-latest \
  --job-driver '{"sparkSubmitJobDriver": {"entryPoint": "local:///usr/lib/spark/examples/src/main/python/pi.py","sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"}}' \
  --configuration-overrides '{"monitoringConfiguration": {"cloudWatchMonitoringConfiguration": {"logGroupName": "log_group_name", "logStreamNamePrefix": "log_stream_prefix"}}}'

You can also create a JSON file with specified parameters for your job run. Then run the start-job-run command with a path to the JSON file. For more information, see Submit a job run. For more details about configuring job run parameters, see Options for configuring a job run.
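The JSON-file approach could look like the following minimal sketch. It writes the same parameters used in the sample command to a request file and defines a helper that passes the file to start-job-run through the AWS CLI's --cli-input-json option. The file name start-job-run-request.json and the helper name submit_job_from_file are assumptions made for illustration. The values cluster_id, execution-role-arn, log_group_name, and log_stream_prefix are the tutorial's placeholders and must be replaced with your own.

```shell
# Write the job run request to a file. Keys mirror the CLI options of
# start-job-run in camelCase; values are the tutorial's placeholders.
cat > start-job-run-request.json <<'EOF'
{
  "name": "sample-job-name",
  "virtualClusterId": "cluster_id",
  "executionRoleArn": "execution-role-arn",
  "releaseLabel": "emr-6.2.0-latest",
  "jobDriver": {
    "sparkSubmitJobDriver": {
      "entryPoint": "local:///usr/lib/spark/examples/src/main/python/pi.py",
      "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"
    }
  },
  "configurationOverrides": {
    "monitoringConfiguration": {
      "cloudWatchMonitoringConfiguration": {
        "logGroupName": "log_group_name",
        "logStreamNamePrefix": "log_stream_prefix"
      }
    }
  }
}
EOF

# Hypothetical helper: submit the job run described by the request file.
submit_job_from_file() {
  aws emr-containers start-job-run \
    --cli-input-json file://./start-job-run-request.json
}
```

Running submit_job_from_file from the directory that contains the file submits the job. Keeping the request in a file makes the long JSON arguments easier to review and reuse than quoting them inline.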

To monitor the progress of the job or to debug failures, you can inspect logs uploaded to CloudWatch Logs.

  1. Open the CloudWatch console at https://console.aws.amazon.com/cloudwatch/.

  2. In the navigation pane, choose Logs. Then choose Log groups.

  3. Choose the log group for Amazon EMR on EKS and then view the uploaded log events.


Figure: Monitoring using CloudWatch Logs
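If you prefer the command line to the console, the same checks can be sketched with the AWS CLI. The helpers below are illustrative only: the function names job_run_state and list_job_log_streams and the three shell variables are assumptions, and the variable values are the tutorial's placeholders. The job run ID passed to job_run_state is the id field returned by start-job-run.

```shell
# Placeholders from the tutorial; replace with your own values.
VIRTUAL_CLUSTER_ID="cluster_id"
LOG_GROUP_NAME="log_group_name"
LOG_STREAM_PREFIX="log_stream_prefix"

# Print the current state of a job run (for example RUNNING or COMPLETED).
# $1 is the job run ID returned by start-job-run.
job_run_state() {
  aws emr-containers describe-job-run \
    --virtual-cluster-id "$VIRTUAL_CLUSTER_ID" \
    --id "$1" \
    --query 'jobRun.state' \
    --output text
}

# List the names of the log streams written under the configured prefix.
list_job_log_streams() {
  aws logs describe-log-streams \
    --log-group-name "$LOG_GROUP_NAME" \
    --log-stream-name-prefix "$LOG_STREAM_PREFIX" \
    --query 'logStreams[].logStreamName' \
    --output text
}
```

For example, job_run_state followed by your job run ID prints a single state value, and list_job_log_streams shows which driver and executor streams are available to inspect.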