Getting started with Amazon EMR Serverless

This tutorial helps you get started with EMR Serverless when you deploy a sample Spark or Hive workload. You'll create, run, and debug your own application. We show default options in most parts of this tutorial.

Prerequisites

Before you launch an EMR Serverless application, complete the following tasks.

Grant permissions to use EMR Serverless

To use EMR Serverless, you need a user or IAM role with an attached policy that grants permissions for EMR Serverless. To create a user and attach the appropriate policy to that user, follow the instructions in Grant permissions.

Prepare storage for EMR Serverless

In this tutorial, you'll use an S3 bucket to store output files and logs from the sample Spark or Hive workload that you'll run using an EMR Serverless application. To create a bucket, follow the instructions in Creating a bucket in the Amazon Simple Storage Service Console User Guide. Replace any further reference to DOC-EXAMPLE-BUCKET with the name of the newly created bucket.
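
If you prefer the AWS CLI, you can create the bucket with a command like the following sketch. The bucket name and Region here are placeholders; replace them with your own values.

# Create the output and log bucket for this tutorial.
aws s3 mb s3://DOC-EXAMPLE-BUCKET --region us-east-1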

Create a job runtime role

Job runs in EMR Serverless use a runtime role that provides granular permissions to specific AWS services and resources at runtime. In this tutorial, a public S3 bucket hosts the data and scripts. The bucket DOC-EXAMPLE-BUCKET stores the output.

To set up a job runtime role, first create a runtime role with a trust policy so that EMR Serverless can use the new role. Next, attach the required S3 access policy to that role. The following steps guide you through the process.

Console
  1. Navigate to the IAM console at https://console.aws.amazon.com/iam/.

  2. In the left navigation pane, choose Roles.

  3. Choose Create role.

  4. For role type, choose Custom trust policy and paste the following trust policy. This allows jobs submitted to your Amazon EMR Serverless applications to access other AWS services on your behalf.

    { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": "emr-serverless.amazonaws.com" }, "Action": "sts:AssumeRole" } ] }
  5. Choose Next to navigate to the Add permissions page, then choose Create policy.

  6. The Create policy page opens in a new tab. Paste the following policy JSON.

    Important

    Replace DOC-EXAMPLE-BUCKET in the policy below with the actual bucket name created in Prepare storage for EMR Serverless. This is a basic policy for S3 access. For more job runtime role examples, see Job runtime roles.

    { "Version": "2012-10-17", "Statement": [ { "Sid": "ReadAccessForEMRSamples", "Effect": "Allow", "Action": [ "s3:GetObject", "s3:ListBucket" ], "Resource": [ "arn:aws:s3:::*.elasticmapreduce", "arn:aws:s3:::*.elasticmapreduce/*" ] }, { "Sid": "FullAccessToOutputBucket", "Effect": "Allow", "Action": [ "s3:PutObject", "s3:GetObject", "s3:ListBucket", "s3:DeleteObject" ], "Resource": [ "arn:aws:s3:::DOC-EXAMPLE-BUCKET", "arn:aws:s3:::DOC-EXAMPLE-BUCKET/*" ] }, { "Sid": "GlueCreateAndReadDataCatalog", "Effect": "Allow", "Action": [ "glue:GetDatabase", "glue:CreateDatabase", "glue:GetDataBases", "glue:CreateTable", "glue:GetTable", "glue:UpdateTable", "glue:DeleteTable", "glue:GetTables", "glue:GetPartition", "glue:GetPartitions", "glue:CreatePartition", "glue:BatchCreatePartition", "glue:GetUserDefinedFunctions" ], "Resource": ["*"] } ] }
  7. On the Review policy page, enter a name for your policy, such as EMRServerlessS3AndGlueAccessPolicy, and choose Create policy.

  8. Return to the tab with the Add permissions page, refresh the policy list, and choose EMRServerlessS3AndGlueAccessPolicy. Then choose Next.

  9. On the Name, review, and create page, for Role name, enter a name for your role, for example EMRServerlessS3RuntimeRole. To create this IAM role, choose Create role.

CLI
  1. Create a file named emr-serverless-trust-policy.json that contains the trust policy to use for the IAM role. The file should contain the following policy.

    { "Version": "2012-10-17", "Statement": [{ "Sid": "EMRServerlessTrustPolicy", "Action": "sts:AssumeRole", "Effect": "Allow", "Principal": { "Service": "emr-serverless.amazonaws.com" } }] }
  2. Create an IAM role named EMRServerlessS3RuntimeRole. Use the trust policy that you created in the previous step.

    aws iam create-role \
        --role-name EMRServerlessS3RuntimeRole \
        --assume-role-policy-document file://emr-serverless-trust-policy.json

    Note the ARN in the output. You'll use the ARN of the new role during job submission; the steps that follow refer to it as job-role-arn.

  3. Create a file named emr-sample-access-policy.json that defines the IAM policy for your workload. This provides read access to the script and data stored in public S3 buckets and read-write access to DOC-EXAMPLE-BUCKET.

    Important

    Replace DOC-EXAMPLE-BUCKET in the policy below with the actual bucket name created in Prepare storage for EMR Serverless. This is a basic policy for AWS Glue and S3 access. For more job runtime role examples, see Job runtime roles.

    { "Version": "2012-10-17", "Statement": [ { "Sid": "ReadAccessForEMRSamples", "Effect": "Allow", "Action": [ "s3:GetObject", "s3:ListBucket" ], "Resource": [ "arn:aws:s3:::*.elasticmapreduce", "arn:aws:s3:::*.elasticmapreduce/*" ] }, { "Sid": "FullAccessToOutputBucket", "Effect": "Allow", "Action": [ "s3:PutObject", "s3:GetObject", "s3:ListBucket", "s3:DeleteObject" ], "Resource": [ "arn:aws:s3:::DOC-EXAMPLE-BUCKET", "arn:aws:s3:::DOC-EXAMPLE-BUCKET/*" ] }, { "Sid": "GlueCreateAndReadDataCatalog", "Effect": "Allow", "Action": [ "glue:GetDatabase", "glue:CreateDatabase", "glue:GetDataBases", "glue:CreateTable", "glue:GetTable", "glue:UpdateTable", "glue:DeleteTable", "glue:GetTables", "glue:GetPartition", "glue:GetPartitions", "glue:CreatePartition", "glue:BatchCreatePartition", "glue:GetUserDefinedFunctions" ], "Resource": ["*"] } ] }
  4. Create an IAM policy named EMRServerlessS3AndGlueAccessPolicy with the policy file that you created in the previous step.

    aws iam create-policy \
        --policy-name EMRServerlessS3AndGlueAccessPolicy \
        --policy-document file://emr-sample-access-policy.json

    Note the new policy's ARN in the output. You'll substitute it for policy-arn in the next step.

  5. Attach the IAM policy EMRServerlessS3AndGlueAccessPolicy to the job runtime role EMRServerlessS3RuntimeRole.

    aws iam attach-role-policy \
        --role-name EMRServerlessS3RuntimeRole \
        --policy-arn policy-arn
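
As a quick check, you can confirm the setup, and look up the role ARN again later, with commands like the following sketch.

# Print the role ARN to use as job-role-arn during job submission.
aws iam get-role \
    --role-name EMRServerlessS3RuntimeRole \
    --query Role.Arn --output text

# Confirm that EMRServerlessS3AndGlueAccessPolicy is attached to the role.
aws iam list-attached-role-policies \
    --role-name EMRServerlessS3RuntimeRole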

Getting started from the console

Step 1: Create an EMR Serverless application

Create a new application with EMR Serverless as follows.

  1. Sign in to the AWS Management Console and open the Amazon EMR console at https://console.aws.amazon.com/emr.

  2. In the left navigation pane, choose Serverless to navigate to the EMR Serverless landing page.

  3. On the landing page, choose the Get started option.

  4. To create or manage EMR Serverless applications, you need the EMR Studio UI. If you don't have an EMR Studio in the AWS Region where you're creating an application, we create an EMR Studio for you as part of this step. Choose Create and launch Studio to create the Studio and open it.

  5. On the next page, enter the name, type, and release version of your application. For this tutorial, keep the default settings; you can change them later. With these settings, your application has pre-initialized capacity that's ready to run a single job, and it can scale up as needed. Choose Create application to create your first application. This takes you to the Application details page in EMR Studio, which you will use in Step 2: Submit a job run to submit a job run.

Step 2: Submit a job run

Spark

In this tutorial, we use a PySpark script to compute the number of occurrences of unique words across multiple text files. A public, read-only S3 bucket stores both the script and the dataset.

To run a Spark job
  1. Upload the sample script wordcount.py into your new bucket with the following command.

    aws s3 cp s3://us-east-1.elasticmapreduce/emr-containers/samples/wordcount/scripts/wordcount.py s3://DOC-EXAMPLE-BUCKET/scripts/
  2. Completing Step 1: Create an EMR Serverless application takes you to the Application details page in EMR Studio. There, choose the Submit job option.

  3. On the Submit job page, complete the following.

    • In the Name field, enter the name that you want to call your job run.

    • In the Runtime role field, enter the name of the role that you created in Create a job runtime role.

    • In the Script location field, enter s3://DOC-EXAMPLE-BUCKET/scripts/wordcount.py as the S3 URI.

    • In the Script arguments field, enter ["s3://DOC-EXAMPLE-BUCKET/emr-serverless-spark/output"].

    • In the Spark properties section, choose Edit as text and enter the following configurations.

      --conf spark.executor.cores=1 --conf spark.executor.memory=4g --conf spark.driver.cores=1 --conf spark.driver.memory=4g --conf spark.executor.instances=1
  4. To start the job run, choose Submit job.

  5. In the Job runs tab, you should see your new job run with a Running status.

Hive

In this part of the tutorial, we create a table, insert a few records, and run a count aggregation query. To run the Hive job, first create a file that contains all of the Hive queries to run as part of a single job, upload the file to S3, and specify this S3 path when you start the Hive job.

To run a Hive job
  1. Create a file called hive-query.ql that contains all the queries that you want to run in your Hive job.

    create database if not exists emrserverless;
    use emrserverless;
    create table if not exists test_table(id int);
    drop table if exists Values__Tmp__Table__1;
    insert into test_table values (1),(2),(2),(3),(3),(3);
    select id, count(id) from test_table group by id order by id desc;
  2. Upload hive-query.ql to your S3 bucket with the following command.

    aws s3 cp hive-query.ql s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/query/hive-query.ql
  3. Completing Step 1: Create an EMR Serverless application takes you to the Application details page in EMR Studio. There, choose the Submit job option.

  4. On the Submit job page, complete the following.

    • In the Name field, enter the name that you want to call your job run.

    • In the Runtime role field, enter the name of the role that you created in Create a job runtime role.

    • In the Script location field, enter s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/query/hive-query.ql as the S3 URI.

    • In the Hive properties section, choose Edit as text, and enter the following configurations.

      --hiveconf hive.log.explain.output=false
    • In the Job configuration section, choose Edit as JSON, and enter the following JSON.

      { "applicationConfiguration": [{ "classification": "hive-site", "properties": { "hive.exec.scratchdir": "s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/hive/scratch", "hive.metastore.warehouse.dir": "s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/hive/warehouse", "hive.driver.cores": "2", "hive.driver.memory": "4g", "hive.tez.container.size": "4096", "hive.tez.cpu.vcores": "1" } }] }
  5. To start the job run, choose Submit job.

  6. In the Job runs tab, you should see your new job run with a Running status.

Step 3: View application UI and logs

To view the application UI, first identify the job run. Based on the job type, an option for Spark UI or Hive Tez UI is available in the first row of options for that job run. Choose the appropriate option.

If you chose the Spark UI, choose the Executors tab to view the driver and executor logs. If you chose the Hive Tez UI, choose the All Tasks tab to view the logs.

Once the job run status shows as Success, you can view the output of the job in your S3 bucket.
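
For example, if you submitted the Spark job with the script argument shown earlier, a listing like the following sketch shows the output files under the prefix that you passed as the output path.

# List the word count output written by the sample Spark job.
aws s3 ls s3://DOC-EXAMPLE-BUCKET/emr-serverless-spark/output/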

Step 4: Clean up

While the application you created should auto-stop after 15 minutes of inactivity, we still recommend that you release resources that you don't intend to use again.

To delete the application, navigate to the List applications page. Select the application that you created and choose Actions → Stop to stop the application. After the application is in the STOPPED state, select the same application and choose Actions → Delete.

For more examples of running Spark and Hive jobs, see Spark jobs and Hive jobs.

Getting started from the AWS CLI

Step 1: Create an EMR Serverless application

Use the emr-serverless create-application command to create your first EMR Serverless application. You need to specify the application type and the Amazon EMR release label associated with the application version that you want to use. The name of the application is optional.

Spark

To create a Spark application, run the following command.

aws emr-serverless create-application \
    --release-label emr-6.6.0 \
    --type "SPARK" \
    --name my-application

Hive

To create a Hive application, run the following command.

aws emr-serverless create-application \
    --release-label emr-6.6.0 \
    --type "HIVE" \
    --name my-application

Note the application ID returned in the output. You'll use the ID to start the application and during job submission; the steps that follow refer to it as application-id.
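
If you script these steps, you can capture the application ID directly with the --query and --output options, as in the following sketch.

# Store the new application's ID in a shell variable.
APPLICATION_ID=$(aws emr-serverless create-application \
    --release-label emr-6.6.0 \
    --type "SPARK" \
    --name my-application \
    --query applicationId --output text)
echo "$APPLICATION_ID"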

Before you move on to Step 2: Submit a job run to your EMR Serverless application, make sure that your application has reached the CREATED state with the get-application API.

aws emr-serverless get-application \
    --application-id application-id
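
To print only the state, you can add a --query filter, as in this sketch.

# Prints CREATED once the application is ready.
aws emr-serverless get-application \
    --application-id application-id \
    --query application.state --output text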

EMR Serverless creates workers to accommodate your requested jobs. By default, these are created on demand, but you can also specify a pre-initialized capacity by setting the initialCapacity parameter when you create the application. You can also limit the total maximum capacity that an application can use with the maximumCapacity parameter. To learn more about these options, see Configuring an application.
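
As a sketch of those options, the following command creates a Spark application with one pre-initialized driver and one pre-initialized executor, and caps the total capacity that the application can scale to. The worker sizes and limits here are illustrative values, not recommendations.

aws emr-serverless create-application \
    --release-label emr-6.6.0 \
    --type "SPARK" \
    --name my-preinitialized-application \
    --initial-capacity '{
        "DRIVER": {
            "workerCount": 1,
            "workerConfiguration": { "cpu": "2vCPU", "memory": "4GB" }
        },
        "EXECUTOR": {
            "workerCount": 1,
            "workerConfiguration": { "cpu": "2vCPU", "memory": "4GB" }
        }
    }' \
    --maximum-capacity '{ "cpu": "8vCPU", "memory": "32GB" }'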

Step 2: Submit a job run to your EMR Serverless application

Now your EMR Serverless application is ready to run jobs.

Spark

In this step, we use a PySpark script to compute the number of occurrences of unique words across multiple text files. A public, read-only S3 bucket stores both the script and the dataset. The application sends the output file and the log data from the Spark runtime to /output and /logs directories in the S3 bucket that you created.

To run a Spark job
  1. Use the following command to copy the sample script into your new bucket.

    aws s3 cp s3://us-east-1.elasticmapreduce/emr-containers/samples/wordcount/scripts/wordcount.py s3://DOC-EXAMPLE-BUCKET/scripts/
  2. In the following command, substitute application-id with your application ID. Substitute job-role-arn with the runtime role ARN that you created in Create a job runtime role. Substitute job-run-name with the name that you want to call your job run. Replace all DOC-EXAMPLE-BUCKET strings with the Amazon S3 bucket that you created; the /output and /logs suffixes create new folders in your bucket where EMR Serverless can write the output files and logs of your application.

    aws emr-serverless start-job-run \
        --application-id application-id \
        --execution-role-arn job-role-arn \
        --name job-run-name \
        --job-driver '{
            "sparkSubmit": {
                "entryPoint": "s3://DOC-EXAMPLE-BUCKET/scripts/wordcount.py",
                "entryPointArguments": ["s3://DOC-EXAMPLE-BUCKET/emr-serverless-spark/output"],
                "sparkSubmitParameters": "--conf spark.executor.cores=1 --conf spark.executor.memory=4g --conf spark.driver.cores=1 --conf spark.driver.memory=4g --conf spark.executor.instances=1"
            }
        }' \
        --configuration-overrides '{
            "monitoringConfiguration": {
                "s3MonitoringConfiguration": {
                    "logUri": "s3://DOC-EXAMPLE-BUCKET/emr-serverless-spark/logs"
                }
            }
        }'
  3. Note the job run ID returned in the output. Replace job-run-id with this ID in the following steps.
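
If you script the submission, you can capture the job run ID with a --query filter, as in the following sketch. It assumes that APPLICATION_ID and JOB_ROLE_ARN are shell variables that you set earlier, and that spark-job-driver.json is a hypothetical file containing the same sparkSubmit JSON shown above; other options, such as --configuration-overrides, work the same way.

# Submit the job and store its run ID in a shell variable.
JOB_RUN_ID=$(aws emr-serverless start-job-run \
    --application-id "$APPLICATION_ID" \
    --execution-role-arn "$JOB_ROLE_ARN" \
    --name job-run-name \
    --job-driver file://spark-job-driver.json \
    --query jobRunId --output text)
echo "$JOB_RUN_ID"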

Hive

In this tutorial, we create a table, insert a few records, and run a count aggregation query. To run the Hive job, first create a file that contains all of the Hive queries to run as part of a single job, upload the file to S3, and specify this S3 path when you start the Hive job.

To run a Hive job
  1. Create a file called hive-query.ql that contains all the queries that you want to run in your Hive job.

    create database if not exists emrserverless;
    use emrserverless;
    create table if not exists test_table(id int);
    drop table if exists Values__Tmp__Table__1;
    insert into test_table values (1),(2),(2),(3),(3),(3);
    select id, count(id) from test_table group by id order by id desc;
  2. Upload hive-query.ql to your S3 bucket with the following command.

    aws s3 cp hive-query.ql s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/query/hive-query.ql
  3. In the following command, substitute application-id with your own application ID. Substitute job-role-arn with the runtime role ARN you created in Create a job runtime role. Replace all DOC-EXAMPLE-BUCKET strings with the Amazon S3 bucket that you created, and add /output and /logs to the path. This creates new folders in your bucket, where EMR Serverless can copy the output and log files of your application.

    aws emr-serverless start-job-run \
        --application-id application-id \
        --execution-role-arn job-role-arn \
        --job-driver '{
            "hive": {
                "query": "s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/query/hive-query.ql",
                "parameters": "--hiveconf hive.log.explain.output=false"
            }
        }' \
        --configuration-overrides '{
            "applicationConfiguration": [{
                "classification": "hive-site",
                "properties": {
                    "hive.exec.scratchdir": "s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/hive/scratch",
                    "hive.metastore.warehouse.dir": "s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/hive/warehouse",
                    "hive.driver.cores": "2",
                    "hive.driver.memory": "4g",
                    "hive.tez.container.size": "4096",
                    "hive.tez.cpu.vcores": "1"
                }
            }],
            "monitoringConfiguration": {
                "s3MonitoringConfiguration": {
                    "logUri": "s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/logs"
                }
            }
        }'
  4. Note the job run ID returned in the output. Replace job-run-id with this ID in the following steps.

Step 3: Review your job run's output

The job run typically takes 3-5 minutes to complete.
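
Both the Spark and Hive variants below use the get-job-run command. If you're scripting the tutorial, one way to wait for completion is to poll that command until the job reaches a terminal state, as in the following sketch; substitute application-id and job-run-id with the values that you noted earlier.

# Poll every 30 seconds until the job run reaches a terminal state.
while true; do
    STATE=$(aws emr-serverless get-job-run \
        --application-id application-id \
        --job-run-id job-run-id \
        --query jobRun.state --output text)
    echo "Job run state: $STATE"
    case "$STATE" in
        SUCCESS|FAILED|CANCELLED) break ;;
    esac
    sleep 30
done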

Spark

You can check the state of your Spark job with the following command.

aws emr-serverless get-job-run \
    --application-id application-id \
    --job-run-id job-run-id

With your log destination set to s3://DOC-EXAMPLE-BUCKET/emr-serverless-spark/logs, you can find the logs for this specific job run under s3://DOC-EXAMPLE-BUCKET/emr-serverless-spark/logs/applications/application-id/jobs/job-run-id.

For Spark applications, EMR Serverless pushes event logs every 30 seconds to the sparklogs folder in your S3 log destination. When your job completes, Spark runtime logs for the driver and executors upload to folders named by the worker type, such as driver or executor. The output of the PySpark job uploads to s3://DOC-EXAMPLE-BUCKET/emr-serverless-spark/output/.
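
To browse the uploaded logs and output after the job completes, listings like the following sketch work.

# Browse the Spark runtime logs for this job run.
aws s3 ls s3://DOC-EXAMPLE-BUCKET/emr-serverless-spark/logs/applications/application-id/jobs/job-run-id/ --recursive

# Browse the word count output.
aws s3 ls s3://DOC-EXAMPLE-BUCKET/emr-serverless-spark/output/ --recursive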

Hive

You can check the state of your Hive job with the following command.

aws emr-serverless get-job-run \
    --application-id application-id \
    --job-run-id job-run-id

With your log destination set to s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/logs, you can find the logs for this specific job run under s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/logs/applications/application-id/jobs/job-run-id.

For Hive applications, EMR Serverless continuously uploads the Hive driver logs to the HIVE_DRIVER folder, and Tez task logs to the TEZ_TASK folder, of your S3 log destination. After the job run reaches the SUCCESS state, the output of your Hive query becomes available in the Amazon S3 location that you specified in the monitoringConfiguration field of configurationOverrides.
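
For example, a listing like the following sketch shows the Hive driver logs for a job run; the driver's stdout typically contains the query results, though the exact file names can vary by release.

# List the Hive driver logs for this job run.
aws s3 ls s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/logs/applications/application-id/jobs/job-run-id/HIVE_DRIVER/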

Step 4: Clean up

When you’re done working with this tutorial, consider deleting the resources that you created. We recommend that you release resources that you don't intend to use again.

Delete your application

To delete an application, use the following command.

aws emr-serverless delete-application \
    --application-id application-id
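
You can only delete an application that's in the STOPPED or CREATED state. If the delete call fails because the application is still started, stop it first with the following command, then retry the deletion.

# Stop the application before deleting it.
aws emr-serverless stop-application \
    --application-id application-id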

Delete your S3 log bucket

To delete your S3 logging and output bucket, use the following commands. Replace DOC-EXAMPLE-BUCKET with the actual name of the S3 bucket created in Prepare storage for EMR Serverless.

aws s3 rm s3://DOC-EXAMPLE-BUCKET --recursive

aws s3api delete-bucket --bucket DOC-EXAMPLE-BUCKET

Delete your job runtime role

To delete the runtime role, detach the policy from the role. You can then delete both the role and the policy.

aws iam detach-role-policy \
    --role-name EMRServerlessS3RuntimeRole \
    --policy-arn policy-arn

To delete the role, use the following command.

aws iam delete-role \
    --role-name EMRServerlessS3RuntimeRole

To delete the policy that was attached to the role, use the following command.

aws iam delete-policy \
    --policy-arn policy-arn

For more examples of running Spark and Hive jobs, see Spark jobs and Hive jobs.