Customizing an EMR Serverless image - Amazon EMR

Customizing an EMR Serverless image

Starting with Amazon EMR 6.9.0, you can use custom images with Amazon EMR Serverless to package application dependencies and runtime environments into a single container. This simplifies how you manage workload dependencies and makes your packages more portable. A customized EMR Serverless image provides the following benefits:

  • Installs and configures packages that are optimized for your workloads. These packages might not be widely available in the public distribution of Amazon EMR runtime environments.

  • Integrates EMR Serverless with the build, test, and deployment processes already established within your organization, including local development and testing.

  • Applies established security processes, such as image scanning, that meet compliance and governance requirements within your organization.

  • Lets you use your own versions of JDK and Python for your applications.

EMR Serverless provides images that you can use as your base when you create your own images. The base image provides the essential jars, configuration, and libraries for the image to interact with EMR Serverless. You can find the base image in the Amazon ECR Public Gallery. Use the image that matches your application type (Spark or Hive) and release version. For example, if you create an application on Amazon EMR release 6.9.0, use the following images.

Type     Image
Spark    public.ecr.aws/emr-serverless/spark/emr-6.9.0:latest
Hive     public.ecr.aws/emr-serverless/hive/emr-6.9.0:latest
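
The two URIs in the table appear to follow a common pattern. The following is a minimal sketch, assuming that pattern holds for other release labels; the helper name base_image_uri is hypothetical and the pattern is inferred only from the two examples above, so confirm the image exists in the Amazon ECR Public Gallery before you rely on it.

```python
# Sketch only: the URI pattern is inferred from the two example images in the
# table above, not from a published specification.
SUPPORTED_TYPES = ("spark", "hive")


def base_image_uri(app_type: str, release_label: str) -> str:
    """Return the assumed public base image URI for a type and release label."""
    kind = app_type.lower()
    if kind not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported application type: {app_type!r}")
    return f"public.ecr.aws/emr-serverless/{kind}/{release_label}:latest"


if __name__ == "__main__":
    print(base_image_uri("SPARK", "emr-6.9.0"))
    print(base_image_uri("HIVE", "emr-6.9.0"))
```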

Prerequisites

Before you create an EMR Serverless custom image, complete these prerequisites.

  1. Create an Amazon ECR repository in the same AWS Region that you use to launch EMR Serverless applications. To create an Amazon ECR private repository, see Creating a private repository.

  2. To grant users access to your Amazon ECR repository, add the following policy to the users and roles that create or update EMR Serverless applications with images from this repository. Replace ecr-repository-arn with the ARN of your repository.

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ECRRepositoryListGetPolicy",
                "Effect": "Allow",
                "Action": [
                    "ecr:GetDownloadUrlForLayer",
                    "ecr:BatchGetImage",
                    "ecr:DescribeImages"
                ],
                "Resource": "ecr-repository-arn"
            }
        ]
    }

    For more examples of Amazon ECR identity-based policies, see Amazon Elastic Container Registry identity-based policy examples.
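
If you manage several repositories, you might prefer to generate the policy document rather than edit it by hand. The following is a minimal sketch that reproduces the identity-based policy shown above for a given repository ARN. The function name ecr_pull_policy, the account ID 111122223333, and the Region us-east-1 are placeholders for illustration.

```python
import json


def ecr_pull_policy(repository_arn: str) -> dict:
    """Build the identity-based policy from the prerequisites for one repository."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ECRRepositoryListGetPolicy",
                "Effect": "Allow",
                "Action": [
                    "ecr:GetDownloadUrlForLayer",
                    "ecr:BatchGetImage",
                    "ecr:DescribeImages",
                ],
                "Resource": repository_arn,
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder ARN; substitute your own Region, account ID, and repository.
    arn = "arn:aws:ecr:us-east-1:111122223333:repository/my-repository"
    print(json.dumps(ecr_pull_policy(arn), indent=4))
```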

Step 1: Create a custom image from EMR Serverless base images

First, create a Dockerfile that begins with a FROM instruction that uses your preferred base image. After the FROM instruction, you can include any modification that you want to make to the image. The base image automatically sets the USER to hadoop. The hadoop user might not have permission to make every modification that you include. As a workaround, set the USER to root, modify your image, and then set the USER back to hadoop:hadoop. To see samples for common use cases, see Using custom images with EMR Serverless.

    # Dockerfile
    FROM public.ecr.aws/emr-serverless/spark/emr-6.9.0:latest

    USER root

    # MODIFICATIONS GO HERE

    # EMRS will run the image as hadoop
    USER hadoop:hadoop

After you have the Dockerfile, build the image with the following command.

    # build the docker image
    docker build . -t aws-account-id.dkr.ecr.region.amazonaws.com/my-repository[:tag]or[@digest]
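
The USER root and USER hadoop:hadoop pattern from this step can also be produced by a small script, which may help when you generate Dockerfiles for several images. The following is a sketch only; render_dockerfile is a hypothetical helper, and the RUN line in the usage example is a placeholder modification rather than a recommended package.

```python
from typing import Iterable


def render_dockerfile(base_image: str, modifications: Iterable[str]) -> str:
    """Wrap modifications between USER root and USER hadoop:hadoop."""
    lines = [f"FROM {base_image}", "USER root"]
    lines.extend(modifications)
    # EMR Serverless runs the image as hadoop, so always switch back.
    lines.append("USER hadoop:hadoop")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    dockerfile = render_dockerfile(
        "public.ecr.aws/emr-serverless/spark/emr-6.9.0:latest",
        # Placeholder modification for illustration only.
        ["RUN yum install -y some-package"],
    )
    print(dockerfile, end="")
```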

Step 2: Validate image locally

EMR Serverless provides an offline tool that can statically check your custom image to validate basic files, environment variables, and correct image configurations. For information on how to install and run the tool, see the Amazon EMR Serverless Image CLI GitHub.

After you install the tool, run the following command to validate an image:

    amazon-emr-serverless-image \
        validate-image -r emr-6.9.0 -t spark \
        -i aws-account-id.dkr.ecr.region.amazonaws.com/my-repository:tag/@digest

The output should look similar to the following.

    Amazon EMR Serverless - Image CLI
    Version: 0.0.1
    ... Checking if docker cli is installed
    ... Checking Image Manifest
    [INFO] Image ID: 9e2f4359cf5beb466a8a2ed047ab61c9d37786c555655fc122272758f761b41a
    [INFO] Created On: 2022-12-02T07:46:42.586249984Z
    [INFO] Default User Set to hadoop:hadoop : PASS
    [INFO] Working Directory Set to : PASS
    [INFO] Entrypoint Set to /usr/bin/entrypoint.sh : PASS
    [INFO] HADOOP_HOME is set with value: /usr/lib/hadoop : PASS
    [INFO] HADOOP_LIBEXEC_DIR is set with value: /usr/lib/hadoop/libexec : PASS
    [INFO] HADOOP_USER_HOME is set with value: /home/hadoop : PASS
    [INFO] HADOOP_YARN_HOME is set with value: /usr/lib/hadoop-yarn : PASS
    [INFO] HIVE_HOME is set with value: /usr/lib/hive : PASS
    [INFO] JAVA_HOME is set with value: /etc/alternatives/jre : PASS
    [INFO] TEZ_HOME is set with value: /usr/lib/tez : PASS
    [INFO] YARN_HOME is set with value: /usr/lib/hadoop-yarn : PASS
    [INFO] File Structure Test for hadoop-files in /usr/lib/hadoop: PASS
    [INFO] File Structure Test for hadoop-jars in /usr/lib/hadoop/lib: PASS
    [INFO] File Structure Test for hadoop-yarn-jars in /usr/lib/hadoop-yarn: PASS
    [INFO] File Structure Test for hive-bin-files in /usr/bin: PASS
    [INFO] File Structure Test for hive-jars in /usr/lib/hive/lib: PASS
    [INFO] File Structure Test for java-bin in /etc/alternatives/jre/bin: PASS
    [INFO] File Structure Test for tez-jars in /usr/lib/tez: PASS
    -----------------------------------------------------------------
    Overall Custom Image Validation Succeeded.
    -----------------------------------------------------------------
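
If you run the validation in a build pipeline, you might want to summarize the tool's output programmatically. The following is a rough sketch that relies only on the two features visible in the sample output above: check lines that begin with [INFO] and end with PASS, and the final "Overall Custom Image Validation Succeeded." line. How the tool reports a failed check is not shown in this document, so the sketch makes no assumption about it. The function name summarize_validation is hypothetical.

```python
from typing import Tuple

SUCCESS_LINE = "Overall Custom Image Validation Succeeded."


def summarize_validation(output: str) -> Tuple[int, bool]:
    """Return (number of PASS checks, whether the overall success line appears)."""
    passed = 0
    succeeded = False
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("[INFO]") and line.endswith("PASS"):
            passed += 1
        elif line == SUCCESS_LINE:
            succeeded = True
    return passed, succeeded


if __name__ == "__main__":
    # A few lines copied from the sample output in this step.
    sample = "\n".join(
        [
            "[INFO] Created On: 2022-12-02T07:46:42.586249984Z",
            "[INFO] Default User Set to hadoop:hadoop : PASS",
            "[INFO] HADOOP_HOME is set with value: /usr/lib/hadoop : PASS",
            "Overall Custom Image Validation Succeeded.",
        ]
    )
    print(summarize_validation(sample))
```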

Step 3: Upload the image to your Amazon ECR repository

Push your custom image to your Amazon ECR repository with the following commands. Make sure that you have the IAM permissions required to push the image to your repository. For more information, see Pushing an image in the Amazon ECR User Guide.

    # login to ECR repo
    aws ecr get-login-password --region region | docker login --username AWS --password-stdin aws-account-id.dkr.ecr.region.amazonaws.com

    # push the docker image
    docker push aws-account-id.dkr.ecr.region.amazonaws.com/my-repository:tag/@digest
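
The image URI in these commands is assembled from your account ID, Region, repository name, and either a tag or a digest. The following is a minimal sketch of that assembly, plus a helper that reads the Region back out of a URI so that you can compare it with the Region of your application. Both function names are hypothetical, the account ID is a placeholder, and the regular expression assumes the standard aws-account-id.dkr.ecr.region.amazonaws.com host form used above, so it does not cover other partitions.

```python
import re
from typing import Optional

# Assumed host form, taken from the placeholder used in the commands above.
_REGISTRY = re.compile(r"^(\d{12})\.dkr\.ecr\.([a-z0-9-]+)\.amazonaws\.com/")


def image_reference(account_id: str, region: str, repository: str,
                    tag: Optional[str] = None,
                    digest: Optional[str] = None) -> str:
    """Build a private ECR image URI that ends in either :tag or @digest."""
    if (tag is None) == (digest is None):
        raise ValueError("provide exactly one of tag or digest")
    base = f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}"
    return f"{base}:{tag}" if tag is not None else f"{base}@{digest}"


def registry_region(image_uri: str) -> Optional[str]:
    """Return the Region embedded in a private ECR image URI, or None."""
    match = _REGISTRY.match(image_uri)
    return match.group(2) if match else None


if __name__ == "__main__":
    uri = image_reference("111122223333", "us-east-1", "my-repository", tag="v1")
    print(uri)
    print(registry_region(uri))
```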

Step 4: Create or update an application with custom images

You can create or update the application with either the AWS Management Console or the AWS CLI. Follow the steps for the method that you prefer.

Console
  1. Sign in to the EMR Studio console at https://console.aws.amazon.com/emr. Navigate to your application, or create a new application with the instructions in Create an application.

  2. To specify custom images when you create or update an EMR Serverless application, select Custom settings in the application setup options.

  3. In the Custom image settings section, select the Use the custom image with this application check box.

  4. Paste the Amazon ECR image URI into the Image URI field. EMR Serverless uses this image for all worker types for the application. Alternatively, you can choose Different custom images and paste different Amazon ECR image URIs for each worker type.

CLI
  • Create an application with the image-configuration parameter. EMR Serverless applies this setting to all worker types.

    aws emr-serverless create-application \
        --release-label emr-6.9.0 \
        --type SPARK \
        --image-configuration '{
            "imageUri": "aws-account-id.dkr.ecr.region.amazonaws.com/my-repository:tag/@digest"
        }'

    To create an application with different image settings for each worker type, use the worker-type-specifications parameter.

    aws emr-serverless create-application \
        --release-label emr-6.9.0 \
        --type SPARK \
        --worker-type-specifications '{
            "Driver": {
                "imageConfiguration": {
                    "imageUri": "aws-account-id.dkr.ecr.region.amazonaws.com/my-repository:tag/@digest"
                }
            },
            "Executor": {
                "imageConfiguration": {
                    "imageUri": "aws-account-id.dkr.ecr.region.amazonaws.com/my-repository:tag/@digest"
                }
            }
        }'

    To update an application, use the image-configuration parameter. EMR Serverless applies this setting to all worker types.

    aws emr-serverless update-application \
        --application-id application-id \
        --image-configuration '{
            "imageUri": "aws-account-id.dkr.ecr.region.amazonaws.com/my-repository:tag/@digest"
        }'
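
Quoting nested JSON by hand on the command line is error-prone. The following is a minimal sketch that builds the worker-type-specifications document shown above and renders the create-application command as a safely quoted string. The function names and the example image URIs are hypothetical; the script only prints the command and does not call AWS.

```python
import json
import shlex


def worker_type_specifications(driver_uri: str, executor_uri: str) -> dict:
    """Build the per-worker image settings in the shape shown above."""
    return {
        "Driver": {"imageConfiguration": {"imageUri": driver_uri}},
        "Executor": {"imageConfiguration": {"imageUri": executor_uri}},
    }


def create_application_command(release_label: str, app_type: str, specs: dict) -> str:
    """Render the create-application command with the JSON argument quoted."""
    args = [
        "aws", "emr-serverless", "create-application",
        "--release-label", release_label,
        "--type", app_type,
        "--worker-type-specifications", json.dumps(specs),
    ]
    return shlex.join(args)


if __name__ == "__main__":
    # Placeholder URIs for illustration only.
    specs = worker_type_specifications(
        "111122223333.dkr.ecr.us-east-1.amazonaws.com/my-repository:driver",
        "111122223333.dkr.ecr.us-east-1.amazonaws.com/my-repository:executor",
    )
    print(create_application_command("emr-6.9.0", "SPARK", specs))
```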

Step 5: Allow EMR Serverless to access the custom image repository

Add the following resource policy to the Amazon ECR repository. The policy allows the EMR Serverless service principal to get, describe, and download images from this repository.

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "Emr Serverless Custom Image Support",
                "Effect": "Allow",
                "Principal": {
                    "Service": "emr-serverless.amazonaws.com"
                },
                "Action": [
                    "ecr:BatchGetImage",
                    "ecr:DescribeImages",
                    "ecr:GetDownloadUrlForLayer"
                ],
                "Condition": {
                    "StringEquals": {
                        "aws:SourceArn": "arn:aws:emr-serverless:region:aws-account-id:/applications/application-id"
                    }
                }
            }
        ]
    }

As a security best practice, include the aws:SourceArn condition key in the repository policy. The IAM global condition key aws:SourceArn restricts EMR Serverless to using the repository only on behalf of the application whose ARN you specify. For more information on Amazon ECR repository policies, see Creating a private repository.
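
Because the aws:SourceArn value changes for every application, it can be convenient to generate the repository policy. The following is a minimal sketch that fills in the application ARN using the format shown in the policy above. The function name repository_policy and the example account ID, Region, and application ID are placeholders.

```python
import json


def repository_policy(region: str, account_id: str, application_id: str) -> dict:
    """Build the ECR repository policy scoped to one EMR Serverless application."""
    source_arn = (
        f"arn:aws:emr-serverless:{region}:{account_id}:/applications/{application_id}"
    )
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "Emr Serverless Custom Image Support",
                "Effect": "Allow",
                "Principal": {"Service": "emr-serverless.amazonaws.com"},
                "Action": [
                    "ecr:BatchGetImage",
                    "ecr:DescribeImages",
                    "ecr:GetDownloadUrlForLayer",
                ],
                "Condition": {"StringEquals": {"aws:SourceArn": source_arn}},
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    policy = repository_policy("us-east-1", "111122223333", "my-application-id")
    print(json.dumps(policy, indent=4))
```

One way to apply the printed document is to save it to a file and pass it to the aws ecr set-repository-policy command with the --policy-text option.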

Considerations and limitations

When you work with custom images, consider the following:

  • Use the correct base image that matches the type (Spark or Hive) and release label (for example, emr-6.9.0) for your application.

  • EMR Serverless ignores CMD and ENTRYPOINT instructions in the Dockerfile. Use common Dockerfile instructions instead, such as COPY, RUN, and WORKDIR.

  • Don't modify the environment variables JAVA_HOME, SPARK_HOME, HIVE_HOME, or TEZ_HOME when you create a custom image.

  • Custom images can't exceed 5 GB in size.

  • Modifying binaries or jars in the Amazon EMR base images might cause application or job launch failures.

  • The Amazon ECR repository should be in the same AWS Region that you use to launch EMR Serverless applications.
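
Two of these considerations can be checked mechanically before you build. The following is a rough sketch of a Dockerfile check that flags CMD and ENTRYPOINT instructions and ENV instructions that set one of the listed variables. The function name lint_dockerfile is hypothetical, and the check is a simple line-based heuristic: it does not follow line continuations and does not detect variables changed inside RUN commands.

```python
from typing import List

IGNORED_INSTRUCTIONS = {"CMD", "ENTRYPOINT"}
PROTECTED_VARIABLES = {"JAVA_HOME", "SPARK_HOME", "HIVE_HOME", "TEZ_HOME"}


def lint_dockerfile(text: str) -> List[str]:
    """Return warnings for instructions that conflict with the considerations."""
    warnings = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        instruction = parts[0].upper()
        arguments = parts[1] if len(parts) > 1 else ""
        if instruction in IGNORED_INSTRUCTIONS:
            warnings.append(
                f"line {number}: {instruction} is ignored by EMR Serverless"
            )
        elif instruction == "ENV":
            # Handles both "ENV NAME=value ..." and "ENV NAME value" forms.
            for token in arguments.split():
                name = token.split("=", 1)[0]
                if name in PROTECTED_VARIABLES:
                    warnings.append(f"line {number}: ENV changes {name}")
    return warnings


if __name__ == "__main__":
    sample = "\n".join(
        [
            "FROM public.ecr.aws/emr-serverless/spark/emr-6.9.0:latest",
            "USER root",
            "ENV JAVA_HOME=/opt/custom-jdk",
            'ENTRYPOINT ["/bin/sh"]',
            "USER hadoop:hadoop",
        ]
    )
    for warning in lint_dockerfile(sample):
        print(warning)
```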