Customizing an EMR Serverless image
Starting with Amazon EMR 6.9.0, you can use custom images to package application dependencies and runtime environments into a single container with Amazon EMR Serverless. This simplifies how you manage workload dependencies and makes your packages more portable. Customizing your EMR Serverless image provides the following benefits:
- Installs and configures packages that are optimized for your workloads. These packages might not be widely available in the public distribution of Amazon EMR runtime environments.
- Integrates EMR Serverless with the build, test, and deployment processes already established within your organization, including local development and testing.
- Applies established security processes, such as image scanning, that meet compliance and governance requirements within your organization.
- Lets you use your own versions of JDK and Python for your applications.
EMR Serverless provides images that you can use as your base when you create your own images. The base image provides the essential jars, configuration, and libraries for the image to interact with EMR Serverless. You can find the base image in the Amazon ECR Public Gallery.

Type | Image |
---|---|
Spark | |
Hive | |
Prerequisites
Before you create an EMR Serverless custom image, complete these prerequisites.
- Create an Amazon ECR repository in the same AWS Region that you use to launch EMR Serverless applications. To create an Amazon ECR private repository, see Creating a private repository.
- To grant users access to your Amazon ECR repository, add the following policies to users and roles that create or update EMR Serverless applications with images from this repository.

      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Sid": "ECRRepositoryListGetPolicy",
            "Effect": "Allow",
            "Action": [
              "ecr:GetDownloadUrlForLayer",
              "ecr:BatchGetImage",
              "ecr:DescribeImages"
            ],
            "Resource": "ecr-repository-arn"
          }
        ]
      }

  For more examples of Amazon ECR identity-based policies, see Amazon Elastic Container Registry identity-based policy examples.
Step 1: Create a custom image from EMR Serverless base images
First, create a Dockerfile that begins with a FROM instruction that uses your preferred base image. After the FROM instruction, you can include any modification that you want to make to the image. The base image automatically sets the USER to hadoop. This user might not have permissions for all the modifications you include. As a workaround, set the USER to root, modify your image, and then set the USER back to hadoop:hadoop. To see samples for common use cases, see Using custom images with EMR Serverless.

    # Dockerfile
    FROM public.ecr.aws/emr-serverless/spark/emr-6.9.0:latest

    USER root
    # MODIFICATIONS GO HERE

    # EMRS will run the image as hadoop
    USER hadoop:hadoop

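As one possible illustration of the pattern described above, the following shell sketch writes a Dockerfile that installs a single extra Python package. The function name `write_dockerfile` and the choice of `pandas` are hypothetical, and the sketch assumes that `pip3` is available in the base image; substitute your own modifications.

```shell
# Write a sample Dockerfile to the path given as the first argument.
# Assumption: pip3 exists in the base image. The package is only an example.
write_dockerfile() {
  cat > "$1" <<'EOF'
FROM public.ecr.aws/emr-serverless/spark/emr-6.9.0:latest

# Switch to root so that the installation has the permissions it needs.
USER root

# Example modification (hypothetical choice of package).
RUN pip3 install pandas

# EMR Serverless runs the image as hadoop, so switch back.
USER hadoop:hadoop
EOF
}
```

Calling `write_dockerfile ./Dockerfile` produces a file that you can then build as described in this step.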
After you have the Dockerfile, build the image with the following command.

    # build the docker image
    docker build . -t aws-account-id.dkr.ecr.region.amazonaws.com/my-repository[:tag]or[@digest]

Step 2: Validate image locally
EMR Serverless provides an offline tool that can statically check your custom image to validate basic files, environment variables, and correct image configurations. For information on how to install and run the tool, see the Amazon EMR Serverless Image CLI GitHub repository.
After you install the tool, run the following command to validate an image:

    amazon-emr-serverless-image \
    validate-image -r emr-6.9.0 -t spark \
    -i aws-account-id.dkr.ecr.region.amazonaws.com/my-repository:tag/@digest

You should see an output similar to the following.

    Amazon EMR Serverless - Image CLI
    Version: 0.0.1
    ... Checking if docker cli is installed
    ... Checking Image Manifest
    [INFO] Image ID: 9e2f4359cf5beb466a8a2ed047ab61c9d37786c555655fc122272758f761b41a
    [INFO] Created On: 2022-12-02T07:46:42.586249984Z
    [INFO] Default User Set to hadoop:hadoop : PASS
    [INFO] Working Directory Set to : PASS
    [INFO] Entrypoint Set to /usr/bin/entrypoint.sh : PASS
    [INFO] HADOOP_HOME is set with value: /usr/lib/hadoop : PASS
    [INFO] HADOOP_LIBEXEC_DIR is set with value: /usr/lib/hadoop/libexec : PASS
    [INFO] HADOOP_USER_HOME is set with value: /home/hadoop : PASS
    [INFO] HADOOP_YARN_HOME is set with value: /usr/lib/hadoop-yarn : PASS
    [INFO] HIVE_HOME is set with value: /usr/lib/hive : PASS
    [INFO] JAVA_HOME is set with value: /etc/alternatives/jre : PASS
    [INFO] TEZ_HOME is set with value: /usr/lib/tez : PASS
    [INFO] YARN_HOME is set with value: /usr/lib/hadoop-yarn : PASS
    [INFO] File Structure Test for hadoop-files in /usr/lib/hadoop: PASS
    [INFO] File Structure Test for hadoop-jars in /usr/lib/hadoop/lib: PASS
    [INFO] File Structure Test for hadoop-yarn-jars in /usr/lib/hadoop-yarn: PASS
    [INFO] File Structure Test for hive-bin-files in /usr/bin: PASS
    [INFO] File Structure Test for hive-jars in /usr/lib/hive/lib: PASS
    [INFO] File Structure Test for java-bin in /etc/alternatives/jre/bin: PASS
    [INFO] File Structure Test for tez-jars in /usr/lib/tez: PASS
    -----------------------------------------------------------------
    Overall Custom Image Validation Succeeded.
    -----------------------------------------------------------------

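If you run the validation from a script, one way to gate later steps is to look for the success banner in the captured output. The sketch below is an assumption-laden example: this page does not say whether the tool returns a non-zero exit code on failure, so the check relies only on the banner text shown in the sample output. The function names `validation_succeeded` and `validate_image` are hypothetical.

```shell
# Read CLI output on stdin and succeed only if the success banner is present.
validation_succeeded() {
  grep -q 'Overall Custom Image Validation Succeeded\.'
}

# Validate an image (first argument: full image reference), echo the report,
# and return the result of the banner check.
validate_image() {
  report="$(amazon-emr-serverless-image validate-image -r emr-6.9.0 -t spark -i "$1" 2>&1)"
  printf '%s\n' "$report"
  printf '%s\n' "$report" | validation_succeeded
}
```

For example, `validate_image aws-account-id.dkr.ecr.region.amazonaws.com/my-repository:tag || exit 1` would stop a script before the push in the next step.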
Step 3: Upload the image to your Amazon ECR repository
Push your image to your Amazon ECR repository with the following commands. Ensure you have the correct IAM permissions to push the image to your repository. For more information, see Pushing an image in the Amazon ECR User Guide.

    # login to ECR repo
    aws ecr get-login-password --region region | docker login --username AWS --password-stdin aws-account-id.dkr.ecr.region.amazonaws.com

    # push the docker image
    docker push aws-account-id.dkr.ecr.region.amazonaws.com/my-repository:tag/@digest

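The build, login, and push commands from Steps 1 and 3 can be chained in a small script so that the image reference is composed in only one place. This is a minimal sketch: the function names `image_uri` and `build_and_push` are hypothetical, and it tags the image with a tag rather than a digest.

```shell
# Compose the image reference from account ID, Region, repository, and tag.
image_uri() {
  printf '%s.dkr.ecr.%s.amazonaws.com/%s:%s\n' "$1" "$2" "$3" "$4"
}

# Build the image from the current directory, log in to the registry,
# and push. Each step runs only if the previous one succeeded.
# Arguments: account ID, Region, repository name, tag.
build_and_push() {
  uri="$(image_uri "$1" "$2" "$3" "$4")"
  registry="${uri%%/*}"   # registry host: everything before the first slash
  docker build . -t "$uri" \
    && aws ecr get-login-password --region "$2" \
       | docker login --username AWS --password-stdin "$registry" \
    && docker push "$uri"
}
```

A call might look like `build_and_push 123456789012 us-east-1 my-repository v1`, where all four values are placeholders.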
Step 4: Create or update an application with custom images
You can create or update the application with a custom image from either the AWS Management Console or the AWS CLI, according to how you want to launch your application.
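For the AWS CLI route, a sketch along the following lines passes the image to `create-application` or `update-application` through the `--image-configuration` option. The function names and the application name are hypothetical, and the release label and type are assumed to match the Spark emr-6.9.0 base image used earlier on this page.

```shell
# Build the JSON value for --image-configuration from an image reference.
image_configuration() {
  printf '{"imageUri": "%s"}\n' "$1"
}

# Create a Spark application that uses the custom image.
# Arguments: application name, image reference.
create_application() {
  aws emr-serverless create-application \
    --name "$1" \
    --release-label emr-6.9.0 \
    --type SPARK \
    --image-configuration "$(image_configuration "$2")"
}

# Point an existing application at a new image.
# Arguments: application ID, image reference.
update_application() {
  aws emr-serverless update-application \
    --application-id "$1" \
    --image-configuration "$(image_configuration "$2")"
}
```

For example, `create_application my-application aws-account-id.dkr.ecr.region.amazonaws.com/my-repository:tag` would create a new application from the pushed image.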
Step 5: Allow EMR Serverless to access the custom image repository
Add the following resource policy to the Amazon ECR repository to allow the EMR Serverless service principal to make get, describe, and download requests against this repository.

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "Emr Serverless Custom Image Support",
          "Effect": "Allow",
          "Principal": {
            "Service": "emr-serverless.amazonaws.com"
          },
          "Action": [
            "ecr:BatchGetImage",
            "ecr:DescribeImages",
            "ecr:GetDownloadUrlForLayer"
          ],
          "Condition": {
            "StringEquals": {
              "aws:SourceArn": "arn:aws:emr-serverless:region:aws-account-id:/applications/application-id"
            }
          }
        }
      ]
    }

As a security best practice, add an aws:SourceArn condition key to the repository policy. The IAM global condition key aws:SourceArn ensures that EMR Serverless uses the repository only for the application ARN that you specify. For more information on Amazon ECR repository policies, see Creating a private repository.
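One way to apply this policy from the command line is to write it to a file with the application ARN filled in and then attach it with `aws ecr set-repository-policy`. The sketch below assumes this workflow; the function names and the policy file path are hypothetical.

```shell
# Write the repository policy, scoped to a single application ARN.
# Arguments: output file, Region, account ID, application ID.
write_repository_policy() {
  cat > "$1" <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Emr Serverless Custom Image Support",
      "Effect": "Allow",
      "Principal": {
        "Service": "emr-serverless.amazonaws.com"
      },
      "Action": [
        "ecr:BatchGetImage",
        "ecr:DescribeImages",
        "ecr:GetDownloadUrlForLayer"
      ],
      "Condition": {
        "StringEquals": {
          "aws:SourceArn": "arn:aws:emr-serverless:$2:$3:/applications/$4"
        }
      }
    }
  ]
}
EOF
}

# Attach the policy file to the repository.
# Arguments: repository name, policy file, Region.
apply_repository_policy() {
  aws ecr set-repository-policy \
    --repository-name "$1" \
    --policy-text "file://$2" \
    --region "$3"
}
```

Because the condition names one application, each additional application that uses the repository needs its own ARN added to the policy.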
Considerations and limitations
When you work with custom images, consider the following:
- Use the correct base image that matches the type (Spark or Hive) and release label (for example, emr-6.9.0) for your application.
- EMR Serverless ignores CMD or ENTRYPOINT instructions in the Dockerfile. Use common instructions in the Dockerfile, such as COPY, RUN, and WORKDIR.
- Don't modify the environment variables JAVA_HOME, SPARK_HOME, HIVE_HOME, and TEZ_HOME when you create a custom image.
- Custom images can't exceed 5 GB in size.
- If you modify binaries or jars in the Amazon EMR base images, it might cause application or job launch failures.
- The Amazon ECR repository should be in the same AWS Region that you use to launch EMR Serverless applications.
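Some of these considerations can be checked before you build. The following sketch is a simple line-based check, not an official tool: the function name `lint_dockerfile` is hypothetical, it assumes Dockerfile instructions are written in uppercase with one ENV variable per line, and it covers only the CMD/ENTRYPOINT, environment variable, and final USER points.

```shell
# Print a warning for each Dockerfile issue found; return 1 if any was found.
# Argument: path to the Dockerfile.
lint_dockerfile() {
  status=0
  # CMD and ENTRYPOINT are ignored by EMR Serverless.
  if grep -Eq '^[[:space:]]*(CMD|ENTRYPOINT)[[:space:]]' "$1"; then
    echo "warning: CMD and ENTRYPOINT instructions are ignored"
    status=1
  fi
  # These variables should keep the values set by the base image.
  if grep -Eq '^[[:space:]]*ENV[[:space:]]+(JAVA_HOME|SPARK_HOME|HIVE_HOME|TEZ_HOME)([[:space:]=]|$)' "$1"; then
    echo "warning: do not modify JAVA_HOME, SPARK_HOME, HIVE_HOME, or TEZ_HOME"
    status=1
  fi
  # If USER is set at all, the last setting should be hadoop:hadoop.
  last_user="$(grep -E '^[[:space:]]*USER[[:space:]]' "$1" | tail -n 1 | awk '{print $2}')"
  if [ -n "$last_user" ] && [ "$last_user" != "hadoop:hadoop" ]; then
    echo "warning: final USER is $last_user, expected hadoop:hadoop"
    status=1
  fi
  return "$status"
}
```

Running `lint_dockerfile ./Dockerfile` before `docker build` gives quick feedback, but it does not replace the image validation tool from Step 2.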