How to customize Docker images

Take the following steps to customize Docker images for Amazon EMR on EKS.

This topic also covers other options you may want to consider when customizing Docker images: customizing images for interactive endpoints and working with multi-architecture images.

Step 1: Retrieve a base image from Amazon Elastic Container Registry (Amazon ECR)

The base image contains the Amazon EMR runtime and connectors that are used to access other AWS services. For Amazon EMR 6.9.0 and higher, you can get the base images from the Amazon ECR Public Gallery. Browse the gallery to find the image link and pull the image to your local workspace. For example, for the Amazon EMR 7.0.0 release, the following docker pull command gets you the latest standard base image. You can replace emr-7.0.0:latest with emr-7.0.0-spark-rapids:latest to retrieve the image that includes the Nvidia RAPIDS accelerator, or with emr-7.0.0-java11:latest to retrieve the image with the Java 11 runtime.

docker pull public.ecr.aws/emr-on-eks/spark/emr-7.0.0:latest
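
For example, the following commands follow the same URI pattern to retrieve the Nvidia RAPIDS accelerator image and the Java 11 runtime image for the same release.

docker pull public.ecr.aws/emr-on-eks/spark/emr-7.0.0-spark-rapids:latest
docker pull public.ecr.aws/emr-on-eks/spark/emr-7.0.0-java11:latest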

If you would like to retrieve the base image for Amazon EMR 6.9.0 or earlier releases, or if you prefer to retrieve from the Amazon ECR registry accounts in each Region, use the following steps:

  1. Choose a base image URI. The image URI follows this format, ECR-registry-account.dkr.ecr.Region.amazonaws.com/spark/container-image-tag, as the following example demonstrates.

    895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-6.6.0:latest

    To choose a base image in your Region, see How to select a base image URI.

  2. Log in to the Amazon ECR repository where the base image is stored. Replace 895885662937 and us-west-2 with the Amazon ECR registry account and the AWS Region you have selected.

    aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 895885662937.dkr.ecr.us-west-2.amazonaws.com
  3. Pull the base image into your local workspace. Replace emr-6.6.0:latest with the container image tag you have selected.

    docker pull 895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-6.6.0:latest
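
    To confirm that the pull succeeded, you can list the image in your local workspace. This docker images check is optional and assumes the registry account and Region from the previous examples.

    docker images 895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-6.6.0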

Step 2: Customize a base image

Take the following steps to customize the base image you have pulled from Amazon ECR.

  1. Create a new Dockerfile in your local workspace.

  2. Edit the Dockerfile you just created and add the following content. This Dockerfile uses the container image you have pulled from 895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-6.6.0:latest.

    FROM 895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-6.6.0:latest
    USER root
    ### Add customization commands here ####
    USER hadoop:hadoop
  3. Add commands in the Dockerfile to customize the base image. For example, add a command to install Python libraries, as the following Dockerfile demonstrates.

    FROM 895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-6.6.0:latest
    USER root
    # Install Python 3 libraries
    RUN pip3 install --upgrade boto3 pandas numpy
    USER hadoop:hadoop
  4. From the same directory where the Dockerfile is created, run the following command to build the Docker image. Provide a name for the Docker image, for example, emr6.6_custom.

    docker build -t emr6.6_custom .
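
    Before you validate or publish the image, you can optionally smoke-test your customizations. The following sketch assumes the Python libraries from the earlier example; it overrides the image entrypoint to run a one-line import check.

    docker run --rm --entrypoint python3 emr6.6_custom -c "import boto3, pandas, numpy; print('ok')"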

Step 3: (Optional but recommended) Validate a custom image

We recommend that you test the compatibility of your custom image before publishing it. You can use the Amazon EMR on EKS custom image CLI to check if your image has the required file structures and correct configurations for running on Amazon EMR on EKS.

Note

The Amazon EMR on EKS custom image CLI cannot confirm that your image is free of errors. Use caution when removing dependencies from the base images.

Take the following steps to validate your custom image.

  1. Download and install the Amazon EMR on EKS custom image CLI. For more information, see the Amazon EMR on EKS custom image CLI Installation Guide.

  2. Run the following command to test the installation.

    emr-on-eks-custom-image --version

    The following shows an example of the output.

    Amazon EMR on EKS Custom Image CLI Version: x.xx
  3. Run the following command to validate your custom image.

    emr-on-eks-custom-image validate-image -i image_name -r release_version [-t image_type]
    • -i specifies the local image to validate. This can be the image URI or any name or tag that you defined for your image.

    • -r specifies the exact release version for the base image, for example, emr-6.6.0-latest.

    • -t specifies the image type. If this is a Spark image, input spark. The default value is spark. The current Amazon EMR on EKS custom image CLI version only supports Spark runtime images.
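
    For example, the following command validates the emr6.6_custom image built in Step 2 against its base release.

    emr-on-eks-custom-image validate-image -i emr6.6_custom -r emr-6.6.0-latest -t spark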

    If you run the command successfully and the custom image meets all the required configurations and file structures, the returned output displays the results of all of the tests, as the following example demonstrates.

    Amazon EMR on EKS Custom Image Test Version: x.xx
    ... Checking if docker cli is installed ...
    ... Checking Image Manifest
    [INFO] Image ID: xxx
    [INFO] Created On: 2021-05-17T20:50:07.986662904Z
    [INFO] Default User Set to hadoop:hadoop : PASS
    [INFO] Working Directory Set to /home/hadoop : PASS
    [INFO] Entrypoint Set to /usr/bin/entrypoint.sh : PASS
    [INFO] SPARK_HOME is set with value: /usr/lib/spark : PASS
    [INFO] JAVA_HOME is set with value: /etc/alternatives/jre : PASS
    [INFO] File Structure Test for spark-jars in /usr/lib/spark/jars: PASS
    [INFO] File Structure Test for hadoop-files in /usr/lib/hadoop: PASS
    [INFO] File Structure Test for hadoop-jars in /usr/lib/hadoop/lib: PASS
    [INFO] File Structure Test for bin-files in /usr/bin: PASS
    ... Start Running Sample Spark Job
    [INFO] Sample Spark Job Test with local:///usr/lib/spark/examples/jars/spark-examples.jar : PASS
    -----------------------------------------------------------------
    Overall Custom Image Validation Succeeded.
    -----------------------------------------------------------------

    If the custom image doesn't meet the required configurations or file structures, error messages appear in the returned output with information about the incorrect configurations or file structures.

Step 4: Publish a custom image

Publish the new Docker image to your Amazon ECR registry.

  1. Run the following command to create an Amazon ECR repository for storing your Docker image. Provide a name for your repository, for example, emr6.6_custom_repo. Replace us-west-2 with your Region.

    aws ecr create-repository \
        --repository-name emr6.6_custom_repo \
        --image-scanning-configuration scanOnPush=true \
        --region us-west-2

    For more information, see Create a repository in the Amazon ECR User Guide.

  2. Run the following command to authenticate to your default registry.

    aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin aws_account_id.dkr.ecr.us-west-2.amazonaws.com

    For more information, see Authenticate to your default registry in the Amazon ECR User Guide.

  3. Tag and publish an image to the Amazon ECR repository you created.

    Tag the image.

    docker tag emr6.6_custom aws_account_id.dkr.ecr.us-west-2.amazonaws.com/emr6.6_custom_repo

    Push the image.

    docker push aws_account_id.dkr.ecr.us-west-2.amazonaws.com/emr6.6_custom_repo

    For more information, see Push an image to Amazon ECR in the Amazon ECR User Guide.
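
    If you omit a tag, Docker applies latest. For reproducible job runs, you may prefer to pin an explicit version tag; the 1.0 tag in the following sketch is illustrative.

    docker tag emr6.6_custom aws_account_id.dkr.ecr.us-west-2.amazonaws.com/emr6.6_custom_repo:1.0
    docker push aws_account_id.dkr.ecr.us-west-2.amazonaws.com/emr6.6_custom_repo:1.0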

Step 5: Submit a Spark workload in Amazon EMR using a custom image

After you build and publish a custom image, you can submit an Amazon EMR on EKS job that uses it.

First, create a start-job-run-request.json file and specify the spark.kubernetes.container.image parameter to reference the custom image, as the following example JSON file demonstrates.

Note

You can use the local:// scheme to refer to files available in the custom image, as shown with the entryPoint argument in the following JSON snippet. You can also use the local:// scheme to refer to application dependencies. All files and dependencies that are referred to using the local:// scheme must already be present at the specified path in the custom image.

{ "name": "spark-custom-image", "virtualClusterId": "virtual-cluster-id", "executionRoleArn": "execution-role-arn", "releaseLabel": "emr-6.6.0-latest", "jobDriver": { "sparkSubmitJobDriver": { "entryPoint": "local:///usr/lib/spark/examples/jars/spark-examples.jar", "entryPointArguments": [ "10" ], "sparkSubmitParameters": "--class org.apache.spark.examples.SparkPi --conf spark.kubernetes.container.image=123456789012.dkr.ecr.us-west-2.amazonaws.com/emr6.6_custom_repo" } } }

You can also reference the custom image with applicationConfiguration properties as the following example demonstrates.

{ "name": "spark-custom-image", "virtualClusterId": "virtual-cluster-id", "executionRoleArn": "execution-role-arn", "releaseLabel": "emr-6.6.0-latest", "jobDriver": { "sparkSubmitJobDriver": { "entryPoint": "local:///usr/lib/spark/examples/jars/spark-examples.jar", "entryPointArguments": [ "10" ], "sparkSubmitParameters": "--class org.apache.spark.examples.SparkPi" } }, "configurationOverrides": { "applicationConfiguration": [ { "classification": "spark-defaults", "properties": { "spark.kubernetes.container.image": "123456789012.dkr.ecr.us-west-2.amazonaws.com/emr6.6_custom_repo" } } ] } }

Then run the start-job-run command to submit the job.

aws emr-containers start-job-run --cli-input-json file://./start-job-run-request.json
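
To monitor the submitted job, you can describe the job run. The IDs below are placeholders; the job run ID is returned by the start-job-run command.

aws emr-containers describe-job-run --virtual-cluster-id virtual-cluster-id --id job-run-id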

In the JSON examples above, replace emr-6.6.0-latest with your Amazon EMR release version. We strongly recommend that you use the -latest release version to ensure that the selected version contains the latest security updates. For more information about Amazon EMR release versions and their image tags, see How to select a base image URI.

Note

You can use spark.kubernetes.driver.container.image and spark.kubernetes.executor.container.image to specify a different image for driver and executor pods.
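
For example, the following sparkSubmitParameters sketch assumes two hypothetical tags of the repository from Step 4, one for the driver and one for the executors.

"sparkSubmitParameters": "--class org.apache.spark.examples.SparkPi --conf spark.kubernetes.driver.container.image=123456789012.dkr.ecr.us-west-2.amazonaws.com/emr6.6_custom_repo:driver --conf spark.kubernetes.executor.container.image=123456789012.dkr.ecr.us-west-2.amazonaws.com/emr6.6_custom_repo:executor"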

Customize Docker images for interactive endpoints

You can also customize Docker images for interactive endpoints so that you can run customized base kernel images. This helps you ensure that you have the dependencies you need when you run interactive workloads from EMR Studio.

  1. Follow Steps 1-4 above to customize a Docker image. For Amazon EMR 6.9.0 releases and later, you can get the base image URI from the Amazon ECR Public Gallery. For releases before Amazon EMR 6.9.0, you can get the image from the Amazon ECR registry accounts in each AWS Region; the only difference is the base image URI in your Dockerfile, which follows this format:

    ECR-registry-account.dkr.ecr.Region.amazonaws.com/notebook-spark/container-image-tag

    You need to use notebook-spark in the base image URI, instead of spark. The base image contains the Spark runtime and the notebook kernels that run with it. For more information about selecting Regions and container image tags, see How to select a base image URI.

    Note

    Currently, only overrides of base images are supported. Introducing completely new kernels of types other than the base images that AWS provides is not supported.

  2. Create an interactive endpoint that can be used with the custom image.

    First, create a JSON file called custom-image-managed-endpoint.json with the following contents.

    { "name": "endpoint-name", "virtualClusterId": "virtual-cluster-id", "type": "JUPYTER_ENTERPRISE_GATEWAY", "releaseLabel": "emr-6.6.0-latest", "executionRoleArn": "execution-role-arn", "certificateArn": "certificate-arn", "configurationOverrides": { "applicationConfiguration": [ { "classification": "jupyter-kernel-overrides", "configurations": [ { "classification": "python3", "properties": { "container-image": "123456789012.dkr.ecr.us-west-2.amazonaws.com/custom-notebook-python:latest" } }, { "classification": "spark-python-kubernetes", "properties": { "container-image": "123456789012.dkr.ecr.us-west-2.amazonaws.com/custom-notebook-spark:latest" } } ] } ] } }

    Next, create an interactive endpoint using the configurations specified in the JSON file, as the following example demonstrates.

    aws emr-containers create-managed-endpoint --cli-input-json file://./custom-image-managed-endpoint.json

    For more information, see Create an interactive endpoint for your virtual cluster.

  3. Connect to the interactive endpoint via EMR Studio. For more information, see Connecting from Studio.

Work with multi-architecture images

Amazon EMR on EKS supports multi-architecture container images for Amazon Elastic Container Registry (Amazon ECR). For more information, see Introducing multi-architecture container images for Amazon ECR.

Amazon EMR on EKS custom images support both AWS Graviton-based EC2 instances and non-Graviton-based EC2 instances. The Graviton-based images are stored in the same image repositories in Amazon ECR as non-Graviton-based images.

For example, to inspect the Docker manifest list for the Amazon EMR 6.6.0 image, run the following command.

docker manifest inspect 895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-6.6.0:latest

Here is the output. The arm64 architecture is for Graviton instances. The amd64 architecture is for non-Graviton instances.

{ "schemaVersion": 2, "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json", "manifests": [ { "mediaType": "application/vnd.docker.distribution.manifest.v2+json", "size": 1805, "digest": "xxx123:6b971cb47d11011ab3d45fff925e9442914b4977ae0f9fbcdcf5cfa99a7593f0", "platform": { "architecture": "arm64", "os": "linux" } }, { "mediaType": "application/vnd.docker.distribution.manifest.v2+json", "size": 1805, "digest": "xxx123:6f2375582c9c57fa9838c1d3a626f1b4fc281e287d2963a72dfe0bd81117e52f", "platform": { "architecture": "amd64", "os": "linux" } } ] }

Take the following steps to create multi-architecture images:

  1. Create a Dockerfile with the following contents so that you can pull the arm64 image.

    FROM --platform=arm64 895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-6.6.0:latest
    USER root
    # Install customizations here
    RUN pip3 install boto3
    USER hadoop:hadoop
  2. Follow the instructions at Introducing multi-architecture container images for Amazon ECR to build a multi-architecture image.

    Note

    You must create arm64 images on arm64 instances. Similarly, you must build amd64 images on amd64 instances.

    You can also build multi-architecture images without building on each specific instance type with the Docker buildx command. For more information, see Leverage multi-CPU architecture support.
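
    For example, a minimal buildx invocation might look like the following. It assumes the emr6.6_custom_repo repository from Step 4 and that you have already authenticated Docker to Amazon ECR.

    docker buildx build --platform linux/arm64,linux/amd64 \
        -t aws_account_id.dkr.ecr.us-west-2.amazonaws.com/emr6.6_custom_repo:latest \
        --push .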

  3. After you build the multi-architecture image, you can submit a job with the same spark.kubernetes.container.image parameter and point it to the image. In a heterogeneous cluster with both AWS Graviton-based and non-Graviton-based EC2 instances, each instance pulls the image variant that matches its own architecture.