How to Build Your Own Processing Container (Advanced Scenario)

You can provide Amazon SageMaker Processing with a Docker image that has your own code and dependencies to run your data processing, feature engineering, and model evaluation workloads. The following sections describe how to build your own processing container.

The following example of a Dockerfile builds a container with the Python libraries scikit-learn and pandas, which you can run as a processing job.

FROM python:3.7-slim-buster

# Install scikit-learn and pandas
RUN pip3 install pandas==0.25.3 scikit-learn==0.21.3

# Add a Python script and configure Docker to run it
ADD processing_script.py /

ENTRYPOINT ["python3", "/processing_script.py"]

For an example of a processing script, see Get started with SageMaker Processing.
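
For illustration, a minimal processing_script.py might look like the following sketch. The input and output paths are assumptions for this example; they must match the LocalPath values that you configure for the job's inputs and outputs (shown later in this topic).

# processing_script.py - a minimal sketch; adapt the transformation to your workload.
import os

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Assumed paths; these must match the LocalPath values configured for the job
input_dir = "/opt/ml/processing/input/dataset"
output_dir = "/opt/ml/processing/output/dataset"

# Read every CSV file that SageMaker Processing downloaded into the input directory
frames = [pd.read_csv(os.path.join(input_dir, f))
          for f in os.listdir(input_dir) if f.endswith(".csv")]
df = pd.concat(frames, ignore_index=True)

# Example transformation: standardize all numeric columns
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])

# Write results; SageMaker Processing uploads this directory to Amazon S3 when the job ends
os.makedirs(output_dir, exist_ok=True)
df.to_csv(os.path.join(output_dir, "processed.csv"), index=False)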

Build and push this Docker image to an Amazon Elastic Container Registry (Amazon ECR) repository and ensure that your SageMaker AI IAM role can pull the image from Amazon ECR. Then you can run this image on Amazon SageMaker Processing.
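
For example, you can start a processing job with the AWS SDK for Python (Boto3). The following is a minimal sketch; the job name, image URI, S3 URIs, and role ARN are placeholders that you must replace with your own values.

import boto3

sagemaker = boto3.client("sagemaker")

# Placeholder values - substitute your own job name, image URI, bucket, and role
sagemaker.create_processing_job(
    ProcessingJobName="my-processing-job",
    AppSpecification={
        "ImageUri": "<account>.dkr.ecr.<region>.amazonaws.com/<repository>:latest"
    },
    ProcessingInputs=[{
        "InputName": "input-1",
        "S3Input": {
            "S3Uri": "s3://<bucket>/input",
            "LocalPath": "/opt/ml/processing/input/dataset",
            "S3DataType": "S3Prefix",
            "S3InputMode": "File",
            "S3DataDistributionType": "FullyReplicated"
        }
    }],
    ProcessingOutputConfig={
        "Outputs": [{
            "OutputName": "output-1",
            "S3Output": {
                "S3Uri": "s3://<bucket>/output",
                "LocalPath": "/opt/ml/processing/output/dataset",
                "S3UploadMode": "EndOfJob"
            }
        }]
    },
    ProcessingResources={
        "ClusterConfig": {
            "InstanceCount": 1,
            "InstanceType": "ml.m5.xlarge",
            "VolumeSizeInGB": 30
        }
    },
    RoleArn="arn:aws:iam::<account>:role/<sagemaker-role>",
    StoppingCondition={"MaxRuntimeInSeconds": 86400}
)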

How Amazon SageMaker Processing Configures Your Processing Container

Amazon SageMaker Processing provides configuration information to your processing container through environment variables and through two JSON files at predefined locations in the container: /opt/ml/config/processingjobconfig.json and /opt/ml/config/resourceconfig.json.

When a processing job starts, it uses the environment variables that you specified with the Environment map in the CreateProcessingJob request. The /opt/ml/config/processingjobconfig.json file contains the configuration of the processing job, as specified in the CreateProcessingJob request.

The following example shows the format of the /opt/ml/config/processingjobconfig.json file.

{ "ProcessingJobArn": "<processing_job_arn>", "ProcessingJobName": "<processing_job_name>", "AppSpecification": { "ImageUri": "<image_uri>", "ContainerEntrypoint": null, "ContainerArguments": null }, "Environment": { "KEY": "VALUE" }, "ProcessingInputs": [ { "InputName": "input-1", "S3Input": { "LocalPath": "/opt/ml/processing/input/dataset", "S3Uri": "<s3_uri>", "S3DataDistributionType": "FullyReplicated", "S3DataType": "S3Prefix", "S3InputMode": "File", "S3CompressionType": "None", "S3DownloadMode": "StartOfJob" } } ], "ProcessingOutputConfig": { "Outputs": [ { "OutputName": "output-1", "S3Output": { "LocalPath": "/opt/ml/processing/output/dataset", "S3Uri": "<s3_uri>", "S3UploadMode": "EndOfJob" } } ], "KmsKeyId": null }, "ProcessingResources": { "ClusterConfig": { "InstanceCount": 1, "InstanceType": "ml.m5.xlarge", "VolumeSizeInGB": 30, "VolumeKmsKeyId": null } }, "RoleArn": "<IAM role>", "StoppingCondition": { "MaxRuntimeInSeconds": 86400 } }

The /opt/ml/config/resourceconfig.json file contains information about the hostnames of your processing containers. Use the following hostnames when creating or running distributed processing code.

{ "current_host": "algo-1", "hosts": ["algo-1","algo-2","algo-3"] }

Don't use the information about hostnames contained in /etc/hostname or /etc/hosts because it might be inaccurate.

Hostname information might not be immediately available to the processing container. We recommend adding a retry policy on hostname resolution operations as nodes become available in the cluster.
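
A minimal sketch of such a retry loop, using only the Python standard library, might look like the following. The retry count and delay are illustrative values; tune them for your cluster.

import json
import socket
import time

# Read the cluster layout that SageMaker Processing places in the container
with open("/opt/ml/config/resourceconfig.json") as f:
    resource_config = json.load(f)

# Wait until every host in the cluster resolves, retrying as nodes come up
for host in resource_config["hosts"]:
    for attempt in range(12):  # illustrative limit: up to ~1 minute per host
        try:
            socket.gethostbyname(host)
            break
        except socket.gaierror:
            time.sleep(5)
    else:
        raise RuntimeError(f"Host {host} did not become resolvable in time")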