Deploy models with DJL Serving
DJL Serving is a high-performance, universal, stand-alone model serving solution. It takes a deep learning model, several models, or workflows and makes them available through an HTTP endpoint.
You can use one of the DJL Serving Deep Learning Containers (DLCs) to serve your models on AWS. To learn about the supported model types and frameworks, see the DJL Serving GitHub repository.
DJL Serving offers many features that help you to deploy your models with high performance:
- Ease of use – DJL Serving can serve most models without any modifications. You bring your model artifacts, and DJL Serving can host them.
- Multiple device and accelerator support – DJL Serving supports deploying models on CPUs, GPUs, and AWS Inferentia.
- Performance – DJL Serving runs multithreaded inference in a single Java virtual machine (JVM) to boost throughput.
- Dynamic batching – DJL Serving supports dynamic batching to increase throughput.
- Auto scaling – DJL Serving automatically scales workers up or down based on the traffic load.
- Multi-engine support – DJL Serving can simultaneously host models that use different frameworks (for example, PyTorch and TensorFlow).
- Ensemble and workflow models – DJL Serving supports deploying complex workflows composed of multiple models, and it can run parts of a workflow on CPUs and other parts on GPUs. Models within a workflow can use different frameworks.
The following sections describe how to set up an endpoint with DJL Serving on SageMaker.
Getting started
To get started, ensure that you have the following prerequisites:
- Ensure that you have access to an AWS account. Set up your environment so that the AWS CLI can access your account through either an AWS IAM user or an IAM role. We recommend using an IAM role. For the purposes of testing in your personal account, you can attach managed permissions policies that grant SageMaker and Amazon ECR access to the IAM role.
- Ensure that you have the Docker client set up on your system.

- Log in to Amazon Elastic Container Registry and set the following environment variables:

export ACCOUNT_ID=<your_account_id>
export REGION=<your_region>
aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com

- Pull the Docker image.

docker pull 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.22.1-deepspeed0.9.2-cu118
For all of the available DJL Serving container images, see the large model inference containers and the DJL Serving CPU inference containers. When choosing an image from the tables in the preceding links, replace the AWS Region in the example URL column with the Region that you are in. The DLCs are available in the Regions listed in the table at the top of the Available Deep Learning Containers Images page.
Customize your container
You can add packages to the base DLC images to customize your container. Suppose you want to add a package to the 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.22.1-deepspeed0.9.2-cu118 Docker image. To do so, create a Dockerfile that uses your desired image as the base image, add the required packages, and push the image to Amazon ECR.
To add a package, complete the following steps:
- In your Dockerfile, specify the base image in the FROM instruction, and then add instructions for installing your desired libraries or packages.

FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.22.1-deepspeed0.9.2-cu118

## add custom packages/libraries
RUN git clone https://github.com/awslabs/amazon-sagemaker-examples
- Build the Docker image from the Dockerfile. Specify your Amazon ECR repository, a name for the image, and a tag for the image. Here, <registry> is your Amazon ECR registry URI, which takes the form $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com. If you don't have an Amazon ECR repository, see Using Amazon ECR with the AWS CLI in the Amazon ECR User Guide for instructions on how to create one.

docker build -f Dockerfile -t <registry>/<image_name>:<image_tag> .
- Push the Docker image to your Amazon ECR repository.

docker push $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/<image_name>:<image_tag>
You should now have a customized container image that you can use for model serving. For more examples of customizing your container, see Building AWS Deep Learning Containers Custom Images.
Prepare your model artifacts
Before deploying your model on SageMaker, you must package your model artifacts in a
.tar.gz
file. DJL Serving accepts the following artifacts in your
archive:
- Model checkpoint: Files that store your model weights.

- serving.properties: A configuration file that you can add for each model. Place serving.properties in the same directory as your model file.

- model.py: The inference handler code. This is only applicable when using Python mode. If you don't specify model.py, DJL Serving uses one of the default handlers.
The following is an example of a model.tar.gz
structure:
- model_root_dir # root directory
  - serving.properties
  - model.py # your custom handler file for Python, if you choose not to use the default handlers provided by DJL Serving
  - model binary files # used for Java mode, or if you don't want to use option.model_id and option.s3_url for Python mode
DJL Serving supports Java engines powered by DJL or Python engines. Not all of the
preceding artifacts are required; the required artifacts vary based on the mode you
choose. For example, in Python mode, you only need to specify
option.model_id
in the serving.properties
file; you
don't need to specify the model checkpoint inside LMI containers. In Java mode, you
are required to package the model checkpoint. For more details on how to configure
serving.properties
and operate with different engines, see DJL Serving Operation Modes.
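As an illustration, a minimal serving.properties for Python mode might look like the following sketch. The model ID and tensor parallel degree values are placeholders, and the options that are available depend on your container version, so check the DJL Serving configuration documentation before using them.

engine=Python
# Hugging Face model ID or S3 path to your model artifacts (placeholder value)
option.model_id=EleutherAI/gpt-j-6b
# Number of accelerators to shard the model across (illustrative value)
option.tensor_parallel_degree=1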
Use single model endpoints to deploy with DJL Serving
After preparing your model artifacts, you can deploy your model to a SageMaker endpoint. This section describes how to deploy a single model to an endpoint with DJL Serving. If you're deploying multiple models, skip this section and go to Use multi-model endpoints to deploy with DJL Serving.
The following example shows you a method to create a model object using the Amazon SageMaker Python SDK. You'll need to specify the following fields:
- image_uri: You can either retrieve one of the base DJL Serving images as shown in this example, or you can specify a custom Docker image from your Amazon ECR repository, if you followed the instructions in Customize your container.

- model_s3_url: This should be an Amazon S3 URI that points to your .tar.gz file.

- model_name: Specify a name for the model object.
import boto3
import sagemaker
from sagemaker.model import Model
from sagemaker import image_uris, get_execution_role

aws_region = "aws-region"
sagemaker_session = sagemaker.Session(boto_session=boto3.Session(region_name=aws_region))
role = get_execution_role()

def create_model(model_name, model_s3_url):
    # Get the DJL DeepSpeed image uri
    image_uri = image_uris.retrieve(
        framework="djl-deepspeed",
        region=sagemaker_session.boto_session.region_name,
        version="0.20.0"
    )

    model = Model(
        image_uri=image_uri,
        model_data=model_s3_url,
        role=role,
        name=model_name,
        sagemaker_session=sagemaker_session,
    )

    return model
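After you create the model object, you can deploy it to a real-time endpoint and send it a request. The following is a minimal sketch that assumes the variables defined in the preceding example; the model name, S3 URI, endpoint name, instance type, and request payload are placeholder values, and the expected payload schema depends on your inference handler.

from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Create the model object and deploy it to an endpoint (placeholder names and instance type).
model = create_model("my-djl-model", "s3://amzn-s3-demo-bucket/model.tar.gz")
model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",   # choose an instance type that fits your model
    endpoint_name="my-djl-endpoint",
)

# Send a JSON request to the endpoint. The payload schema depends on your handler.
predictor = Predictor(
    endpoint_name="my-djl-endpoint",
    sagemaker_session=sagemaker_session,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)
result = predictor.predict({"inputs": "What is DJL Serving?"})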
Use multi-model endpoints to deploy with DJL Serving
If you want to deploy multiple models to an endpoint, SageMaker offers multi-model endpoints, which are a scalable and cost-effective solution to deploying large numbers of models. DJL Serving also supports loading multiple models simultaneously and running inference on each of the models concurrently. DJL Serving containers adhere to the SageMaker multi-model endpoints contracts and can be used to deploy multi-model endpoints.
Each individual model artifact needs to be packaged in the same way as described
in the previous section Prepare your model
artifacts.
You can set model-specific configurations in the serving.properties
file
and model-specific inference handler code in model.py
. For a multi-model endpoint, models
need to be arranged in the following way:
root_dir
|-- model_1.tar.gz
|-- model_2.tar.gz
|-- model_3.tar.gz
.
.
.
The Amazon SageMaker Python SDK uses the MultiDataModel class to deploy multi-model endpoints. The Amazon S3 URI of the root_dir directory is passed as the model_data_prefix argument to the MultiDataModel constructor.
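The following is a minimal sketch of this setup with the SageMaker Python SDK. It assumes that role and sagemaker_session are defined as in the single model example; the S3 prefix, model name, endpoint name, instance type, and request payload are placeholder values, and the payload schema depends on your inference handler.

from sagemaker import image_uris
from sagemaker.multidatamodel import MultiDataModel
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Reuse a DJL Serving image, as in the single model example.
image_uri = image_uris.retrieve(
    framework="djl-deepspeed",
    region=sagemaker_session.boto_session.region_name,
    version="0.20.0"
)

# Placeholder S3 prefix that contains model_1.tar.gz, model_2.tar.gz, and so on.
model_data_prefix = "s3://amzn-s3-demo-bucket/djl-mme/"

mme = MultiDataModel(
    name="djl-multi-model",
    model_data_prefix=model_data_prefix,
    image_uri=image_uri,
    role=role,
    sagemaker_session=sagemaker_session,
)

mme.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",   # placeholder; choose an instance type supported for multi-model endpoints
    endpoint_name="djl-mme-endpoint",
)

# Route a request to a specific model in the prefix.
# The payload schema depends on your inference handler.
predictor = Predictor(
    endpoint_name="djl-mme-endpoint",
    sagemaker_session=sagemaker_session,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)
response = predictor.predict(
    {"inputs": "What is DJL Serving?"},
    target_model="model_1.tar.gz",
)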
DJL Serving also provides several configuration parameters to manage model memory requirements, such as required_memory_mb and reserved_memory_mb, that can be configured for each model in the serving.properties file.
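For example, a model's serving.properties could reserve memory as follows; the values shown are illustrative and should be sized to your model.

# illustrative memory settings, in megabytes
required_memory_mb=4096
reserved_memory_mb=512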
The auto scaling feature of DJL Serving makes it easy to ensure that the models are
scaled appropriately for incoming traffic. By default, DJL Serving determines the
maximum number of workers for a model that can be supported based on the hardware
available (such as CPU cores or GPU devices). You can set lower and upper bounds for each
model to ensure that a minimum traffic level can always be served, and that a single
model does not consume all available resources. You can set the following properties
in the serving.properties file (see the example after this list):

- gpu.minWorkers: Minimum number of workers for GPUs.
- gpu.maxWorkers: Maximum number of workers for GPUs.
- cpu.minWorkers: Minimum number of workers for CPUs.
- cpu.maxWorkers: Maximum number of workers for CPUs.
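For example, the following serving.properties snippet bounds the number of workers per device type; the values are illustrative.

# illustrative worker bounds
gpu.minWorkers=1
gpu.maxWorkers=2
cpu.minWorkers=2
cpu.maxWorkers=4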
For an end-to-end example of how to deploy a multi-model endpoint on SageMaker
using a DJL Serving container, see the example notebook Multi-Model-Inference-Demo.ipynb.