
Use Amazon SageMaker Elastic Inference (EI)

Starting April 15, 2023, AWS will not onboard new customers to Amazon Elastic Inference (EI), and will help current customers migrate their workloads to options that offer better price and performance. After April 15, 2023, new customers will not be able to launch instances with Amazon EI accelerators in Amazon SageMaker, Amazon ECS, or Amazon EC2.

Machine learning (ML) on AWS helps you innovate faster with the most comprehensive set of ML services and infrastructure made available in a low-cost, pay-as-you-go usage model. AWS continuously delivers better-performing and lower-cost infrastructure for ML inference workloads. AWS launched Amazon Elastic Inference (EI) in 2018 to enable customers to attach low-cost GPU-powered acceleration to Amazon EC2 instances, Amazon SageMaker instances, or Amazon Elastic Container Service (ECS) tasks to reduce the cost of running deep learning inference by up to 75% compared to standalone GPU-based instances such as Amazon EC2 P4d and Amazon EC2 G5. In 2019, AWS launched AWS Inferentia, Amazon's first custom silicon designed to accelerate deep learning workloads by providing high-performance inference in the cloud. Amazon EC2 Inf1 instances based on AWS Inferentia chips deliver up to 2.3x higher throughput and up to 70% lower cost per inference than comparable current-generation GPU-based Amazon EC2 instances. With the availability of new accelerated compute options such as AWS Inferentia and Amazon EC2 G5 instances, the benefit of attaching a fractional GPU to a CPU host instance using Amazon EI has diminished. For example, customers hosting models on Amazon EI who move to ml.inf1.xlarge instances can get up to 56% cost savings and a 2x performance improvement.

Customers can use Amazon SageMaker Inference Recommender to help them choose the best alternative instances to Amazon EI for deploying their ML models.

Frequently asked questions

  1. Why is Amazon encouraging customers to move workloads from Amazon Elastic Inference (EI) to newer hardware acceleration options such as AWS Inferentia?

    Customers get better performance at a much better price than Amazon EI with new hardware accelerator options such as AWS Inferentia for their inference workloads. AWS Inferentia is designed to provide high performance inference in the cloud, to drive down the total cost of inference, and to make it easy for developers to integrate machine learning into their business applications. To enable customers to benefit from such newer generation hardware accelerators, we will not onboard new customers to Amazon EI after April 15, 2023.

  2. Which AWS services are impacted by the move to stop onboarding new customers to Amazon Elastic Inference (EI)?

    This announcement will affect Amazon EI accelerators attached to any Amazon EC2, Amazon SageMaker instances, or Amazon Elastic Container Service (ECS) tasks. In Amazon SageMaker, this applies to both endpoints and notebook kernels using Amazon EI accelerators.

  3. Will I be able to create a new Amazon Elastic Inference (EI) accelerator after April 15, 2023?

    No, if you are a new customer and have not used Amazon EI in the past 30 days, you will not be able to create a new Amazon EI accelerator in your AWS account after April 15, 2023. However, if you have used an Amazon EI accelerator at least once in the past 30 days, you can attach a new Amazon EI accelerator to your instance.

  4. How do I evaluate alternative instance options for my current Amazon SageMaker Inference Endpoints?

    Amazon SageMaker Inference Recommender can help you identify cost-effective deployments to migrate existing workloads from Amazon Elastic Inference (EI) to an appropriate ML instance supported by SageMaker.
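
    You can also drive Inference Recommender programmatically. The following is a minimal sketch using boto3, assuming you have already registered a model package; the job name, role ARN, and model package ARN are placeholders.

    import boto3

    sagemaker_client = boto3.client("sagemaker")

    # Start a Default recommendation job, which benchmarks a standard set of instance types.
    # All names and ARNs below are placeholders.
    sagemaker_client.create_inference_recommendations_job(
        JobName="ei-migration-recommendation",
        JobType="Default",
        RoleArn="arn:aws:iam::111122223333:role/SageMakerRole",
        InputConfig={
            "ModelPackageVersionArn": "arn:aws:sagemaker:us-east-1:111122223333:model-package/my-model/1"
        },
    )

    # Check the job status; recommendations are listed in the response once the job completes.
    response = sagemaker_client.describe_inference_recommendations_job(
        JobName="ei-migration-recommendation"
    )
    print(response["Status"])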

  5. How do I change the instance type for my existing endpoint in Amazon SageMaker?

    You can change the instance type for your existing endpoint by doing the following:

    1. First, create a new EndpointConfig that uses the new instance type. If you have an autoscaling policy, delete the existing autoscaling policy (see the sketch after this list).

    2. Call UpdateEndpoint while specifying your newly created EndpointConfig.

    3. Wait for your endpoint to change status to InService. This will take approximately 10-15 minutes.

    4. Finally, if you need autoscaling for your new endpoint, create a new autoscaling policy for this new endpoint and ProductionVariant.
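
    If you manage autoscaling with the Application Auto Scaling API, the policy cleanup in step 1 might look like the following sketch; the policy, endpoint, and variant names are placeholders.

    import boto3

    autoscaling_client = boto3.client("application-autoscaling")

    # Placeholder resource ID; use the endpoint and variant names from your existing endpoint.
    resource_id = "endpoint/my-endpoint/variant/variant1"

    # Remove the scaling policy attached to the production variant.
    autoscaling_client.delete_scaling_policy(
        PolicyName="my-scaling-policy",
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    )

    # Deregister the variant as a scalable target.
    autoscaling_client.deregister_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    )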

  6. How do I change the instance type for my existing Amazon SageMaker Notebook Instance using Amazon Elastic Inference (EI)?

    Choose Notebook instances in the SageMaker console, and then choose the notebook instance you want to update. Make sure that the notebook instance has a Stopped status. Then choose Edit and change the instance type. When the notebook instance starts up, select the kernel that matches your new instance.
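
    If you prefer to script this instead of using the console, you can make the same change with boto3. The following is a minimal sketch; the notebook instance name and target instance type are placeholders.

    import boto3

    sagemaker_client = boto3.client("sagemaker")
    notebook_name = "my-notebook-instance"  # placeholder

    # The notebook instance must be stopped before its instance type can be changed.
    sagemaker_client.stop_notebook_instance(NotebookInstanceName=notebook_name)
    sagemaker_client.get_waiter("notebook_instance_stopped").wait(NotebookInstanceName=notebook_name)

    # Switch to a right-sized instance type, for example ml.g4dn.xlarge.
    sagemaker_client.update_notebook_instance(
        NotebookInstanceName=notebook_name,
        InstanceType="ml.g4dn.xlarge",
    )
    sagemaker_client.get_waiter("notebook_instance_stopped").wait(NotebookInstanceName=notebook_name)

    # Start the instance again and pick the right kernel when it comes up.
    sagemaker_client.start_notebook_instance(NotebookInstanceName=notebook_name)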

  7. Is there a specific instance type which is a good alternative to Amazon Elastic Inference (EI)?

    Every machine learning workload is unique. We recommend using Amazon SageMaker Inference Recommender to help you identify the right instance type for your ML workload, performance requirements, and budget. AWS Inferentia, specifically inf1.xlarge, is the best high performance and low-cost alternative for Amazon EI customers.

Migrate from Amazon Elastic Inference to other instances

The following information can help you migrate your SageMaker-hosted endpoints from instances that use Amazon Elastic Inference accelerators to other instances. The advice varies depending on your framework.

PyTorch

If you're migrating from PyTorch, use the following guidelines.

1. Choose the right instance type

Every machine learning workload is unique. We recommend using Amazon SageMaker Inference Recommender to help you identify the right instance type for your ML workload, performance requirements, and budget. AWS Inferentia, specifically inf1.xlarge, is the best high performance and low-cost alternative for Amazon Elastic Inference customers.

In our load testing with Inference Recommender, g4dn.xlarge instances performed better than m5.large instances with ml.eia2.large attached. With Amazon Elastic Inference, you have to pay the additional cost of the ML instance to which the accelerator is attached. Amazon Elastic Inference also only supports PyTorch 1.5 and TensorFlow 2.3. If you migrate to ml.g4dn instances, you can use the latest versions of PyTorch 1.11 and TensorFlow 2.9. Additionally, ml.g4dn and AWS Inferentia are available in all AWS Regions, whereas Amazon Elastic Inference is only available in 6 Regions. Both AWS Inferentia and ml.g4dn offer better performance at a lower price for most ML inference workloads.

2. Modify inference.py

Modify your inference.py file to remove any Elastic Inference-specific changes and use the default handlers. Based on your use case, you might have different input and output handlers, but the main changes you must make are to the model loading handler model_fn and the predict handler predict_fn. Remove the Elastic Inference-specific predict handler predict_fn and restore the model loading handler model_fn to the default format. The following example shows how to do this, with the parts you should remove from inference.py commented out:

from __future__ import print_function

import os

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F


def model_fn(model_dir, context):
    model = {customer_model}
    # if torch.__version__ in VERSIONS_USE_NEW_API:
    #     import torcheia
    #     loaded_model = loaded_model.eval()
    #     loaded_model = torcheia.jit.attach_eia(loaded_model, 0)
    with open(os.path.join(model_dir, 'model.pth'), 'rb') as f:
        model.load_state_dict(torch.load(f))
    return model


# def predict_fn(input_data, model):
#     logger.info(
#         "Performing EIA inference with Torch JIT context with input of size {}".format(
#             input_data.shape
#         )
#     )
#     device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
#     input_data = input_data.to(device)
#     with torch.no_grad():
#         if torch.__version__ in VERSIONS_USE_NEW_API:
#             import torcheia
#             # torch._C._jit_set_profiling_executor(False)
#             with torch.jit.optimized_execution(True):
#                 return model.forward(input_data)
#         else:
#             with torch.jit.optimized_execution(True, {"target_device": "eia:0"}):
#                 return model(input_data)


def predict_fn(input_data, model):
    return model(input_data)

3. Create a model

Create a new model that points to your modified inference.py file. You can keep the inference.py file locally and point to it by specifying source_dir and entry_point or tar the inference.py file into the model tarball. The following example shows the former case:

from sagemaker.pytorch import PyTorchModel

pytorch = PyTorchModel(
    model_data={model_data_url},
    role=role,
    entry_point="inference.py",
    source_dir="code",
    framework_version="1.5.1",
    py_version="py3",
    sagemaker_session=sagemaker_session,
)
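
If you prefer the latter case, you can package the modified inference.py into the model tarball instead. The following is a minimal sketch using Python's tarfile module; the file names are placeholders, and it assumes the container looks for inference code under code/ in the model artifact.

import tarfile

# Package the model weights and the modified inference.py into model.tar.gz.
# File names are placeholders; adjust them to match your artifacts.
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("model.pth")
    tar.add("inference.py", arcname="code/inference.py")

After creating the tarball, upload it to Amazon S3 and point model_data at its S3 URI.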

4. Deploy the model to the endpoint and invoke it

You can use one of the following options for deploying your model after making the preceding changes.

Option 1: Deploy from scratch

You can deploy the model to a new endpoint with a recommended instance from the Accelerated Computing category, such as G4.

predictor = pytorch.deploy(
    ...
    # instance_type="ml.c5.xlarge",
    instance_type="ml.g4dn.2xlarge",
    ...
)

response = predictor.predict(payload)

Option 2: Update the existing endpoint

Complete the following steps to update your existing endpoint:

  1. Call CreateEndpointConfig to create a new EndpointConfig that uses the new instance type. If you have an autoscaling policy, delete the existing autoscaling policy.

    endpoint_config_response = sagemaker_client.create_endpoint_config(
        EndpointConfigName=endpoint_config_name,
        ProductionVariants=[
            {
                "VariantName": "variant1",        # The name of the production variant.
                "ModelName": model_name,          # The name of the newly created model.
                "InstanceType": instance_type,    # Specify the right-sized instance type.
                "InitialInstanceCount": 1         # Number of instances to launch initially.
            }
        ]
    )
  2. Call UpdateEndpoint and specify your newly created EndpointConfig.

    endpoint_config_response = sagemaker_client.update_endpoint(
        EndpointConfigName=endpoint_config_name,  # The name of the new endpoint config just created
        EndpointName=endpoint_name                # The name of the existing endpoint you want to update
    )
  3. Wait for your endpoint to change status to InService. This takes approximately 10–15 minutes.

  4. Finally, if you need autoscaling for your new endpoint, create a new autoscaling policy for your new endpoint and ProductionVariant, as shown in the sketch after this list.
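
Steps 3 and 4 can also be scripted. The following is a minimal sketch using boto3 and the Application Auto Scaling client; the variant name, capacity limits, and policy settings are placeholders.

import boto3

sagemaker_client = boto3.client("sagemaker")
autoscaling_client = boto3.client("application-autoscaling")

# Step 3: block until the updated endpoint is InService.
sagemaker_client.get_waiter("endpoint_in_service").wait(EndpointName=endpoint_name)

# Step 4: register the variant as a scalable target and attach a scaling policy.
resource_id = f"endpoint/{endpoint_name}/variant/variant1"  # placeholder variant name

autoscaling_client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,  # placeholder capacity limits
    MaxCapacity=4,
)

autoscaling_client.put_scaling_policy(
    PolicyName="my-scaling-policy",  # placeholder policy name
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)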

TensorFlow

If you're migrating from TensorFlow, use the following guidelines.

1. Choose the right instance type

Refer to the 1. Choose the right instance type guidance in the PyTorch section.

2. Deploy the model to the endpoint and invoke it

You can use one of the following options for deploying your model.

Option 1: Deploy from scratch

To migrate from Elastic Inference, redeploy the model to a new endpoint, removing the accelerator_type field and specifying a right-sized instance type from the Accelerated Computing category, such as G4. In the following example, the commented-out lines show the Elastic Inference-specific settings that are removed so that the model deploys without an accelerator.

predictor = tensorflow_model.deploy(
    ...
    instance_type="ml.g4dn.2xlarge"
    # instance_type="ml.c5.xlarge",
    # accelerator_type="ml.eia1.medium"
    ...
)

Option 2: Update the existing endpoint

Refer to the Option 2: Update the existing endpoint guidance in Step 4 of the PyTorch section.

MXNet

If you're migrating from MXNet, use the following guidelines.

1. Choose the right instance type

Refer to the 1. Choose the right instance type guidance in the PyTorch section.

2. Deploy the model to the endpoint and invoke it

You can use one of the following options for deploying your model.

Option 1: Deploy from scratch

To migrate from Elastic Inference, redeploy the model to a new endpoint, removing the accelerator_type field and specifying a right-sized instance type from the Accelerated Computing category, such as G4. In the following example, the commented-out line shows the instance type that is replaced so that the model deploys without an Elastic Inference accelerator.

predictor = mxnet_model.deploy(
    ...
    # instance_type="ml.c5.xlarge",
    instance_type="ml.g4dn.2xlarge"
    ...
)

Option 2: Update the existing endpoint

Refer to the Option 2: Update the existing endpoint guidance in Step 4 of the PyTorch section.

Choose an EI Accelerator Type

Consider the following factors when choosing an accelerator type for a hosted model:

  • Models, input tensors, and batch sizes influence the amount of accelerator memory you need. Start with an accelerator type that provides at least as much memory as the file size of your trained model. Note that a model might use significantly more memory than its file size at runtime.

  • Demands on CPU compute resources, main system memory, and GPU-based acceleration and accelerator memory vary significantly between different kinds of deep learning models. The latency and throughput requirements of the application also determine the amount of compute and acceleration you need. Thoroughly test different configurations of instance types and EI accelerator sizes to make sure you choose the configuration that best fits the performance needs of your application.

For more information on selecting an EI accelerator, see:

Use EI in a SageMaker Notebook Instance

Typically, you build and test machine learning models in a SageMaker notebook before you deploy them for production. You can attach EI to your notebook instance when you create the notebook instance. To test inference performance, you can set up an endpoint that is hosted locally on the notebook instance by using the local mode supported by the TensorFlow, MXNet, and PyTorch estimators and models in the Amazon SageMaker Python SDK. Elastic Inference-enabled PyTorch is not currently supported on notebook instances. For instructions on how to attach EI to a notebook instance and set up a local endpoint for inference, see Attach EI to a Notebook Instance. There are also Elastic Inference-enabled SageMaker Notebook Jupyter kernels for the Elastic Inference-enabled versions of TensorFlow and Apache MXNet. For information about using SageMaker notebook instances, see Use Amazon SageMaker Notebook Instances.
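
A local mode test might look like the following sketch with the SageMaker Python SDK. The model artifact, role, entry point, and payload are placeholders; the accelerator_type value local_sagemaker_notebook tells local mode to use the EI accelerator attached to the notebook instance.

from sagemaker.mxnet import MXNetModel

role = "arn:aws:iam::111122223333:role/SageMakerRole"  # placeholder role

mxnet_model = MXNetModel(
    model_data="s3://my-bucket/model.tar.gz",  # placeholder artifact
    role=role,
    entry_point="inference.py",
    framework_version="1.5.1",
    py_version="py3",
)

# "local" hosts the endpoint on the notebook instance itself; the accelerator type
# below directs inference to the attached EI accelerator.
predictor = mxnet_model.deploy(
    initial_instance_count=1,
    instance_type="local",
    accelerator_type="local_sagemaker_notebook",
)

response = predictor.predict([[1, 2, 3, 4]])  # example payload; shape depends on your model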

Use EI on a Hosted Endpoint

When you are ready to deploy your model for production to provide inferences, you create a SageMaker hosted endpoint. You can attach EI to the instance where your endpoint is hosted to increase its performance at providing inferences. For instructions on how to attach EI to a hosted endpoint instance, see Use EI on Amazon SageMaker Hosted Endpoints.
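
With the SageMaker Python SDK, attaching EI to a hosted endpoint comes down to passing accelerator_type when you deploy. The following is a minimal sketch; the model artifact, role, and instance and accelerator sizes are placeholders.

from sagemaker.tensorflow import TensorFlowModel

role = "arn:aws:iam::111122223333:role/SageMakerRole"  # placeholder role

tensorflow_model = TensorFlowModel(
    model_data="s3://my-bucket/model.tar.gz",  # placeholder artifact
    role=role,
    framework_version="2.3",
)

# accelerator_type attaches an Elastic Inference accelerator to the hosting instance.
predictor = tensorflow_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    accelerator_type="ml.eia2.medium",
)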

Frameworks that Support EI

Amazon Elastic Inference is designed to be used with AWS enhanced versions of TensorFlow, Apache MXNet, or PyTorch machine learning frameworks. These enhanced versions of the frameworks are automatically built into containers when you use the Amazon SageMaker Python SDK, or you can download them as binary files and import them in your own Docker containers.

You can download the EI-enabled TensorFlow binary files from the public amazonei-tensorflow Amazon S3 bucket to the TensorFlow serving containers. For more information about building a container that uses the EI-enabled version of TensorFlow, see Amazon Elastic Inference with TensorFlow in SageMaker.

You can download the EI-enabled MXNet binary files from the public amazonei-apachemxnet Amazon S3 bucket to the MXNet serving containers. For more information about building a container that uses the EI-enabled version of MXNet, see Amazon Elastic Inference with MXNet in SageMaker.

You can download the Elastic Inference-enabled binary for PyTorch. For more information about building a container that uses the EI-enabled version of PyTorch, see Amazon Elastic Inference with PyTorch in SageMaker.
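
If you want to see which binaries are available in the public buckets mentioned above, you can list them anonymously. The following is a minimal sketch; it only prints object keys and makes no assumptions about the bucket layout.

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# List objects in the public bucket that holds the EI-enabled TensorFlow binaries.
# The same approach works for the amazonei-apachemxnet bucket.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="amazonei-tensorflow"):
    for obj in page.get("Contents", []):
        print(obj["Key"])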

To use Elastic Inference in a hosted endpoint, you can choose any of the following frameworks depending on your needs.

If you need to create a custom container for deploying a model that is complex and requires extensions to a framework that the SageMaker pre-built containers do not support, use the low-level AWS SDK for Python (Boto3).
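
With the low-level SDK, you specify the EI accelerator in the endpoint configuration. The following is a minimal sketch; the container image, model artifact, role, and resource names are placeholders.

import boto3

sagemaker_client = boto3.client("sagemaker")

# Create a model from your custom container; the image URI and artifact are placeholders.
sagemaker_client.create_model(
    ModelName="my-ei-model",
    PrimaryContainer={
        "Image": "111122223333.dkr.ecr.us-east-1.amazonaws.com/my-ei-container:latest",
        "ModelDataUrl": "s3://my-bucket/model.tar.gz",
    },
    ExecutionRoleArn="arn:aws:iam::111122223333:role/SageMakerRole",
)

# Attach an EI accelerator to the hosting instance through the endpoint configuration.
sagemaker_client.create_endpoint_config(
    EndpointConfigName="my-ei-endpoint-config",
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": "my-ei-model",
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "AcceleratorType": "ml.eia2.medium",
        }
    ],
)

sagemaker_client.create_endpoint(
    EndpointName="my-ei-endpoint",
    EndpointConfigName="my-ei-endpoint-config",
)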

Use EI with SageMaker Built-in Algorithms

Currently, the Image Classification - MXNet and Object Detection - MXNet built-in algorithms support EI. For an example that uses the Image Classification algorithm with EI, see End-to-End Multiclass Image Classification Example.

EI Sample Notebooks

The following sample notebooks provide examples of using EI in SageMaker: