Use Amazon SageMaker Elastic Inference (EI)
Starting April 15, 2023, AWS will not onboard new customers to Amazon Elastic Inference (EI), and will help current customers migrate their workloads to options that offer better price and performance. After April 15, 2023, new customers will not be able to launch instances with Amazon EI accelerators in Amazon SageMaker, Amazon ECS, or Amazon EC2.
Machine learning (ML) on AWS helps you innovate faster with the most comprehensive set of ML services and infrastructure, available in a low-cost,
pay-as-you-go usage model. AWS continuously delivers better-performing and lower-cost infrastructure for ML inference workloads.
AWS launched Amazon Elastic Inference (EI) in 2018 to enable customers to attach low-cost GPU-powered acceleration to Amazon EC2, Amazon SageMaker instances, or Amazon Elastic Container Service (ECS) tasks
to reduce the cost of running deep learning inference by up to 75% compared to standalone GPU-based instances such as Amazon EC2 P4d and Amazon EC2 G5.
In 2019, AWS launched AWS Inferentia, Amazon's first custom silicon designed to accelerate deep learning workloads by providing high performance
inference in the cloud. Amazon EC2 Inf1 instances based on AWS Inferentia chips deliver up to 2.3x higher throughput and up to 70% lower cost per inference
than comparable current generation GPU-based Amazon EC2 instances. With the availability of new accelerated compute options such as AWS Inferentia
and Amazon EC2 G5 instances, the benefit of attaching a fractional GPU to a CPU host instance using Amazon EI has diminished. For example, customers
hosting models on Amazon EI who move to ml.inf1.xlarge instances can get up to 56% in cost savings and a 2x performance improvement.
Customers can use Amazon SageMaker Inference Recommender to help them choose the best alternative instances to Amazon EI for deploying their ML models.
Frequently asked questions
- Why is Amazon encouraging customers to move workloads from Amazon Elastic Inference (EI) to newer hardware acceleration options such as AWS Inferentia?
Customers get better performance at a much better price than Amazon EI with newer hardware accelerator options such as AWS Inferentia for their inference workloads. AWS Inferentia is designed to provide high performance inference in the cloud, to drive down the total cost of inference, and to make it easy for developers to integrate machine learning into their business applications. To enable customers to benefit from such newer generation hardware accelerators, we will not onboard new customers to Amazon EI after April 15, 2023.
- Which AWS services are impacted by the move to stop onboarding new customers to Amazon Elastic Inference (EI)?
This announcement affects Amazon EI accelerators attached to Amazon EC2 instances, Amazon SageMaker instances, or Amazon Elastic Container Service (ECS) tasks. In Amazon SageMaker, this applies to both endpoints and notebook kernels using Amazon EI accelerators.
- Will I be able to create a new Amazon Elastic Inference (EI) accelerator after April 15, 2023?
No, if you are a new customer and have not used Amazon EI in the past 30 days, then you will not be able to create a new Amazon EI accelerator in your AWS account after April 15, 2023. However, if you have used an Amazon EI accelerator at least once in the past 30 days, you can attach a new Amazon EI accelerator to your instance.
- How do I evaluate alternative instance options for my current Amazon SageMaker Inference Endpoints?
Amazon SageMaker Inference Recommender can help you identify cost-effective deployments to migrate existing workloads from Amazon Elastic Inference (EI) to an appropriate ML instance supported by SageMaker.
- How do I change the instance type for my existing endpoint in Amazon SageMaker?
You can change the instance type for your existing endpoint by doing the following: first, create a new EndpointConfig that uses the new instance type (if you have an autoscaling policy, delete the existing autoscaling policy). Then call UpdateEndpoint while specifying your newly created EndpointConfig. Wait for your endpoint to change status to InService, which takes approximately 10-15 minutes. Finally, if you need autoscaling for your new endpoint, create a new autoscaling policy for this new endpoint and ProductionVariant.
- How do I change the instance type for my existing Amazon SageMaker Notebook Instance using Amazon Elastic Inference (EI)?
Choose Notebook instances in the SageMaker console, and then choose the Notebook Instance you want to update. Make sure the Notebook Instance has a Stopped status. Finally, choose Edit and change your instance type. Make sure that, when your Notebook Instance starts up, you select the right kernel for your new instance. A boto3 sketch of these console steps follows this FAQ list.
- Is there a specific instance type which is a good alternative to Amazon Elastic Inference (EI)?
Every machine learning workload is unique. We recommend using Amazon SageMaker Inference Recommender to help you identify the right instance type for your ML workload, performance requirements, and budget. AWS Inferentia, specifically inf1.xlarge, is the best high-performance and low-cost alternative for Amazon EI customers.
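The console steps in the Notebook Instance question above can also be scripted. The following is a minimal sketch using the AWS SDK for Python (boto3); the notebook instance name and the target instance type are hypothetical placeholders.

import boto3

sm = boto3.client("sagemaker")
notebook_name = "my-notebook-instance"  # hypothetical notebook instance name

# The notebook instance must be in the Stopped status before its type can change.
sm.stop_notebook_instance(NotebookInstanceName=notebook_name)
sm.get_waiter("notebook_instance_stopped").wait(NotebookInstanceName=notebook_name)

# Switch to a right-sized instance type instead of attaching an EI accelerator.
sm.update_notebook_instance(
    NotebookInstanceName=notebook_name,
    InstanceType="ml.g4dn.xlarge",  # hypothetical target instance type
)

sm.start_notebook_instance(NotebookInstanceName=notebook_name)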
Migrate from Amazon Elastic Inference to other instances
The following information can help you migrate your SageMaker-hosted endpoints from instances that use Amazon Elastic Inference accelerators to other instances. The advice varies depending on your framework.
PyTorch
If you're migrating from PyTorch, use the following guidelines.
1. Choose the right instance type
Every machine learning workload is unique. We recommend using Amazon SageMaker Inference Recommender to help you identify the right instance type for your ML workload, performance requirements, and budget. AWS Inferentia, specifically inf1.xlarge, is the best high-performance and low-cost alternative for Amazon Elastic Inference customers.
In our load testing with Inference Recommender, g4dn.xlarge instances performed better than m5.large instances with eia2.large attached.
With Amazon Elastic Inference, you have to pay the additional cost of the ML instance to which the accelerator is attached. Amazon Elastic Inference also only supports PyTorch 1.5 and TensorFlow 2.3. If you migrate to ml.g4dn instances, you can use the latest versions of PyTorch 1.11 and TensorFlow 2.9. Additionally, ml.g4dn and AWS Inferentia are available in all AWS Regions, whereas Amazon Elastic Inference is only available in 6 Regions. Both AWS Inferentia and ml.g4dn offer better performance at a lower price for most ML inference workloads.
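If you prefer to run Inference Recommender programmatically rather than from the console, the following is a minimal sketch using boto3. The job name, IAM role ARN, and registered model package ARN are hypothetical placeholders for your own resources.

import boto3

sm = boto3.client("sagemaker")

# Launch a Default recommendation job that benchmarks the registered model
# package on candidate instance types and reports cost and latency metrics.
sm.create_inference_recommendations_job(
    JobName="ei-migration-recommendation",  # hypothetical job name
    JobType="Default",
    RoleArn="arn:aws:iam::111122223333:role/SageMakerRole",  # hypothetical role ARN
    InputConfig={
        # hypothetical registered model package ARN
        "ModelPackageVersionArn": "arn:aws:sagemaker:us-west-2:111122223333:model-package/my-model-package/1",
    },
)

# After the job completes, inspect the recommended instance types and metrics.
results = sm.describe_inference_recommendations_job(
    JobName="ei-migration-recommendation"
)
for rec in results.get("InferenceRecommendations", []):
    print(rec["EndpointConfiguration"]["InstanceType"], rec["Metrics"])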
2. Modify inference.py
Modify your inference.py file to remove any Elastic Inference-specific changes and use the default handlers. Depending on your use case, you might have different input and output handlers, but the main changes you must make are to the model loading handler model_fn and the predict handler predict_fn. Remove the Elastic Inference-specific predict handler predict_fn and restore the model loading handler model_fn to the default format. The following example shows how to do this, with the parts you should remove from inference.py commented out:
from __future__ import print_function

import os

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np


def model_fn(model_dir, context):
    model = {customer_model}
    # if torch.__version__ in VERSIONS_USE_NEW_API:
    #     import torcheia
    #     loaded_model = loaded_model.eval()
    #     loaded_model = torcheia.jit.attach_eia(loaded_model, 0)
    with open(os.path.join(model_dir, 'model.pth'), 'rb') as f:
        model.load_state_dict(torch.load(f))
    return model


# def predict_fn(input_data, model):
#     logger.info(
#         "Performing EIA inference with Torch JIT context with input of size {}".format(
#             input_data.shape
#         )
#     )
#     device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
#     input_data = input_data.to(device)
#     with torch.no_grad():
#         if torch.__version__ in VERSIONS_USE_NEW_API:
#             import torcheia
#             # torch._C._jit_set_profiling_executor(False)
#             with torch.jit.optimized_execution(True):
#                 return model.forward(input_data)
#         else:
#             with torch.jit.optimized_execution(True, {"target_device": "eia:0"}):
#                 return model(input_data)


def predict_fn(input_data, model):
    return model(input_data)
3. Create a model
Create a new model that points to your modified inference.py file. You can keep the inference.py file locally and point to it by specifying source_dir and entry_point, or tar the inference.py file into the model tarball. The following example shows the former case:
from sagemaker.pytorch import PyTorchModel

pytorch = PyTorchModel(
    model_data={model_data_url},
    role=role,
    entry_point="inference.py",
    source_dir="code",
    framework_version="1.5.1",
    py_version="py3",
    sagemaker_session=sagemaker_session,
)
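For the latter case, the inference.py file is packaged inside the model tarball under a code/ directory. The following is a minimal sketch of building such a tarball, assuming the PyTorch inference container convention; the local file names are placeholders.

import tarfile

# Package the trained weights together with the inference handlers so that
# model.tar.gz can be uploaded to Amazon S3 and referenced by model_data.
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("model.pth", arcname="model.pth")             # trained model weights
    tar.add("inference.py", arcname="code/inference.py")  # inference handlers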
4. Deploy the model to the endpoint and invoke it
You can use one of the following options for deploying your model after making the preceding changes.
Option 1: Deploy from scratch
You can deploy the model to a new endpoint with a recommended instance from the Accelerated Computing category, such as G4.
predictor = pytorch.deploy(
    ...
    # instance_type="ml.c5.xlarge",
    instance_type="ml.g4dn.2xlarge",
    ...
)

response = predictor.predict(payload)
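If you invoke the endpoint from outside the SageMaker Python SDK, a minimal equivalent sketch with the SageMaker runtime client follows; predictor and payload come from the preceding example, and the JSON content type is an illustrative assumption about your model's input format.

import json

import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName=predictor.endpoint_name,  # the endpoint created by deploy() above
    ContentType="application/json",        # assumption: the model accepts JSON input
    Body=json.dumps(payload),
)
result = json.loads(response["Body"].read())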
Option 2: Update the existing endpoint
Complete the following steps to update your existing endpoint:
- Call CreateEndpointConfig to create a new EndpointConfig that uses the new instance type. If you have an autoscaling policy, delete the existing autoscaling policy.

endpoint_config_response = sagemaker_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",  # The name of the production variant.
            "ModelName": model_name,  # The name of the newly created model.
            "InstanceType": instance_type,  # Specify the right-sized instance type.
            "InitialInstanceCount": 1  # Number of instances to launch initially.
        }
    ]
)
- Call UpdateEndpoint and specify your newly created EndpointConfig.

endpoint_config_response = sagemaker_client.update_endpoint(
    EndpointConfigName=endpoint_config_name,  # The name of the new endpoint config just created.
    EndpointName=endpoint_name  # The name of the existing endpoint you want to update.
)
- Wait for your endpoint to change status to InService. This takes approximately 10–15 minutes. A sketch for waiting on this status programmatically follows this list.
- Finally, if you need autoscaling for your new endpoint, create a new autoscaling policy for your new endpoint and ProductionVariant.
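Rather than polling the console for the InService status, you can wait for it programmatically. The following is a minimal sketch using the boto3 endpoint_in_service waiter; endpoint_name is the same endpoint you updated in the preceding steps.

import boto3

sagemaker_client = boto3.client("sagemaker")

# Block until the updated endpoint reaches the InService status.
waiter = sagemaker_client.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=endpoint_name)

status = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
print(status)  # "InService" once the update has finished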
TensorFlow
If you're migrating from TensorFlow, use the following guidelines.
1. Choose the right instance type
Refer to the 1. Choose the right instance type guidance in the PyTorch section.
2. Deploy the model to the endpoint and invoke it
You can use one of the following options for deploying your model.
Option 1: Deploy from scratch
You can migrate from Elastic Inference by re-deploying the model to a new endpoint, removing the accelerator_type field and specifying a right-sized instance type from the Accelerated Computing category, such as G4. In the following example, the commented-out lines show the settings you remove so that you deploy without using an Elastic Inference accelerator.
predictor = tensorflow_model.deploy(
    ...
    instance_type="ml.g4dn.2xlarge",
    # instance_type="ml.c5.xlarge",
    # accelerator_type="ml.eia1.medium"
    ...
)
Option 2: Update the existing endpoint
Refer to the Option 2: Update the existing endpoint guidance in Step 4 of the PyTorch section.
MXNet
If you're migrating from MXNet, use the following guidelines.
1. Choose the right instance type
Refer to the 1. Choose the right instance type guidance in the PyTorch section.
2. Deploy the model to the endpoint and invoke it
You can use one of the following options for deploying your model.
Option 1: Deploy from scratch
You can migrate from Elastic Inference by re-deploying the model to a new endpoint, removing the accelerator_type field and specifying a right-sized instance type from the Accelerated Computing category, such as G4. In the following example, the commented-out line shows the previous instance type; the accelerator_type field is removed so that you deploy without using an Elastic Inference accelerator.
predictor = mxnet_model.deploy(
    ...
    # instance_type="ml.c5.xlarge",
    instance_type="ml.g4dn.2xlarge"
    ...
)
Option 2: Update the existing endpoint
Refer to the Option 2: Update the existing endpoint guidance in Step 4 of the PyTorch section.
Choose an EI Accelerator Type
Consider the following factors when choosing an accelerator type for a hosted model:
- Models, input tensors, and batch sizes influence the amount of accelerator memory you need. Start with an accelerator type that provides at least as much memory as the file size of your trained model. Factor in that a model might use significantly more memory than the file size at runtime.
- Demands on CPU compute resources, main system memory, and GPU-based acceleration and accelerator memory vary significantly between different kinds of deep learning models. The latency and throughput requirements of the application also determine the amount of compute and acceleration you need. Thoroughly test different configurations of instance types and EI accelerator sizes to make sure you choose the configuration that best fits the performance needs of your application.
Use EI in a SageMaker Notebook Instance
Typically, you build and test machine learning models in a SageMaker notebook before you
deploy them for production. You can attach EI to your notebook instance when you create
the notebook instance. You can set up an endpoint that is hosted locally on the notebook
instance by using the local mode supported by TensorFlow, MXNet, and PyTorch estimators
and models in the Amazon SageMaker Python SDK.
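The following is a minimal local-mode sketch on an EI-enabled notebook instance, assuming a TensorFlow model already packaged as model.tar.gz in Amazon S3; the S3 path and framework version are hypothetical placeholders.

import sagemaker
from sagemaker.tensorflow import TensorFlowModel

role = sagemaker.get_execution_role()  # the notebook instance's execution role

tensorflow_model = TensorFlowModel(
    model_data="s3://<bucket>/model.tar.gz",  # hypothetical model artifact location
    role=role,
    framework_version="2.3",                  # an EI-supported TensorFlow version
)

# "local" hosts the endpoint on the notebook instance itself, and
# "local_sagemaker_notebook" uses the EI accelerator attached to the notebook.
predictor = tensorflow_model.deploy(
    initial_instance_count=1,
    instance_type="local",
    accelerator_type="local_sagemaker_notebook",
)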
Use EI on a Hosted Endpoint
When you are ready to deploy your model for production to provide inferences, you create a SageMaker hosted endpoint. You can attach EI to the instance where your endpoint is hosted to increase its performance at providing inferences. For instructions on how to attach EI to a hosted endpoint instance, see Use EI on Amazon SageMaker Hosted Endpoints.
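As a minimal sketch, attaching an EI accelerator with the SageMaker Python SDK is done through the accelerator_type argument of deploy(); the model object, instance type, and accelerator size below are illustrative assumptions.

# tensorflow_model is a framework Model object such as the ones shown in the
# migration examples above; ml.eia2.medium is one of the available EI sizes.
predictor = tensorflow_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    accelerator_type="ml.eia2.medium",
)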
Frameworks that Support EI
Amazon Elastic Inference is designed to be used with AWS enhanced versions of TensorFlow, Apache MXNet, or PyTorch machine learning frameworks. These enhanced versions of the frameworks are automatically built into containers when you use the Amazon SageMaker Python SDK, or you can download them as binary files and import them in your own Docker containers.
- You can download the EI-enabled TensorFlow binary files from the public amazonei-tensorflow Amazon S3 bucket.
- You can download the EI-enabled MXNet binary files from the public amazonei-apachemxnet Amazon S3 bucket.
- You can also download the Elastic Inference enabled binary files for PyTorch.
To use Elastic Inference in a hosted endpoint, you can choose any of the supported frameworks (TensorFlow, MXNet, or PyTorch) depending on your needs. If your model is complex and requires extensions to a framework that the SageMaker pre-built containers do not support, you can instead build a custom container and use the low-level AWS SDK for Python (Boto 3), as shown in the sketch below.
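The following is a minimal sketch of the low-level approach, where the EI accelerator is specified through the AcceleratorType field of the production variant; the configuration, model, and endpoint names are hypothetical placeholders.

import boto3

sm = boto3.client("sagemaker")

sm.create_endpoint_config(
    EndpointConfigName="my-ei-endpoint-config",   # hypothetical config name
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": "my-model",              # hypothetical model created with create_model
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "AcceleratorType": "ml.eia2.medium",  # the EI accelerator to attach
        }
    ],
)

sm.create_endpoint(
    EndpointName="my-ei-endpoint",                # hypothetical endpoint name
    EndpointConfigName="my-ei-endpoint-config",
)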
Use EI with SageMaker Built-in Algorithms
Currently, the Image Classification - MXNet and Object Detection - MXNet built-in algorithms support EI. For an example that uses the Image Classification algorithm with EI, see End-to-End Multiclass Image Classification Example.
EI Sample Notebooks
The following sample notebooks provide examples of using EI in SageMaker:
- Using Amazon Elastic Inference with MXNet on Amazon SageMaker
- Using Amazon Elastic Inference with MXNet on an Amazon SageMaker Notebook Instance
- Using Amazon Elastic Inference with Neo-compiled TensorFlow model on SageMaker
- Using Amazon Elastic Inference with a pre-trained TensorFlow Serving model on SageMaker