Use Amazon SageMaker Elastic Inference (EI)
By using Amazon Elastic Inference (EI), you can increase the throughput and decrease the latency of real-time inferences from deep learning models deployed as Amazon SageMaker hosted models, at a fraction of the cost of using a full GPU instance for your endpoint. To use EI, add an EI accelerator in one of the available sizes to a deployable model in addition to a CPU instance type, and then add that model as a production variant to the endpoint configuration that you use to deploy a hosted endpoint. You can also add an EI accelerator to a SageMaker notebook instance so that you can test and evaluate inference performance while you are building your models.
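For example, with the SageMaker Python SDK you attach the accelerator at deployment time by specifying an `accelerator_type` along with the CPU instance type. The following is a minimal sketch; the model artifact path, IAM role, and framework version are placeholder assumptions, not values from this guide.

```python
# Minimal sketch (SageMaker Python SDK): deploy a model on a CPU instance
# with an EI accelerator attached. Paths, role, and versions are placeholders.
from sagemaker.tensorflow import TensorFlowModel

model = TensorFlowModel(
    model_data="s3://my-bucket/model/model.tar.gz",  # hypothetical model artifact
    role="MySageMakerRole",                          # hypothetical IAM role
    framework_version="2.3.0",
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",        # CPU instance for the endpoint
    accelerator_type="ml.eia2.medium",   # EI accelerator size to attach
)
```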
Elastic Inference is supported in EI-enabled versions of TensorFlow, Apache MXNet, and PyTorch. To use any other deep learning framework, export your model by using ONNX, and then import your model into MXNet. You can then use your model with EI as an MXNet model. For information about importing an ONNX model into MXNet, see Importing an ONNX model into MXNet.
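If you are coming from another framework, the import step looks roughly like the following sketch, which uses the MXNet contrib ONNX API; the `model.onnx` path is a placeholder for a model you exported from your framework.

```python
# Hedged sketch: import an ONNX model into MXNet (MXNet 1.x contrib API).
from mxnet.contrib import onnx as onnx_mxnet

# Returns the symbol graph plus parameter dictionaries for the imported model.
sym, arg_params, aux_params = onnx_mxnet.import_model("model.onnx")  # placeholder path
```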
How EI Works
Amazon Elastic Inference accelerators are network-attached devices that work along with the SageMaker instances in your endpoint to accelerate your inference calls. Elastic Inference accelerates inference by allowing you to attach fractional GPU capacity to any SageMaker instance. You select the instance type that is best suited to run your application and then attach an EI accelerator that provides the right amount of GPU acceleration for your inference needs. This helps you lower your costs when a dedicated GPU instance would be underutilized for inference. We recommend trying Elastic Inference with your model using different CPU instance types and accelerator sizes.
The following EI accelerator types are available. You can configure your endpoints or notebook instances with any EI accelerator type.
The following table lists the throughput in teraflops (TFLOPS) for single-precision floating-point (F32) and half-precision floating-point (F16) operations, as well as the accelerator memory in GB.

| Accelerator Type | F32 Throughput (TFLOPS) | F16 Throughput (TFLOPS) | Memory (GB) |
| --- | --- | --- | --- |
| ml.eia2.medium | 1 | 8 | 2 |
| ml.eia2.large | 2 | 16 | 4 |
| ml.eia2.xlarge | 4 | 32 | 8 |
| ml.eia1.medium | 1 | 8 | 1 |
| ml.eia1.large | 2 | 16 | 2 |
| ml.eia1.xlarge | 4 | 32 | 4 |
Choose an EI Accelerator Type
Consider the following factors when choosing an accelerator type for a hosted model:
- Models, input tensors, and batch sizes influence the amount of accelerator memory you need. Start with an accelerator type that provides at least as much memory as the file size of your trained model, and keep in mind that a model might use significantly more memory than its file size at runtime. A rough starting-point check is sketched after this list.

- Demands on CPU compute resources, main system memory, GPU-based acceleration, and accelerator memory vary significantly between different kinds of deep learning models. The latency and throughput requirements of the application also determine the amount of compute and acceleration you need. Thoroughly test different combinations of instance types and EI accelerator sizes to make sure you choose the configuration that best fits the performance needs of your application.
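As a rough, illustrative starting point only (an assumption based on the guidance above, not a sizing tool), you can compare the size of your model artifact against the accelerator memory sizes in the table:

```python
# Rough starting-point check: pick the smallest EI accelerator whose memory
# at least covers the model file size. Runtime memory use can be much higher,
# so always validate with real load tests.
import os

# Accelerator memory in GB, from the table above.
EI_MEMORY_GB = {
    "ml.eia2.medium": 2, "ml.eia2.large": 4, "ml.eia2.xlarge": 8,
    "ml.eia1.medium": 1, "ml.eia1.large": 2, "ml.eia1.xlarge": 4,
}

def smallest_accelerator_for(model_path):
    """Return the smallest EI accelerator whose memory covers the model file size."""
    model_gb = os.path.getsize(model_path) / 1024 ** 3
    for name, mem_gb in sorted(EI_MEMORY_GB.items(), key=lambda kv: kv[1]):
        if mem_gb >= model_gb:
            return name
    return None  # model is larger than any EI accelerator's memory

# Example: smallest_accelerator_for("model.tar.gz")
```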
Use EI in a SageMaker Notebook Instance
Typically, you build and test machine learning models in a SageMaker notebook before you deploy them for production. You can attach EI to your notebook instance when you create the notebook instance. You can then set up an endpoint that is hosted locally on the notebook instance by using the local mode supported by the TensorFlow, MXNet, and PyTorch estimators and models in the Amazon SageMaker Python SDK.
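A minimal sketch of what local-mode deployment with EI might look like, assuming a notebook instance that has an EI accelerator attached (the model artifact, IAM role, inference script, and framework version below are placeholders):

```python
# Hedged sketch: local-mode deployment on a notebook instance with an attached
# EI accelerator. Artifact path, role, script, and versions are placeholders.
from sagemaker.mxnet import MXNetModel

model = MXNetModel(
    model_data="file:///home/ec2-user/SageMaker/model.tar.gz",  # hypothetical local artifact
    role="MySageMakerRole",                                      # hypothetical IAM role
    entry_point="inference.py",                                  # hypothetical inference script
    framework_version="1.7.0",
    py_version="py3",
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="local",                        # host the endpoint on the notebook instance
    accelerator_type="local_sagemaker_notebook",  # use the notebook instance's EI accelerator
)
```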
Use EI on a Hosted Endpoint
When you are ready to deploy your model for production to provide inferences, you create a SageMaker hosted endpoint. You can attach EI to the instance where your endpoint is hosted to increase its performance at providing inferences. For instructions on how to attach EI to a hosted endpoint instance, see Use EI on Amazon SageMaker Hosted Endpoints.
Frameworks that Support EI
Amazon Elastic Inference is designed to be used with AWS enhanced versions of TensorFlow, Apache MXNet, or PyTorch machine learning frameworks. These enhanced versions of the frameworks are automatically built into containers when you use the Amazon SageMaker Python SDK, or you can download them as binary files and import them in your own Docker containers.
- You can download the EI-enabled TensorFlow binary files from the public amazonei-tensorflow Amazon S3 bucket.

- You can download the EI-enabled MXNet binary files from the public amazonei-apachemxnet Amazon S3 bucket.

- You can download the EI-enabled PyTorch binary files from the public amazonei-pytorch Amazon S3 bucket.
To use Elastic Inference in a hosted endpoint, you can choose the EI-enabled TensorFlow, MXNet, or PyTorch framework, depending on your needs.
If you need to create a custom container for deploying a model that is complex and requires extensions to a framework that the SageMaker pre-built containers do not support, use the low-level AWS SDK for Python (Boto3) instead.
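A minimal Boto3 sketch of attaching an EI accelerator to a production variant is shown below; the model, endpoint configuration, and endpoint names are placeholders, and the model is assumed to already exist (created with `create_model`).

```python
# Hedged sketch (low-level Boto3): attach an EI accelerator to a production
# variant in an endpoint configuration, then create the hosted endpoint.
import boto3

sm = boto3.client("sagemaker")

sm.create_endpoint_config(
    EndpointConfigName="my-ei-endpoint-config",      # placeholder name
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "my-model",                 # placeholder: an existing SageMaker model
            "InitialInstanceCount": 1,
            "InstanceType": "ml.m5.xlarge",          # CPU instance for the endpoint
            "AcceleratorType": "ml.eia2.medium",     # EI accelerator size
            "InitialVariantWeight": 1.0,
        }
    ],
)

sm.create_endpoint(
    EndpointName="my-ei-endpoint",                   # placeholder name
    EndpointConfigName="my-ei-endpoint-config",
)
```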
Use EI with SageMaker Built-in Algorithms
Currently, the Image Classification - MXNet and Object Detection - MXNet built-in algorithms support EI. For an example that uses the Image Classification algorithm with EI, see the End-to-End Multiclass Image Classification Example notebook.
EI Sample Notebooks
The following sample notebooks provide examples of using EI in SageMaker:

- Using Amazon Elastic Inference with MXNet on Amazon SageMaker

- Using Amazon Elastic Inference with MXNet on an Amazon SageMaker Notebook Instance

- Using Amazon Elastic Inference with Neo-compiled TensorFlow model on SageMaker

- Using Amazon Elastic Inference with a pre-trained TensorFlow Serving model on SageMaker