
Before you get started with Amazon Elastic Inference

Amazon Elastic Inference Service Limits

Before you start using Elastic Inference accelerators, be aware of the following limitations:

  • Elastic Inference accelerator instance limit: By default, you can attach up to five Elastic Inference accelerators to each instance, and only during instance launch (see the launch sketch after this list). This limit is adjustable. We recommend testing the optimal setup before deploying to production.
  • Elastic Inference Sharing: You cannot share an Elastic Inference accelerator between instances.
  • Elastic Inference Transfer: You cannot detach an Elastic Inference accelerator from an instance or transfer it to another instance. If you no longer need an Elastic Inference accelerator, you must terminate the instance. You also cannot change the Elastic Inference accelerator type; instead, terminate the instance and launch a new instance with a different Elastic Inference accelerator specification.
  • Supported Libraries: Only the Amazon Elastic Inference enhanced MXNet, TensorFlow, and PyTorch libraries can make inference calls to Elastic Inference accelerators.
  • Elastic Inference Attachment: Elastic Inference accelerators can only be attached to instances in a VPC.
  • Reserving accelerator capacity: Pricing for Elastic Inference accelerators is available at On-Demand Instance rates only. You can attach an accelerator to a Reserved Instance, Scheduled Reserved Instance, or Spot Instance; however, the On-Demand price for the Elastic Inference accelerator applies. You cannot reserve or schedule Elastic Inference accelerator capacity.
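
For illustration, the following boto3 sketch launches a c5.xlarge instance with an eia2.medium accelerator attached at launch, in line with the launch-time and VPC constraints above. The AMI, subnet, security group, and instance profile values are placeholders, and the instance additionally needs IAM permissions and network access to the Elastic Inference service endpoint, which are not shown here.

    import boto3

    ec2 = boto3.client("ec2")

    # Attach the accelerator at launch; it cannot be added to a running
    # instance later. All IDs and names below are placeholders.
    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",      # for example, a Deep Learning AMI
        InstanceType="c5.xlarge",
        MinCount=1,
        MaxCount=1,
        SubnetId="subnet-0123456789abcdef0",  # must be a VPC subnet
        SecurityGroupIds=["sg-0123456789abcdef0"],
        IamInstanceProfile={"Name": "ei-instance-profile"},
        ElasticInferenceAccelerators=[{"Type": "eia2.medium", "Count": 1}],
    )
    instance_id = response["Instances"][0]["InstanceId"]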

Choosing an Instance and Accelerator Type for Your Model

Demands on CPU compute resources, CPU memory, GPU-based acceleration, and GPU memory vary significantly between different types of deep learning models. The latency and throughput requirements of the application also determine the amount of instance compute and Elastic Inference acceleration you need. Consider the following when you choose an instance and accelerator type combination for your model:

  • Before you evaluate the right combination of resources for your model or application stack, determine the target latency, throughput needs, and any constraints. For example, assume your application must respond within 300 milliseconds (ms). If data retrieval (including any authentication) and preprocessing take 200 ms, you have a 100 ms window for the inference request itself (see the latency sketch after this list). Using this analysis, you can determine the lowest-cost infrastructure combination that meets these targets.

  • Start with a reasonably small combination of resources, such as a budget-friendly c5.xlarge CPU instance paired with an eia2.medium accelerator. This combination has been tested to work well for various computer vision workloads (including a large version of ResNet, ResNet-200), and it gives comparable or better performance than the more costly p2.xlarge GPU instance. You can then resize the instance or accelerator type to meet your latency targets.

  • I/O data transfer between the instance and the accelerator adds to inference latency because Elastic Inference accelerators are attached over the network.

  • If you use multiple models with your accelerator, you might need a larger accelerator size to better support both compute and memory needs. This also applies if you use the same model from multiple application processes on the instance.

  • You can convert your model to mixed precision to take advantage of the accelerator's higher FP16 TFLOPS, which lowers latency and improves throughput (a casting sketch follows this list).
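
The latency-budget arithmetic above can be made concrete with a small Python sketch that subtracts the example preprocessing time from the end-to-end target and times an arbitrary predict callable. The numbers mirror the example in the first bullet; measure_latency_ms and predict are illustrative names, not part of any library.

    import time

    # Example numbers from this section: a 300 ms end-to-end target minus
    # 200 ms of data retrieval and preprocessing leaves a 100 ms window
    # for the inference request itself.
    END_TO_END_TARGET_MS = 300
    PREPROCESSING_MS = 200
    inference_budget_ms = END_TO_END_TARGET_MS - PREPROCESSING_MS  # 100 ms

    def measure_latency_ms(predict, payload, warmup=10, runs=100):
        """Return the average latency of predict(payload) in milliseconds."""
        for _ in range(warmup):        # warm up caches and lazy initialization
            predict(payload)
        start = time.perf_counter()
        for _ in range(runs):
            predict(payload)
        return (time.perf_counter() - start) * 1000.0 / runs

    # Usage with your own model and sample input:
    #   observed = measure_latency_ms(model.predict, sample_input)
    #   assert observed <= inference_budget_ms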
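
For the mixed-precision point, a minimal, framework-generic sketch in stock PyTorch casts a saved TorchScript model to FP16 before deployment. The file names are placeholders, and the Elastic Inference enhanced libraries have their own model-loading paths, so treat this as an outline of the idea rather than the exact procedure.

    import torch

    # Minimal sketch, assuming a trained FP32 model already exported as a
    # TorchScript file. File names are placeholders.
    model = torch.jit.load("model_fp32.pt").eval()

    # Cast parameters and buffers to FP16 so inference can use the
    # accelerator's higher FP16 throughput. Inputs passed at inference
    # time must be cast to FP16 as well (for example, tensor.half()).
    model_fp16 = model.half()

    torch.jit.save(model_fp16, "model_fp16.pt")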

Using Amazon Elastic Inference with EC2 Auto Scaling

When you create an Auto Scaling group, you can specify the information required to configure the Amazon EC2 instances, including Elastic Inference accelerators. To do this, specify a launch template that contains your instance configuration and the Elastic Inference accelerator type, as shown in the sketch below.
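
As a sketch of that flow with boto3 (the names, IDs, and group sizes below are placeholders), you might create a launch template that includes the accelerator type and then reference it from the Auto Scaling group:

    import boto3

    ec2 = boto3.client("ec2")
    autoscaling = boto3.client("autoscaling")

    # Launch template carrying the instance configuration, including the
    # Elastic Inference accelerator type. All values are placeholders.
    ec2.create_launch_template(
        LaunchTemplateName="ei-inference-template",
        LaunchTemplateData={
            "ImageId": "ami-0123456789abcdef0",
            "InstanceType": "c5.xlarge",
            "SecurityGroupIds": ["sg-0123456789abcdef0"],
            "IamInstanceProfile": {"Name": "ei-instance-profile"},
            "ElasticInferenceAccelerators": [{"Type": "eia2.medium", "Count": 1}],
        },
    )

    # Auto Scaling group that launches instances from the template into
    # VPC subnets (placeholder subnet ID).
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="ei-inference-asg",
        LaunchTemplate={
            "LaunchTemplateName": "ei-inference-template",
            "Version": "$Latest",
        },
        MinSize=1,
        MaxSize=4,
        VPCZoneIdentifier="subnet-0123456789abcdef0",
    )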