Choosing instance types for large model inference - Amazon SageMaker

When deploying deep learning models, we typically balance the cost of hosting these models against the performance in terms of latency, throughput, and accuracy. A core input to this equation is the SageMaker instance type. SageMaker offers many instance types with different GPU devices. For a given model, there are likely multiple suitable instances for hosting the model for inference. Benchmarking can help you decide which instance to proceed with.

The following sections provide some guidelines that you can follow to determine which instance type to choose for hosting large models. To use these guidelines, you should know the following characteristics of your use case:

  • Model architecture or type, such as OPT, GPTJ, BLOOM, or Bert

  • Data type precision, such as fp32, fp16, bf16, or int8

  • Model size in MB or GB

  • Input and output token sizes

Determining possible instance types

When choosing an instance type, consider the size of the model as well as available GPU devices. The large model inference (LMI) containers are currently only supported on instances with NVIDIA GPUs, and are not supported on Graviton instances. For the most up-to-date information on available GPU instances, see Recommended GPU Instances.

When deploying large models, the ideal situation is to fit the model on a single GPU. This is the best option with respect to performance as it eliminates the overhead of communication between GPU devices. For some models it is simply impossible to fit them on a single GPU due to model size. For other models, they may fit on a single GPU, but it may be more cost effective to partition the model across multiple cheaper GPUs.

The following sections show how you can develop a metric to filter the list of available GPUs to ones that are likely to work for your use case.

Determining possible instance types based on data type

The GPUs available on SageMaker differ in their native support of data types. If you plan to deploy your model with bf16, choose an instance with GPU devices that support compute capability 8.0+, since native bf16 support begins with the NVIDIA Ampere architecture. If you plan to deploy your model with int8, we highly recommend that you select an instance with GPU devices that support compute capability 7.5+, as these GPUs contain int8 tensor cores. You can still deploy with int8 on lower compute capabilities using either HuggingFace Accelerate (with bitsandbytes quantization) or DeepSpeed (with ZeroQuant quantization), but expect lower performance compared to GPUs with native support for int8 math.

To verify the compute capability of the GPU, see Your GPU Compute Capability on NVIDIA's website.
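
If it helps to automate this check, you can keep a small lookup of compute capabilities for common SageMaker GPU instance families. The values in the following sketch are illustrative assumptions; verify each entry against NVIDIA's published compute capability table:

```python
# Sketch: map common SageMaker GPU instance families to the compute
# capability of their GPUs. The values below are assumptions for
# illustration; verify them against NVIDIA's compute capability table.
GPU_COMPUTE_CAPABILITY = {
    "p3": 7.0,    # NVIDIA V100
    "g4dn": 7.5,  # NVIDIA T4
    "g5": 8.6,    # NVIDIA A10G
    "p4d": 8.0,   # NVIDIA A100 40GB
    "p4de": 8.0,  # NVIDIA A100 80GB
}

def supports_int8_tensor_cores(family: str) -> bool:
    """int8 tensor cores are available from compute capability 7.5 onward."""
    return GPU_COMPUTE_CAPABILITY.get(family, 0.0) >= 7.5

print([f for f in GPU_COMPUTE_CAPABILITY if supports_int8_tensor_cores(f)])
```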

Estimating a lower bound for the memory required to host a model

To further filter the list of GPUs, determine how large your model is for the data type precision with which you wish to host the model. If the size of the model can fit on a single GPU, and low latency is the highest priority, we recommend choosing an instance with a GPU that has sufficient memory to host this model. You can estimate a lower bound for the memory required to host your model based on the number of parameters in your model and the data type. This lower bound memory estimate (LBME) represents the bare minimum memory in bytes required to load the model parameters into the GPU memory. Calculate the LBME using the following equations:

  • int8: LBME = number of parameters

  • fp16 and bf16: LBME = 2 × number of parameters

  • fp32: LBME = 4 × number of parameters
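
The equations above translate directly into a small helper; the 20-billion-parameter example is illustrative:

```python
# Sketch: lower bound memory estimate (LBME) in bytes, following the
# equations above (bytes per parameter depend on the data type).
BYTES_PER_PARAM = {"int8": 1, "fp16": 2, "bf16": 2, "fp32": 4}

def lbme_bytes(num_parameters: int, dtype: str) -> int:
    return BYTES_PER_PARAM[dtype] * num_parameters

# Example: a 20-billion-parameter model in fp16 needs at least 40 GB
# just to load its parameters.
print(lbme_bytes(20_000_000_000, "fp16") / 1e9)  # → 40.0
```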

Determining possible instances based on number of partitions

Determining the level of partitioning to use with your model comes down to the following factors:

  • Size of the model

  • Cost you are willing to pay for an instance

  • Availability of a given instance

  • Your latency requirements

For example, the EleutherAI/GPT-NEOX-20b model takes about 45 GB to host in fp16. You can deploy this model on a p4de.24xlarge instance without any sharding, since the GPU memory available per device is 80 GB. This is currently the only GPU on AWS that can fit this model fully on a single device. With 8 GPUs on that instance, you can host 8 copies of the model. Alternatively, you can deploy this model with 2-way partitioning on a g5.12xlarge. With 4 GPUs, you can host 2 copies of the model. Using 4 g5.12xlarge instances to host 8 copies of this model costs close to half as much as 1 p4de.24xlarge instance (though the remaining GPU memory on the p4de.24xlarge supports larger batch sizes). While performance is likely lower on the g5.12xlarge, it might make more sense from a cost perspective.
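
The trade-off above can be sketched as a quick calculation. The instance specs below are assumptions for illustration; check the current EC2 instance pages before relying on them:

```python
# Sketch: how many copies of a sharded model fit on a given instance.
# Instance specs used in the examples are illustrative assumptions.
def copies_per_instance(model_gb: float, gpu_mem_gb: float,
                        num_gpus: int, partitions: int) -> int:
    """Copies that fit when each copy is sharded across `partitions` GPUs."""
    if model_gb / partitions > gpu_mem_gb:
        return 0  # a single shard does not fit on one GPU
    return num_gpus // partitions

# p4de.24xlarge: 8 x 80 GB GPUs; g5.12xlarge: 4 x 24 GB GPUs
print(copies_per_instance(45, 80, 8, 1))  # → 8
print(copies_per_instance(45, 24, 4, 2))  # → 2
print(copies_per_instance(45, 24, 4, 1))  # → 0 (45 GB does not fit in 24 GB)
```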

Some instances may not be readily available in certain AWS Regions. You can check the availability of an instance with a Region/Availability Zone using the AWS CLI:

aws ec2 describe-instance-type-offerings \
    --location-type "availability-zone" \
    --filters Name=location,Values=us-east-2a \
    --region us-east-2 \
    --query "InstanceTypeOfferings[*].[InstanceType]" \
    --output text | sort

Fewer partitions generally result in lower overall latency. The best performance comes from a single-GPU deployment, but we highly recommend that you experiment with partitioning to understand the latency impact for your specific model.

You can narrow down the list of possible instances by considering the LBME you calculated in the previous step, and your desired number of partitions. If possible, consider multiple partition values to keep many instance types in consideration. For example, if the LBME is 30 GB, some estimates for different levels of partitioning are:

  • Single GPU deployment on p4d.24xlarge (40 GB per GPU) or p4de.24xlarge (80 GB per GPU)

  • 2-GPU deployment on p3.8xlarge (16 GB per GPU) or g5.12xlarge (24 GB per GPU)
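
A simple filter over candidate instances captures this step. The per-GPU memory figures below are assumptions; confirm them against the current instance documentation:

```python
# Sketch: filter candidate instances for a given LBME (in GB) and partition
# count. Per-GPU memory figures below are illustrative assumptions.
INSTANCES = {
    "p4d.24xlarge":  {"gpus": 8, "gpu_mem_gb": 40},
    "p4de.24xlarge": {"gpus": 8, "gpu_mem_gb": 80},
    "p3.8xlarge":    {"gpus": 4, "gpu_mem_gb": 16},
    "g5.12xlarge":   {"gpus": 4, "gpu_mem_gb": 24},
}

def viable(lbme_gb: float, partitions: int) -> list:
    """Instances with enough GPUs and enough combined memory for the model."""
    return sorted(
        name for name, spec in INSTANCES.items()
        if spec["gpus"] >= partitions
        and spec["gpu_mem_gb"] * partitions >= lbme_gb
    )

print(viable(30, 1))  # single-GPU candidates
print(viable(30, 2))  # 2-way partitioning candidates
```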

Choosing an engine

LMI deep learning containers (DLCs) support the DeepSpeed, FasterTransformer, and HuggingFace Accelerate backends. All of these frameworks can deploy and host large models partitioned across multiple GPUs. You can typically expect higher performance (lower latency or higher throughput) with DeepSpeed or FasterTransformer, but these engines do not offer optimized inference for all model architectures. The frameworks also implement model parallelism differently: DeepSpeed and FasterTransformer use tensor parallelism, which is typically more performant at the cost of higher GPU memory usage, while HuggingFace Accelerate uses pipeline parallelism, which uses less memory but is also less performant.

We recommend that you use DeepSpeed or FasterTransformer when possible because the fused CUDA kernels significantly improve performance compared to using HuggingFace Accelerate. DeepSpeed currently offers fused kernel implementations for the following model architectures:

  • Bert

  • DistilBert

  • GPT Neo, GPT Neo X, GPT2

  • OPT


  • Megatron

  • Stable Diffusion

FasterTransformer offers fused kernel implementations for the following model architectures:

  • GPT2

  • OPT


  • T5

For other model architectures there is no significant difference. We recommend that you experiment with both DeepSpeed and HuggingFace Accelerate to determine which engine works best for your model. To see which models FasterTransformer supports, see the FasterTransformer support matrix.

To choose an instance type, adjust your LBME for the specific engine. Estimating the expected GPU usage also differs between DeepSpeed, FasterTransformer, and HuggingFace Accelerate due to differences in how they implement model parallelism.

DeepSpeed and FasterTransformer use tensor parallelism. Some modules, such as embeddings, do not support tensor parallelism and are instead replicated across all GPUs. HuggingFace Accelerate uses naive pipeline partitioning, which does not result in replicated modules. Thus, with DeepSpeed or FasterTransformer you can typically expect memory requirements to increase with the tensor parallelism degree.

The following sections show how you can adjust your LBME to get a loaded model memory estimate (LMME), which is an estimate of the memory required per GPU to load your model.

HuggingFace Accelerate

With HuggingFace Accelerate, the pipeline partitioning does not significantly increase the memory required to load the model. No sub-modules need to be duplicated across GPUs.

  • For single GPU deployment, LMME = 1.15 × LBME.

  • For multi-GPU deployment, LMME = (1.15 + 0.035 × number of partitions) × LBME. Here the number of partitions is assumed to be a multiple of 2.
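
These two cases can be written as one function; a minimal sketch:

```python
# Sketch: loaded model memory estimate (LMME) for HuggingFace Accelerate,
# following the two equations above.
def lmme_accelerate(lbme_gb: float, partitions: int) -> float:
    """Total memory (GB) required to load the model across all partitions."""
    if partitions == 1:
        return 1.15 * lbme_gb
    return (1.15 + 0.035 * partitions) * lbme_gb

print(lmme_accelerate(30, 1))  # single GPU
print(lmme_accelerate(30, 4))  # 4-way pipeline partitioning
```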

DeepSpeed or FasterTransformer

With DeepSpeed or FasterTransformer, the tensor parallel partitioning results in some sub-modules of the model being duplicated since not every module supports tensor parallelism. The level of module replication strongly depends on the model architecture.

  • For single GPU deployment, LMME = 1.20 × LBME.

  • For multi-GPU deployment, LMME = multiplier × LBME, where the multiplier depends strongly on the model architecture and the data type. For some popular architectures, the recommended multipliers are presented in the following table:

    Model architecture   fp32   fp16   int8
    GPT variants         1.45   1.55   1.55
    BLOOM variants       1.5    1.55   1.55
    OPT variants         1.55   1.75   1.75
    Other models         1.5    1.5    1.5
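
Combined with the single-GPU case, the table above can be looked up programmatically; a sketch:

```python
# Sketch: LMME for DeepSpeed / FasterTransformer using the multiplier table
# above. Architectures not listed fall back to the "Other models" row.
MULTIPLIERS = {
    "gpt":   {"fp32": 1.45, "fp16": 1.55, "int8": 1.55},
    "bloom": {"fp32": 1.50, "fp16": 1.55, "int8": 1.55},
    "opt":   {"fp32": 1.55, "fp16": 1.75, "int8": 1.75},
    "other": {"fp32": 1.50, "fp16": 1.50, "int8": 1.50},
}

def lmme_tensor_parallel(lbme_gb: float, architecture: str,
                         dtype: str, partitions: int) -> float:
    """Total memory (GB) to load the model with a tensor-parallel engine."""
    if partitions == 1:
        return 1.20 * lbme_gb
    table = MULTIPLIERS.get(architecture, MULTIPLIERS["other"])
    return table[dtype] * lbme_gb

print(lmme_tensor_parallel(40, "gpt", "fp16", 4))
```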

Adjusting loaded model memory estimate for sequence length and batch size

During inference, more memory is required for the input, intermediate, and output tensors than your LMME suggests. You also need to account for the additional memory required as a result of partitioning the model.

The runtime model memory estimate (RMME), defined in the following equation, accounts for the additional memory requirement based on the sequence length (total number of input and output tokens):

RMME per GPU = multiplier × LMME / number of partitions

The recommended multipliers listed below are based on a sequence length of 1024. If you plan to use shorter or longer sequences, you can further adjust these multipliers by multiplying them by sequence length / 1024.

  • GPT variants – 1.1 + 0.05 × (number of partitions - 1)

  • BLOOM variants – 1.15 + 0.05 × (number of partitions - 1)

  • OPT variants – 1.20 + 0.1 × (number of partitions - 1)

If you plan on accommodating batch sizes larger than 1, multiply the aforementioned multiplier by your expected batch size to get the final multiplier for calculating RMME per GPU.
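
The full RMME calculation then looks like the following sketch; the example numbers are illustrative:

```python
# Sketch: runtime model memory estimate (RMME) per GPU, applying the
# sequence-length multipliers above plus the batch-size adjustment.
BASE_SEQ_MULTIPLIER = {
    "gpt":   lambda p: 1.10 + 0.05 * (p - 1),
    "bloom": lambda p: 1.15 + 0.05 * (p - 1),
    "opt":   lambda p: 1.20 + 0.10 * (p - 1),
}

def rmme_per_gpu(lmme_gb: float, architecture: str, partitions: int,
                 sequence_length: int = 1024, batch_size: int = 1) -> float:
    multiplier = BASE_SEQ_MULTIPLIER[architecture](partitions)
    multiplier *= sequence_length / 1024  # multipliers assume 1024 tokens
    multiplier *= batch_size
    return multiplier * lmme_gb / partitions

# Example: a GPT variant with an LMME of 62 GB and 4-way partitioning
print(rmme_per_gpu(62.0, "gpt", 4))
```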

Finalizing possible instance types

With the RMME per GPU you can filter the list of available GPUs to ones that are likely to work for your use case. RMME per GPU is an intentionally cautious metric to account for the performance variance between different model architectures with different data types and engine framework versions. If you find that any instances are barely excluded by your estimate, we recommend trying them out.
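
As a final step, this filter might be sketched as follows, with illustrative per-GPU memory figures:

```python
# Sketch: keep instances whose per-GPU memory covers the RMME per GPU.
# Per-GPU memory figures below are illustrative assumptions.
INSTANCE_GPU_MEM_GB = {
    "p3.8xlarge": 16, "g5.12xlarge": 24,
    "p4d.24xlarge": 40, "p4de.24xlarge": 80,
}

def final_candidates(rmme_per_gpu_gb: float) -> list:
    return sorted(name for name, mem in INSTANCE_GPU_MEM_GB.items()
                  if mem >= rmme_per_gpu_gb)

print(final_candidates(19.4))  # instances with at least ~19.4 GB per GPU
```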