Expert parallelism
A Mixture of Experts (MoE) model is a type of transformer model that employs a sparse approach, making it lighter to train than traditional dense models. In this MoE neural network architecture, only a subset of the model's components, called experts, is used for each input. This approach offers several advantages, including more efficient training and faster inference, even at a larger model size. In other words, with the same compute budget you would spend training a full dense model, you can fit a larger model or dataset when using MoE.
An MoE model consists of multiple experts, each of which is a neural network, typically a feed-forward network (FFN). A gate network called a router determines which tokens are sent to which expert. These experts specialize in processing specific aspects of the input data, enabling the model to train faster and reduce compute cost while achieving the same performance quality as its dense counterpart. To learn more about Mixture of Experts in general, refer to the blog Applying Mixture of Experts in LLM Architectures.
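To make the routing idea concrete, the following is a minimal, self-contained sketch of a sparse MoE layer in PyTorch. It is purely illustrative and not SMP or Megatron code; the class name ToyMoELayer and all hyperparameters are hypothetical. A linear router scores each token, the top-k experts are selected, and only those expert FFNs process the token.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Illustrative MoE layer: a linear router picks top-k experts per token."""
    def __init__(self, hidden_dim, ffn_dim, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_dim, num_experts)   # gate network (router)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_dim, ffn_dim), nn.GELU(),
                          nn.Linear(ffn_dim, hidden_dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                   # x: [num_tokens, hidden_dim]
        scores = F.softmax(self.router(x), dim=-1)          # routing probabilities
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        # Dispatch each token only to its selected experts (sparse activation).
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += topk_scores[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Example: 16 tokens with hidden size 64, 8 experts, 2 active per token.
layer = ToyMoELayer(hidden_dim=64, ffn_dim=256)
tokens = torch.randn(16, 64)
print(layer(tokens).shape)   # torch.Size([16, 64])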
Expert parallelism is a type of parallelism that splits the experts of an MoE model across GPU devices. SMP v2 integrates with NVIDIA Megatron to implement expert parallelism, and runs on top of the PyTorch FSDP APIs.
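Conceptually, expert parallelism places a subset of the experts on each GPU in the expert-parallel group, and tokens are exchanged between GPUs so that each token reaches the device hosting the expert its router selected. The toy sketch below (hypothetical names, not SMP or Megatron internals) shows how 8 experts could be placed across an expert-parallel degree of 4.

# Toy illustration of expert placement; not actual SMP/Megatron logic.
num_experts = 8
expert_parallel_degree = 4            # number of GPUs the experts are split across
experts_per_rank = num_experts // expert_parallel_degree

placement = {
    rank: list(range(rank * experts_per_rank, (rank + 1) * experts_per_rank))
    for rank in range(expert_parallel_degree)
}
print(placement)
# {0: [0, 1], 1: [2, 3], 2: [4, 5], 3: [6, 7]}
# At runtime, an all-to-all exchange sends each token to the rank that
# holds the expert(s) the router selected for it.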
Hugging Face Transformer models compatible with SMP expert parallelism
SMP v2 currently offers expert parallelism support for the following Hugging Face transformer models.
Configure expert parallelism
For expert_parallel_degree, you select a value for the degree of expert parallelism. The value must evenly divide the number of GPUs in your cluster. For example, to shard your model while using an instance with 8 GPUs, choose 2, 4, or 8. We recommend that you start with a small number, and gradually increase it until the model fits in the GPU memory.
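The sketch below illustrates the divisibility requirement only; the instance and GPU counts are placeholder values, and the check itself is not part of the SMP API.

# Sketch: verify that a chosen expert_parallel_degree evenly divides
# the total number of GPUs in the cluster (placeholder values).
instance_count = 2
gpus_per_instance = 8                 # for example, an instance with 8 GPUs
total_gpus = instance_count * gpus_per_instance

expert_parallel_degree = 4
assert total_gpus % expert_parallel_degree == 0, (
    "expert_parallel_degree must evenly divide the number of GPUs in the cluster"
)

valid_degrees = [d for d in range(1, total_gpus + 1) if total_gpus % d == 0]
print(valid_degrees)   # [1, 2, 4, 8, 16]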
The following code snippets show how to add the SMP initialization module torch.sagemaker.init() to your training script and set up the SMP configuration dictionary in JSON format for the training job launcher, following the two-step process introduced in Use the SageMaker model parallelism library v2. You don't need to make any changes to your PyTorch model or PyTorch FSDP configuration. For more information about the expert_parallel_degree parameter, see SMP v2 core feature configuration parameters.
Note
You can use expert parallelism with Hybrid sharded data parallelism. Note that expert parallelism is currently not compatible with tensor parallelism.
Note
This expert parallelism training feature is available in the following combination of SageMaker libraries and the PyTorch library:
- SMP v2.3.0 and later
- The SageMaker Python SDK v2.214.4 and later
- PyTorch v2.2.0 and later
In your training script
As part of Step 1, initialize your script with torch.sagemaker.init() to activate SMP v2, and wrap your model with the torch.sagemaker.transform API, adding the config parameter to the API to activate MoE. The following code snippet shows how to activate SMP MoE for the generic model class AutoModelForCausalLM, pulling an MoE transformer model configuration using the from_config method for training from scratch, or the from_pretrained method for fine-tuning. To learn more about the SMP MoEConfig class, see torch.sagemaker.moe.moe_config.MoEConfig.
# Import the torch.sagemaker.transform API and initialize.
import torch.sagemaker as tsm
tsm.init()

# Import transformers AutoModelForCausalLM class.
from transformers import AutoModelForCausalLM

# Import the SMP-implementation of MoE configuration class.
from torch.sagemaker.moe.moe_config import MoEConfig

# Define a transformer model with an MoE model configuration
model = AutoModelForCausalLM.from_config(MoEModelConfig)

# Wrap it by torch.sagemaker.transform with the SMP MoE configuration.
model = tsm.transform(
    model,
    config=MoEConfig(
        smp_moe=True,
        random_seed=12345,
        moe_load_balancing="sinkhorn",
        global_token_shuffle=False,
        moe_all_to_all_dispatcher=True,
        moe_aux_loss_coeff=0.001,
        moe_z_loss_coeff=0.001
    )
)
SMP configuration
As part of Step 2, add the following parameter to the SMP configuration dictionary for the SageMaker PyTorch estimator.
{
    ..., # other SMP config parameters
    "expert_parallel_degree": 8
}
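As a hedged end-to-end sketch, this dictionary is passed to the SageMaker PyTorch estimator through its distribution argument together with torch_distributed. The entry point, IAM role, instance settings, framework versions, and S3 path below are placeholders to replace with your own.

from sagemaker.pytorch import PyTorch

# Placeholder values; replace with your own script, role, and instance settings.
smp_config = {
    "expert_parallel_degree": 8,
    # ... other SMP config parameters
}

estimator = PyTorch(
    entry_point="train.py",                  # training script that calls tsm.init()
    role="arn:aws:iam::111122223333:role/SageMakerRole",
    instance_type="ml.p4d.24xlarge",
    instance_count=1,
    framework_version="2.2.0",
    py_version="py310",
    distribution={
        "torch_distributed": {"enabled": True},
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                "parameters": smp_config,
            }
        },
    },
)

estimator.fit("s3://amzn-s3-demo-bucket/path/to/training-data")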