
GENREL06-BP01 Design for fault-tolerance for high-performance distributed computation tasks

Fault-tolerant infrastructure identifies issues in long-running, high-performance distributed computation tasks and remediates them before they can disrupt the task. Because these tasks are expensive and time-consuming, use fault-tolerant infrastructure to reliably perform model customization jobs.

Desired outcome: When implemented, this best practice improves the reliability of your model customization workloads by automating recovery during fine-tuning, pre-training, and other long-running customization tasks.

Benefits of establishing this best practice: Automatically recover from failure - Fault-tolerant infrastructure can detect and recover from failures automatically, improving the reliability of long-running, high-performance distributed computation tasks like model customization.

Level of risk exposed if this best practice is not established: High

Implementation guidance

Model pre-training, continued pre-training, fine-tuning, and distillation are high-performance distributed computation tasks often required to optimize foundation models for generative AI workloads. These tasks require the orchestration of dozens or hundreds of virtual machines running workloads over days, weeks, months, or longer, and they are particularly susceptible to disruptions that could delay or stop training progress. Consider a managed or automated process that provisions and orchestrates the infrastructure on your behalf, handles errors, and preserves the workload's integrity.

Amazon SageMaker AI HyperPod clusters let you pre-train or fine-tune large language models on managed infrastructure. Amazon EC2 UltraClusters provide large-scale capacity for training and hosting large language models on purpose-built machine learning accelerators. Additionally, Amazon Bedrock offers managed fine-tuning, continued pre-training, and model distillation for a selection of foundation models.

When implementing fault-tolerant distributed training manually, evaluate options that can recover training and customization progress. Create training job recovery points by checkpointing model training. Keep track of training progress, and determine when to halt training based on observed metrics. Consider performant storage solutions (like Amazon FSx for Lustre) that give distributed compute tasks rapid access to large data volumes at scale. Managed training and model customization solutions provide these capabilities, but you can also consider self-hosting for some model training and customization initiatives.
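As an illustration of manual checkpoint-based recovery, the following is a minimal sketch in PyTorch for a self-managed training loop. The model, local checkpoint path, S3 bucket name, checkpoint interval, and stopping threshold are hypothetical placeholders, not values from any AWS-managed offering.

```python
"""Sketch: periodic checkpointing with resume and metric-based stopping.

The model, paths, bucket name, and threshold below are illustrative only.
"""
import os
import boto3
import torch

CHECKPOINT_PATH = "checkpoints/latest.pt"   # hypothetical local path (could live on FSx for Lustre)
S3_BUCKET = "example-training-checkpoints"  # hypothetical bucket for durable copies
TARGET_METRIC = 0.05                        # hypothetical stopping criterion

model = torch.nn.Linear(128, 10)            # stand-in for a real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
start_step = 0

# Resume from the most recent recovery point if one exists.
os.makedirs(os.path.dirname(CHECKPOINT_PATH), exist_ok=True)
if os.path.exists(CHECKPOINT_PATH):
    state = torch.load(CHECKPOINT_PATH)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_step = state["step"] + 1

s3 = boto3.client("s3")

for step in range(start_step, 10_000):
    optimizer.zero_grad()
    inputs = torch.randn(32, 128)           # placeholder batch
    loss = model(inputs).pow(2).mean()      # placeholder objective
    loss.backward()
    optimizer.step()

    if step % 500 == 0:
        # Create a recovery point, then copy it to durable object storage.
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "step": step}, CHECKPOINT_PATH)
        s3.upload_file(CHECKPOINT_PATH, S3_BUCKET, f"checkpoints/step-{step}.pt")

    # Halt based on an observed metric (a stand-in for a real validation metric).
    if loss.item() < TARGET_METRIC:
        break
```

If the job is interrupted, restarting the same script resumes from the latest recovery point instead of losing days or weeks of progress.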

Implementation steps

  1. In Amazon Bedrock, when using custom models (an example API call is sketched after these steps):

    • Select a model customization job like fine-tuning or continued pre-training.

    • Follow the prompts to begin executing the job.

    • Test the output once the job has completed.

  2. Alternatively, provision Amazon SageMaker AI HyperPod clusters or Amazon EC2 UltraClusters for self-managed training.

  3. Configure an object store, such as Amazon S3, for workload checkpointing.

  4. Provision a high-performance Amazon FSx for Lustre file system containing your training and customization data.
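To complement step 1, here is a hedged sketch of starting the same kind of Amazon Bedrock fine-tuning job programmatically with boto3 and polling it until it reaches a terminal state. The job name, custom model name, IAM role ARN, base model identifier, S3 URIs, and hyperparameter values are placeholders to replace with your own.

```python
"""Sketch: submitting and monitoring a Bedrock model customization job.

The ARNs, S3 URIs, base model identifier, and hyperparameters are placeholders.
"""
import time
import boto3

bedrock = boto3.client("bedrock")

response = bedrock.create_model_customization_job(
    jobName="example-fine-tuning-job",                  # placeholder
    customModelName="example-custom-model",             # placeholder
    roleArn="arn:aws:iam::111122223333:role/BedrockCustomizationRole",  # placeholder
    baseModelIdentifier="amazon.titan-text-express-v1", # placeholder base model
    customizationType="FINE_TUNING",                    # or "CONTINUED_PRE_TRAINING"
    trainingDataConfig={"s3Uri": "s3://example-bucket/training-data/"},
    outputDataConfig={"s3Uri": "s3://example-bucket/custom-model-output/"},
    hyperParameters={"epochCount": "2", "batchSize": "1", "learningRate": "0.00001"},
)
job_arn = response["jobArn"]

# Poll until the managed job finishes; Bedrock handles the underlying
# infrastructure and recovery, so the client only tracks terminal states.
while True:
    job = bedrock.get_model_customization_job(jobIdentifier=job_arn)
    status = job["status"]
    if status in ("Completed", "Failed", "Stopped"):
        print(f"Job finished with status: {status}")
        break
    time.sleep(60)
```

Because the job is managed, Bedrock provisions and recovers the underlying training infrastructure; your code only monitors the job and tests the resulting custom model once it completes.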
