SageMaker HyperPod with Slurm orchestration provides the following cluster resiliency features:
Health monitoring agent
Automatic node recovery and auto-resume
Manually replace or reboot a node