ClusterTieredStorageConfig
Defines the configuration for managed tier checkpointing in a HyperPod cluster. Managed tier checkpointing uses multiple storage tiers, including cluster CPU memory, to provide faster checkpoint operations and improved fault tolerance for large-scale model training. The system automatically saves checkpoints to memory at high frequency and periodically persists them to durable storage, such as Amazon S3.
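The two-tier behavior described above can be illustrated with a toy model. This is a conceptual sketch only, not the HyperPod implementation: the class TieredCheckpointer, its persist_every parameter, and the plain dictionaries standing in for CPU memory and durable storage are all hypothetical names introduced for illustration.

```python
class TieredCheckpointer:
    """Toy model of tiered checkpointing (hypothetical, for illustration only)."""

    def __init__(self, persist_every: int):
        if persist_every <= 0:
            raise ValueError("persist_every must be a positive integer")
        self.persist_every = persist_every
        self.memory_tier = {}   # stands in for cluster CPU memory
        self.durable_tier = {}  # stands in for durable storage such as Amazon S3

    def save(self, step: int, state: dict) -> None:
        # Every step is checkpointed to the fast memory tier;
        # only the most recent checkpoint is kept there.
        self.memory_tier = {step: dict(state)}
        # Every persist_every-th step is also copied to the durable tier.
        if step % self.persist_every == 0:
            self.durable_tier[step] = dict(state)

    def latest(self):
        # Prefer the memory tier; fall back to durable storage if memory is lost.
        tier = self.memory_tier or self.durable_tier
        if not tier:
            return None
        step = max(tier)
        return step, tier[step]


checkpointer = TieredCheckpointer(persist_every=3)
for step in range(1, 8):
    checkpointer.save(step, {"step": step})
```

In this sketch, recovery normally resumes from the newest in-memory checkpoint. If the memory tier is lost, recovery falls back to the most recent checkpoint that was persisted to the durable tier.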
Contents
- Mode
Specifies whether managed tier checkpointing is enabled or disabled for the HyperPod cluster. When set to Enable, the system installs a memory management daemon that provides disaggregated memory as a service for checkpoint storage. When set to Disable, the feature is turned off and the memory management daemon is removed from the cluster.
Type: String
Valid Values: Enable | Disable
Required: Yes
- InstanceMemoryAllocationPercentage
The percentage of cluster memory to allocate for checkpointing, expressed as an integer from 0 to 100.
Type: Integer
Valid Range: Minimum value of 0. Maximum value of 100.
Required: No
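As a minimal sketch of how the constraints above might be checked client-side before a request is sent, the following helper builds a dictionary with the shape of this data type. The function name build_tiered_storage_config and its parameter names are hypothetical and are not part of any AWS SDK. Only the field names, valid values, and range come from this page.

```python
from typing import Optional

# Valid values for Mode, as listed in this reference.
VALID_MODES = ("Enable", "Disable")


def build_tiered_storage_config(mode: str, memory_percentage: Optional[int] = None) -> dict:
    """Hypothetical helper: build a ClusterTieredStorageConfig-shaped dict."""
    # Mode is required and case-sensitive.
    if mode not in VALID_MODES:
        raise ValueError(f"Mode must be one of {VALID_MODES}, got {mode!r}")
    config = {"Mode": mode}

    # InstanceMemoryAllocationPercentage is optional; include it only when given.
    if memory_percentage is not None:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(memory_percentage, bool) or not isinstance(memory_percentage, int):
            raise TypeError("InstanceMemoryAllocationPercentage must be an integer")
        if not 0 <= memory_percentage <= 100:
            raise ValueError("InstanceMemoryAllocationPercentage must be between 0 and 100")
        config["InstanceMemoryAllocationPercentage"] = memory_percentage
    return config


enabled_config = build_tiered_storage_config("Enable", 20)
disabled_config = build_tiered_storage_config("Disable")
```

Omitting the optional field from the dictionary, rather than setting it to a placeholder, leaves the choice of default to the service.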
See Also
For more information about using this API in one of the language-specific AWS SDKs, see the following: