Amazon SageMaker
Developer Guide

Managed Spot Training in Amazon SageMaker

Amazon SageMaker makes it easy to train machine learning models using managed Amazon EC2 Spot instances. Managed spot training can optimize the cost of training models up to 90% over on-demand instances. Amazon SageMaker manages the Spot interruptions on your behalf.

Managed Spot Training uses Amazon EC2 Spot instance to run training jobs instead of on-demand instances. You can specify which training jobs use spot instances and a stopping condition that specifies how long Amazon SageMaker waits for a job to run using Amazon EC2 Spot instances. Metrics and logs generated during training runs are available in CloudWatch.

Spot instances can be interrupted, causing jobs to take longer to start or finish. You can configure your managed spot training job to use checkpoints. Amazon SageMaker copies checkpoint data from a local path to Amazon S3. When the job is restarted, Amazon SageMaker copies the data from Amazon S3 back into the local path. The training can then resume from the last checkpoint instead of restarting. For more information about checkpointing, see Using Checkpoints in Amazon SageMaker.

Note

Unless your training job will complete quickly, we recommend you use checkpointing with managed spot training. SageMaker built-in algorithms and marketplace algorithms that do not checkpoint are currently limited to a MaxWaitTimeInSeconds of 3600 seconds (60 minutes).

Using Managed Spot Training

To use managed spot training, create a training job. Set EnableManagedSpotTraining to True and specify the MaxWaitTimeInSecods. MaxWaitTimeInSeconds must be larger than MaxRuntimeInSeconds. For more information about creating a training job, see CreateTrainingJob.

You can calculate the savings from using managed spot training using the formula (1 - BillableTimeInSeconds / TrainingTimeInSeconds) * 100. For example, if BillableTimeInSeconds is 100 and TrainingTimeInSeconds is 500, the savings is 80%.

Managed Spot Training Lifecycle

You can monitor a training job using TrainingJobStatus and SecondaryStatus returned by DescribeTrainingJob. The list below shows how TrainingJobStatus and SecondaryStatus values change depending on the training scenario:

  • Spot instances acquired with no interruption during training

    1. InProgress: StartingDownloadingTrainingUploading

  • Spot instances interrupted once. Later, enough spot instances were acquired to finish the training job.

    1. InProgress: StartingDownloadingTrainingInterruptedStartingDownloadingTrainingUploading

  • Spot instances interrupted twice and MaxWaitTimeInSeconds exceeded.

    1. InProgress: StartingDownloadingTrainingInterruptedStartingDownloadingTrainingInterruptedDownloadingTraining

    2. Stopping: Stopping

    3. Stopped: MaxWaitTimeExceeded

  • Spot instances were never launched.

    1. InProgress: Starting

    2. Stopping: Stopping

    3. Stopped: MaxWaitTimeExceeded