Train Using SageMaker Managed Warm Pools - Amazon SageMaker

SageMaker managed warm pools let you retain and reuse provisioned infrastructure after the completion of a training job to reduce latency for repetitive workloads, such as iterative experimentation or running many jobs consecutively. Subsequent training jobs that match specified parameters run on the retained warm pool infrastructure, which speeds up start times by reducing the time spent provisioning resources.

Important

SageMaker managed warm pools are a billable resource. For more information, see Billing.

How it works

To use SageMaker managed warm pools and reduce latency between similar consecutive training jobs, create a training job that specifies a KeepAlivePeriodInSeconds value in its ResourceConfig. This value represents the duration of time in seconds to retain configured resources in a warm pool for subsequent training jobs. If you need to run several training jobs using similar configurations, you can further reduce latency and billable time by using a dedicated persistent cache directory to store and re-use your information in a different job.

Warm pool lifecycle

  1. Create an initial training job with a KeepAlivePeriodInSeconds value greater than 0. Running this first training job “cold-starts” a cluster with typical startup times.

  2. When the first training job completes, the provisioned resources are kept alive in a warm pool for the period specified in the KeepAlivePeriodInSeconds value. While the cluster is healthy and the warm pool is within the specified KeepAlivePeriodInSeconds, the warm pool status is Available.

  3. The warm pool stays Available until it either identifies a matching training job for reuse or it exceeds the specified KeepAlivePeriodInSeconds and is terminated. The maximum length of time allowed for the KeepAlivePeriodInSeconds is 3600 seconds (60 minutes). If the warm pool status is Terminated, then this is the end of the warm pool lifecycle.

  4. If the warm pool identifies a second training job with matching specifications such as instance count or instance type, then the warm pool moves from the first training job to the second training job for reuse. The status of the first training job warm pool becomes Reused. This is the end of the warm pool lifecycle for the first training job.

  5. The status of the second training job that reused the warm pool becomes InUse. After the second training job completes, the warm pool is Available for the KeepAlivePeriodInSeconds duration specified in the second training job. A warm pool can continue moving to subsequent matching training jobs for a maximum of 28 days.

  6. If the warm pool is no longer available to reuse, the warm pool status is Terminated. Warm pools are no longer available if they are terminated by a user, for a patch update, or for exceeding the specified KeepAlivePeriodInSeconds.

For more information on warm pool status options, see WarmPoolStatus in the Amazon SageMaker API Reference.
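The lifecycle above can be read as a small state machine over the warm pool status that each training job reports. The following is only an illustrative sketch: the four status values come from this guide, but the event names and the next_status helper are hypothetical and are not part of any SageMaker API.

```python
from enum import Enum


class WarmPoolStatus(Enum):
    """Warm pool status values named in this guide."""
    AVAILABLE = "Available"
    IN_USE = "InUse"
    REUSED = "Reused"
    TERMINATED = "Terminated"


# (current status, event) -> next status.
# The event names are hypothetical labels chosen for this sketch.
TRANSITIONS = {
    (WarmPoolStatus.IN_USE, "job_completed"): WarmPoolStatus.AVAILABLE,
    (WarmPoolStatus.AVAILABLE, "matching_job_started"): WarmPoolStatus.REUSED,
    (WarmPoolStatus.AVAILABLE, "keep_alive_expired"): WarmPoolStatus.TERMINATED,
    (WarmPoolStatus.AVAILABLE, "released_by_user"): WarmPoolStatus.TERMINATED,
    (WarmPoolStatus.AVAILABLE, "patch_update"): WarmPoolStatus.TERMINATED,
}


def next_status(current, event):
    """Return the status that follows `event`, or raise if the move is not modeled."""
    try:
        return TRANSITIONS[(current, event)]
    except KeyError:
        raise ValueError(f"no transition from {current.value} on {event!r}") from None
```

In this model, Reused and Terminated have no outgoing transitions, which mirrors the statement that they end the warm pool lifecycle for a given training job.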

Warm pool creation

If an initial training job successfully completes and has a KeepAlivePeriodInSeconds value greater than 0, this creates a warm pool. If you stop a training job after a cluster is already launched, a warm pool is still retained. If the training job fails due to an algorithm or client error, a warm pool is still retained. If the training job fails for any other reason that might compromise the health of the cluster, then the warm pool is not created.

To verify successful warm pool creation, check the warm pool status of your training job. If a warm pool successfully provisions, the warm pool status is Available. If a warm pool fails to provision, the warm pool status is Terminated.
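The creation rules above can be summarized as a simple predicate. This is a sketch for reasoning about the rules, not a SageMaker API: the outcome labels and the warm_pool_expected function are hypothetical, and the sketch assumes your account already has enough warm pool quota for the instance type.

```python
# Hypothetical outcome labels for illustration; they are not SageMaker API values.
RETAINING_OUTCOMES = {
    "completed",             # the job finished successfully
    "stopped_after_launch",  # the job was stopped after the cluster launched
    "algorithm_error",       # the job failed because of the training code
    "client_error",          # the job failed because of the request or configuration
}


def warm_pool_expected(keep_alive_period_in_seconds, outcome):
    """Return True if, per the rules in this section, a warm pool should be retained."""
    keep_alive = keep_alive_period_in_seconds or 0  # treat "not specified" as 0
    return keep_alive > 0 and outcome in RETAINING_OUTCOMES
```

Any outcome outside the listed set, such as a failure that might compromise cluster health, yields False in this sketch.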

Matching training jobs

For a warm pool to persist, it must find a matching training job within the time specified in the KeepAlivePeriodInSeconds value. The next training job is a match if the following values are identical:

  • RoleArn

  • ResourceConfig values:

    • InstanceCount

    • InstanceType

    • VolumeKmsKeyId

    • VolumeSizeInGB

  • VpcConfig values:

    • SecurityGroupIds

    • Subnets

  • EnableInterContainerTrafficEncryption

  • EnableNetworkIsolation

  • If you passed session tags for your training job with EnableSessionTagChaining set to True in the training job's SessionChainingConfig, then a matching training job must also set EnableSessionTagChaining to True and have identical session keys. For more information, see Attribute-based access control (ABAC) for multi-tenancy training.

All of these values must be the same for a warm pool to move to a subsequent training job for reuse.
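To check two job definitions before you submit the second one, you could compare the fields listed above. The following sketch assumes each job is a dictionary shaped like a CreateTrainingJob request. The helper names are hypothetical, list values such as Subnets are compared exactly as given, and the session tag chaining rule is not covered.

```python
def _match_fields(job):
    """Extract the values that this guide says must be identical."""
    resource = job.get("ResourceConfig", {})
    vpc = job.get("VpcConfig", {})
    return {
        "RoleArn": job.get("RoleArn"),
        "InstanceCount": resource.get("InstanceCount"),
        "InstanceType": resource.get("InstanceType"),
        "VolumeKmsKeyId": resource.get("VolumeKmsKeyId"),
        "VolumeSizeInGB": resource.get("VolumeSizeInGB"),
        "SecurityGroupIds": vpc.get("SecurityGroupIds"),
        "Subnets": vpc.get("Subnets"),
        "EnableInterContainerTrafficEncryption": job.get(
            "EnableInterContainerTrafficEncryption", False
        ),
        "EnableNetworkIsolation": job.get("EnableNetworkIsolation", False),
    }


def mismatched_fields(first_job, second_job):
    """Return the names of fields that would prevent warm pool reuse."""
    first, second = _match_fields(first_job), _match_fields(second_job)
    return [name for name in first if first[name] != second[name]]
```

An empty list means the two definitions agree on every field checked here. Hyperparameters are deliberately ignored, because they do not need to match.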

Maximum warm pool duration

The maximum KeepAlivePeriodInSeconds for a single training job is 3600 seconds (60 minutes) and the maximum length of time that a warm pool cluster can continue running consecutive training jobs is 28 days.

Each subsequent training job must also specify a KeepAlivePeriodInSeconds value. When the warm pool moves to the next training job, it inherits the new KeepAlivePeriodInSeconds value specified in that training job’s ResourceConfig. In this way, you can keep a warm pool moving from training job to training job for a maximum of 28 days.

If no KeepAlivePeriodInSeconds is specified, then the warm pool spins down after the training job completes.
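The two limits can be expressed as constants and checked before you submit a job. This is a minimal client-side sketch under the limits stated in this section. The function names are hypothetical, and how SageMaker itself enforces the 28-day cap is not modeled.

```python
MAX_KEEP_ALIVE_SECONDS = 3600                     # 60 minutes per training job
MAX_CLUSTER_LIFETIME_SECONDS = 28 * 24 * 60 * 60  # 28 days = 2,419,200 seconds


def validate_keep_alive(seconds):
    """Return `seconds` if it is an integer in [0, 3600]; otherwise raise."""
    if isinstance(seconds, bool) or not isinstance(seconds, int):
        raise TypeError("KeepAlivePeriodInSeconds must be an integer")
    if not 0 <= seconds <= MAX_KEEP_ALIVE_SECONDS:
        raise ValueError(
            f"KeepAlivePeriodInSeconds must be between 0 and {MAX_KEEP_ALIVE_SECONDS}"
        )
    return seconds


def remaining_cluster_lifetime(cluster_age_seconds):
    """Seconds left before a cluster of the given age reaches the 28-day maximum."""
    return max(0, MAX_CLUSTER_LIFETIME_SECONDS - cluster_age_seconds)
```

For example, a cluster that has been reused continuously for 27 days has 86,400 seconds (one day) left, regardless of the KeepAlivePeriodInSeconds value of the next job.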

Using persistent cache

When you create a warm pool, SageMaker mounts a special directory on the volume that will persist throughout the lifecycle of the warm pool. This directory can also be used to store information that you want to re-use in another job.

Using persistent cache can reduce latency and billable time over using warm pools alone for jobs that require the following:

  • multiple interactions with similar configurations

  • incremental training jobs

  • hyperparameter optimization

For example, you can avoid downloading the same Python dependencies on repeated runs by setting up a pip cache directory inside the persistent cache directory. You are fully responsible for managing the contents of this directory. The following are examples of types of information that you can put in your persistent cache to help reduce your latency and billable time.

  • Dependencies managed by pip.

  • Dependencies managed by conda.

  • Checkpoint information.

  • Any additional information generated during training.

The location of the persistent cache is /opt/ml/sagemaker/warmpoolcache. The environment variable SAGEMAKER_MANAGED_WARMPOOL_CACHE_DIRECTORY points to the location of the persistent cache directory.

The following code example shows you how to set up a warm pool and use persistent cache to store your pip dependencies for use in a subsequent job. The subsequent job must run within the time frame given by the parameter keep_alive_period_in_seconds.

import sagemaker
from sagemaker import get_execution_role
from sagemaker.tensorflow import TensorFlow

# Creates a SageMaker session and gets execution role
session = sagemaker.Session()
role = get_execution_role()

# Creates an example estimator
estimator = TensorFlow(
    ...
    entry_point='my-training-script.py',
    source_dir='code',
    role=role,
    model_dir='model_dir',
    framework_version='2.2',
    py_version='py37',
    job_name='my-training-job-1',
    instance_type='ml.g4dn.xlarge',
    instance_count=1,
    volume_size=250,
    hyperparameters={
        "batch-size": 512,
        "epochs": 1,
        "learning-rate": 1e-3,
        "beta_1": 0.9,
        "beta_2": 0.999,
    },
    keep_alive_period_in_seconds=1800,
    environment={"PIP_CACHE_DIR": "/opt/ml/sagemaker/warmpoolcache/pip"}
)

In the previous code example, the environment parameter exports the environment variable PIP_CACHE_DIR, which points to the directory /opt/ml/sagemaker/warmpoolcache/pip. Exporting this environment variable changes where pip stores its cache to the new location. Any directory, including nested directories, that you create inside the persistent cache directory is available for re-use during a subsequent training run. In the previous code example, a directory called pip becomes the default location where pip caches any dependencies that it installs.

The persistent cache location may also be accessed from within your Python training script using the environment variable as shown in the following code example.

import os
import shutil

if __name__ == '__main__':
    PERSISTED_DIR = os.environ["SAGEMAKER_MANAGED_WARMPOOL_CACHE_DIRECTORY"]

    # create a file to be persisted
    open(os.path.join(PERSISTED_DIR, "test.txt"), 'a').close()

    # create a directory to be persisted
    # (exist_ok=True so that a later job reusing the warm pool does not fail
    # when the directory is already there)
    os.makedirs(os.path.join(PERSISTED_DIR, "test_dir"), exist_ok=True)

    # Move a file to be persisted
    shutil.move("path/of/your/file.txt", PERSISTED_DIR)
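A training script may also run where the persistent cache is not mounted, for example during local testing. One possible approach, sketched here with a hypothetical persistent_subdir helper, is to fall back to a temporary directory when the environment variable is not set. The fallback behavior is an assumption of this sketch, not something SageMaker provides.

```python
import os
import tempfile

CACHE_ENV_VAR = "SAGEMAKER_MANAGED_WARMPOOL_CACHE_DIRECTORY"


def persistent_subdir(name, fallback_root=None):
    """Return a directory called `name` inside the persistent cache.

    If the warm pool cache variable is not set, use `fallback_root`
    (or the system temporary directory) so the script still runs.
    """
    root = os.environ.get(CACHE_ENV_VAR) or fallback_root or tempfile.gettempdir()
    path = os.path.join(root, name)
    # exist_ok=True: the directory may already exist from an earlier job.
    os.makedirs(path, exist_ok=True)
    return path
```

A script could then write checkpoints to persistent_subdir("checkpoints") and find them again in a later job that reuses the same warm pool.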

Billing

SageMaker managed warm pools are a billable resource. Retrieve the warm pool status for your training job to check the billable time for your warm pools. You can check the warm pool status either through the Amazon SageMaker console (see Using the Amazon SageMaker console) or directly through the DescribeTrainingJob API command. For more information, see WarmPoolStatus in the Amazon SageMaker API Reference.

Note

After the time specified by the parameter KeepAlivePeriodInSeconds has ended, both the warm pool and persistent cache will shut down, and the contents will be deleted.
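To track billable warm pool time across several jobs, you could read the WarmPoolStatus block from each DescribeTrainingJob response. The following sketch works on plain dictionaries shaped like those responses. The helper names are hypothetical, and the sample values in the usage note are made up.

```python
def warm_pool_summary(description):
    """Return job name, status, and billable seconds, or None if no warm pool info."""
    warm_pool = description.get("WarmPoolStatus")
    if warm_pool is None:
        return None
    return {
        "job": description.get("TrainingJobName"),
        "status": warm_pool.get("Status"),
        "billable_seconds": warm_pool.get("ResourceRetainedBillableTimeInSeconds", 0),
    }


def total_billable_seconds(descriptions):
    """Sum the retained billable seconds over several job descriptions."""
    summaries = (warm_pool_summary(d) for d in descriptions)
    return sum(s["billable_seconds"] for s in summaries if s is not None)
```

For instance, passing two descriptions that report 1000 and 250 retained billable seconds returns 1250. Jobs without a WarmPoolStatus block contribute nothing.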

Warm pool resource limits

To get started, you must first request a service limit increase for SageMaker managed warm pools. The default resource limit for warm pools is 0.

If a training job is created with KeepAlivePeriodInSeconds specified, but you did not request a warm pool limit increase, then a warm pool is not retained after the completion of the training job. A warm pool is only created if your warm pool limit has sufficient resources. After a warm pool is created, the resources are released when the warm pool moves to a matching training job or when the KeepAlivePeriodInSeconds expires (that is, when the warm pool status becomes Reused or Terminated).

Request a warm pool quota increase

Request a warm pool quota increase using the AWS Service Quotas console.

Note

All warm pool instance usage counts toward your SageMaker training resource limit. Increasing your warm pool resource limit does not increase your instance limit, but allocates a subset of your resource limit to warm pool training.

  1. Open the AWS Service Quotas console.

  2. On the left-hand navigation panel, choose AWS services.

  3. Search for and choose Amazon SageMaker.

  4. Search for the keyword warm pool to see all available warm pool service quotas.

  5. Find the instance type for which you want to increase your warm pool quota, select the warm pool service quota for that instance type, and choose Request quota increase.

  6. Enter your requested instance limit number under Change quota value. The new value must be greater than the current Applied quota value.

  7. Choose Request.

There is a limit on the number of instances that you can retain for each account, which is determined by instance type. You can check your resource limits in the AWS Service Quotas console or directly using the list-service-quotas AWS CLI command. For more information on AWS Service Quotas, see Requesting a quota increase in the Service Quotas User Guide.

You can also use AWS Support Center to request a warm pool quota increase. For a list of available instance types according to Region, see Amazon SageMaker Pricing and choose Training in the On-Demand Pricing table.
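As a programmatic alternative to searching the console, the following sketch lists SageMaker quotas with boto3 and keeps those whose name mentions warm pool, mirroring the keyword search in the steps above. The function names are hypothetical, the code assumes that boto3 is installed and AWS credentials are configured, and the exact quota names in your account may differ from the sample in the usage note.

```python
def warm_pool_quotas(quotas):
    """Map quota name to value for quotas whose name mentions 'warm pool'."""
    return {
        quota["QuotaName"]: quota["Value"]
        for quota in quotas
        if "warm pool" in quota["QuotaName"].lower()
    }


def fetch_sagemaker_quotas(region_name=None):
    """Fetch all SageMaker service quotas (requires boto3 and AWS credentials)."""
    import boto3  # imported here so the filter above works without boto3

    client = boto3.client("service-quotas", region_name=region_name)
    paginator = client.get_paginator("list_service_quotas")
    quotas = []
    for page in paginator.paginate(ServiceCode="sagemaker"):
        quotas.extend(page["Quotas"])
    return quotas
```

Calling warm_pool_quotas(fetch_sagemaker_quotas()) would return a dictionary such as {"ml.g4dn.xlarge for training warm pool usage": 0.0} for an account that has not requested an increase.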

How to use SageMaker managed warm pools

You can use SageMaker managed warm pools through the SageMaker Python SDK, the Amazon SageMaker console, or through the low-level APIs. Administrators can optionally use the sagemaker:KeepAlivePeriod condition key to further restrict the KeepAlivePeriodInSeconds limits for certain users or groups.

Using the SageMaker Python SDK

Create, update, or terminate warm pools using the SageMaker Python SDK.

Note

This feature is available in the SageMaker Python SDK v2.110.0 and later.

Create a warm pool

To create a warm pool, use the SageMaker Python SDK to create an estimator with a keep_alive_period_in_seconds value greater than 0 and call fit(). When the training job completes, a warm pool is retained. For more information on training scripts and estimators, see Train a Model with the SageMaker Python SDK. If your script does not create a warm pool, see Warm pool creation for possible explanations.

import sagemaker
from sagemaker import get_execution_role
from sagemaker.tensorflow import TensorFlow

# Creates a SageMaker session and gets execution role
session = sagemaker.Session()
role = get_execution_role()

# Creates an example estimator
estimator = TensorFlow(
    ...
    entry_point='my-training-script.py',
    source_dir='code',
    role=role,
    model_dir='model_dir',
    framework_version='2.2',
    py_version='py37',
    job_name='my-training-job-1',
    instance_type='ml.g4dn.xlarge',
    instance_count=1,
    volume_size=250,
    hyperparameters={
        "batch-size": 512,
        "epochs": 1,
        "learning-rate": 1e-3,
        "beta_1": 0.9,
        "beta_2": 0.999,
    },
    keep_alive_period_in_seconds=1800,
)

# Starts a SageMaker training job and waits until completion
estimator.fit('s3://my_bucket/my_training_data/')

Next, create a second matching training job. In this example, we create my-training-job-2, which has all of the necessary attributes to match with my-training-job-1, but has a different hyperparameter for experimentation. The second training job reuses the warm pool and starts up faster than the first training job. The following code example uses a TensorFlow estimator. The warm pool feature can be used with any training algorithm that runs on Amazon SageMaker. For more information on which attributes need to match, see Matching training jobs.

# Creates an example estimator
estimator = TensorFlow(
    ...
    entry_point='my-training-script.py',
    source_dir='code',
    role=role,
    model_dir='model_dir',
    framework_version='2.2',
    py_version='py37',
    job_name='my-training-job-2',
    instance_type='ml.g4dn.xlarge',
    instance_count=1,
    volume_size=250,
    hyperparameters={
        "batch-size": 512,
        "epochs": 2,
        "learning-rate": 1e-3,
        "beta_1": 0.9,
        "beta_2": 0.999,
    },
    keep_alive_period_in_seconds=1800,
)

# Starts a SageMaker training job and waits until completion
estimator.fit('s3://my_bucket/my_training_data/')

Check the warm pool status of both training jobs to confirm that the warm pool is Reused for my-training-job-1 and InUse for my-training-job-2.

Note

Training job names have date/time suffixes. Replace the example training job names my-training-job-1 and my-training-job-2 with the actual training job names. You can use the estimator.latest_training_job.job_name attribute to fetch the actual training job name.

session.describe_training_job('my-training-job-1')
session.describe_training_job('my-training-job-2')

The result of describe_training_job provides all details about a given training job. Find the WarmPoolStatus attribute to check information about a training job’s warm pool. Your output should look similar to the following example:

# Warm pool status for my-training-job-1
...
'WarmPoolStatus': {'Status': 'Reused',
                   'ResourceRetainedBillableTimeInSeconds': 1000,
                   'ReusedByJob': 'my-training-job-2'}
...

# Warm pool status for my-training-job-2
...
'WarmPoolStatus': {'Status': 'InUse'}
...

Update a warm pool

After the training job is complete and while the warm pool status is Available, you can update the KeepAlivePeriodInSeconds value.

session.update_training_job(job_name, resource_config={"KeepAlivePeriodInSeconds":3600})

Terminate a warm pool

To manually terminate a warm pool, set the KeepAlivePeriodInSeconds value to 0.

session.update_training_job(job_name, resource_config={"KeepAlivePeriodInSeconds":0})

The warm pool automatically terminates when it exceeds the designated KeepAlivePeriodInSeconds value or if there is a patch update for the cluster.

Using the Amazon SageMaker console

Through the console, you can create a warm pool, release a warm pool, or check the warm pool status and billable time of specific training jobs. You can also see which matching training job reused a warm pool.

  1. Open the Amazon SageMaker console and choose Training jobs from the navigation pane. If applicable, the warm pool status of each training job is visible in the Warm pool status column and the time left for an active warm pool is visible in the Time left column.

  2. To create a training job that uses a warm pool from the console, choose Create training job. Then, be sure to specify a value for the Keep alive period field when configuring your training job resources. This value must be an integer between 1 and 3600 and represents the duration of time in seconds.

  3. To release a warm pool from the console, select a specific training job and choose Release cluster from the Actions dropdown menu.

  4. To see more information about a warm pool, choose a training job name. In the job details page, scroll down to the Warm pool status section to find the warm pool status, the time left if the warm pool status is Available, the warm pool billable seconds, and the name of the training job that reused the warm pool if the warm pool status is Reused.

Using the low-level SageMaker APIs

Use SageMaker managed warm pools with either the SageMaker API or the AWS CLI.

SageMaker API

Set up SageMaker managed warm pools using the SageMaker API with the following commands:

  • CreateTrainingJob

  • UpdateTrainingJob

  • ListTrainingJobs

  • DescribeTrainingJob

AWS CLI

Set up SageMaker managed warm pools using the AWS CLI with the following commands:

  • create-training-job

  • update-training-job

  • list-training-jobs

  • describe-training-job
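As a rough sketch of the low-level route, the following helper builds a CreateTrainingJob request that sets KeepAlivePeriodInSeconds in ResourceConfig. The helper name and its parameters are hypothetical. The instance settings mirror the SDK examples in this guide, and the stopping condition and input mode are placeholder choices that you would adapt to your job.

```python
def build_create_training_job_request(
    job_name, role_arn, image_uri, output_s3_uri, keep_alive_period_in_seconds=1800
):
    """Build a CreateTrainingJob request that asks for a warm pool to be retained."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": "ml.g4dn.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 250,
            # Retain the cluster for this many seconds after the job completes.
            "KeepAlivePeriodInSeconds": keep_alive_period_in_seconds,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }
```

With boto3, the request could be submitted as boto3.client("sagemaker").create_training_job(**request). Later, update_training_job(TrainingJobName=job_name, ResourceConfig={"KeepAlivePeriodInSeconds": 0}) would release the warm pool in the same way as the SDK example in Terminate a warm pool.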

IAM condition key

Administrators can optionally use the sagemaker:KeepAlivePeriod condition key to further restrict the KeepAlivePeriodInSeconds limits for certain users or groups. SageMaker managed warm pools are limited to a KeepAlivePeriodInSeconds value of 3600 seconds (60 minutes), but administrators can lower this limit if needed.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "EnforceKeepAlivePeriodLimit",
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateTrainingJob"
            ],
            "Resource": "*",
            "Condition": {
                "NumericLessThanIfExists": {
                    "sagemaker:KeepAlivePeriod": 1800
                }
            }
        }
    ]
}

For more information, see Condition keys for Amazon SageMaker in the Service Authorization Reference.
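To reason about which requests the policy statement above allows, it can help to model the NumericLessThanIfExists operator. The following function is a simplified illustration of the operator's semantics, not the IAM policy engine, and its name and parameters are hypothetical.

```python
def numeric_less_than_if_exists(request_context, key, limit):
    """Simplified model of the IAM NumericLessThanIfExists condition operator.

    The condition is satisfied when the key is absent from the request,
    or when it is present and strictly less than `limit`.
    """
    if key not in request_context:
        return True
    return request_context[key] < limit
```

Under this model, a CreateTrainingJob request with a keep-alive period of 900 satisfies the condition, and so does a request that does not set a keep-alive period at all. A value of exactly 1800 does not satisfy it, because the comparison is strict. A request that fails the condition is simply not allowed by this statement, and it is denied unless another statement allows it.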

Considerations

Consider the following items when using SageMaker managed warm pools.

  • SageMaker managed warm pools cannot be used with heterogeneous cluster training.

  • SageMaker managed warm pools cannot be used with spot instances.

  • SageMaker managed warm pools are limited to a KeepAlivePeriodInSeconds value of 3600 seconds (60 minutes).

  • If a warm pool continues to successfully match training jobs within the specified KeepAlivePeriodInSeconds value, the cluster can only continue running for a maximum of 28 days.