Use the SageMaker and Debugger Configuration API Operations to Create, Update, Debug, and Profile Your Training Job - Amazon SageMaker

The preceding topics focus on using Debugger through the Amazon SageMaker Python SDK, which is a wrapper around the AWS boto3 API operations for SageMaker and offers a high-level way to access the Amazon SageMaker API operations. If you need to use the SageMaker API operations directly with other SDKs, such as Java, Go, and C++, the following topics cover how to configure CreateTrainingJob, UpdateTrainingJob, and the Debugger configuration objects and their parameters to use the Debugger built-in and custom rules.

Add Debugger Built-in Rule Configuration to the CreateTrainingJob API Operation

Amazon SageMaker Debugger built-in rules can be configured for a training job using the DebugHookConfig, DebugRuleConfigurations, ProfilerConfig, and ProfilerRuleConfigurations objects in the CreateTrainingJob API operation. The built-in and custom rules run in processing containers, and the ECR image URIs are listed in the Use Debugger Docker Images for Built-in or Custom Rules topic. Specify the image URI that matches your AWS Region in the RuleEvaluatorImage parameter. The following examples walk you through how to set up the JSON request for CreateTrainingJob.

To configure a Debugger rule for debugging model parameters

The following code samples show how to configure the built-in VanishingGradient rule using the CreateTrainingJob API operation.

Specify the Debugger hook configuration as follows:

"DebugHookConfig": {
    "S3OutputPath": "s3://bucket/path-to-tensors",
    "CollectionConfigurations": [
        {
            "CollectionName": "gradients",
            "CollectionParameters": {
                "save_interval": "500"
            }
        }
    ]
}

This makes the training job save the gradients tensor collection every 500 steps, as set by save_interval. The following example of the DebugRuleConfigurations parameter demonstrates how to run the built-in VanishingGradient rule on the saved gradients collection.

"DebugRuleConfigurations": [
    {
        "RuleConfigurationName": "VanishingGradient",
        "RuleEvaluatorImage": "503895931360.dkr.ecr.us-east-1.amazonaws.com/sagemaker-debugger-rules:latest",
        "RuleParameters": {
            "rule_to_invoke": "VanishingGradient",
            "threshold": "20.0"
        }
    }
]

With a configuration like the one in this sample, Amazon SageMaker Debugger starts a rule evaluation job for your training job, running the built-in VanishingGradient rule on the gradients tensor collection.
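
For readers who script requests in Python, the two fragments above can be assembled into one dictionary before the request is sent. The following is a minimal sketch, not an official helper: the function name build_debugger_config and the base_request dictionary mentioned in the closing comment are hypothetical, and converting values with str() assumes that the parameter maps take string values, as in the samples above.

```python
import json


def build_debugger_config(s3_output_path, rule_image_uri, save_interval=500, threshold=20.0):
    """Return the Debugger-related fields of a CreateTrainingJob request (sketch)."""
    return {
        "DebugHookConfig": {
            "S3OutputPath": s3_output_path,
            "CollectionConfigurations": [
                {
                    "CollectionName": "gradients",
                    # The sample passes parameter values as strings, so convert here.
                    "CollectionParameters": {"save_interval": str(save_interval)},
                }
            ],
        },
        "DebugRuleConfigurations": [
            {
                "RuleConfigurationName": "VanishingGradient",
                "RuleEvaluatorImage": rule_image_uri,
                "RuleParameters": {
                    "rule_to_invoke": "VanishingGradient",
                    "threshold": str(threshold),
                },
            }
        ],
    }


debugger_fields = build_debugger_config(
    "s3://bucket/path-to-tensors",
    "503895931360.dkr.ecr.us-east-1.amazonaws.com/sagemaker-debugger-rules:latest",
)
print(json.dumps(debugger_fields, indent=2))

# With boto3 installed, and with the other required CreateTrainingJob fields
# (TrainingJobName, AlgorithmSpecification, RoleArn, and so on) collected in a
# hypothetical dict named base_request, the call would look like:
#   boto3.client("sagemaker").create_training_job(**base_request, **debugger_fields)
```

Keeping the Debugger fields in their own dictionary makes it easy to reuse the same hook and rule settings across several training jobs.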

To configure a Debugger built-in rule for profiling system and framework metrics

The following example code shows how to specify the ProfilerConfig parameter to enable collecting system and framework metrics.

Target Step

"ProfilerConfig": {
    "ProfilingIntervalInMilliseconds": 500,
    "ProfilingParameters": {
        // StartStep defaults to the first step (step 0); NumSteps defaults to 1.
        "DetailedProfilingConfig": "{\"StartStep\": \"7\", \"NumSteps\": \"1\"}",
        // StartStep defaults to the first step (step 0); NumSteps defaults to 3.
        // cProfileTimer options: cpu, off_cpu, total_time. ProfilerName options: cProfile, Pyinstrument.
        "PythonProfilingConfig": "{\"StartStep\": \"9\", \"NumSteps\": \"3\", \"cProfileTimer\": \"total_time\", \"ProfilerName\": \"cProfile\"}",
        // StartStep defaults to the first step (step 0); NumSteps defaults to 1.
        "DataLoaderProfilingConfig": "{\"StartStep\": \"5\", \"NumSteps\": \"1\"}"
    }
}

Target Time Duration

"ProfilerConfig": {
    "ProfilingIntervalInMilliseconds": 500,
    "ProfilingParameters": {
        // StartTimeInSecSinceEpoch defaults to the current time; DurationInSeconds defaults to the duration of 1 step.
        "DetailedProfilingConfig": "{\"StartTimeInSecSinceEpoch\": \"12345567789\", \"DurationInSeconds\": \"1\"}",
        // cProfileTimer options: cpu, off_cpu, total_time. ProfilerName options: cProfile, Pyinstrument.
        "PythonProfilingConfig": "{\"StartTimeInSecSinceEpoch\": \"12345567789\", \"DurationInSeconds\": \"1\", \"cProfileTimer\": \"total_time\", \"ProfilerName\": \"cProfile\"}",
        // StartTimeInSecSinceEpoch defaults to the current time; DurationInSeconds defaults to the duration of 1 step.
        "DataLoaderProfilingConfig": "{\"StartTimeInSecSinceEpoch\": \"12345567789\", \"DurationInSeconds\": \"1\"}"
    }
}

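The escaped quotation marks in the ProfilerConfig samples suggest that each nested profiling configuration is passed as a JSON-encoded string, which matches the string-to-string ProfilingParameters map in the UpdateTrainingJob request syntax. Under that assumption, a small Python helper can do the escaping for you. The name build_profiler_config is hypothetical, and the configuration keys are copied from the step-based sample.

```python
import json


def build_profiler_config(interval_ms=500, detailed=None, python=None, dataloader=None):
    """Assemble a ProfilerConfig dict; each nested config becomes a JSON string."""
    sections = {
        "DetailedProfilingConfig": detailed,
        "PythonProfilingConfig": python,
        "DataLoaderProfilingConfig": dataloader,
    }
    return {
        "ProfilingIntervalInMilliseconds": interval_ms,
        # Assumption: ProfilingParameters maps strings to strings, so every
        # nested configuration is serialized with json.dumps.
        "ProfilingParameters": {
            name: json.dumps(cfg) for name, cfg in sections.items() if cfg is not None
        },
    }


profiler_config = build_profiler_config(
    detailed={"StartStep": "7", "NumSteps": "1"},
    python={
        "StartStep": "9",
        "NumSteps": "3",
        "cProfileTimer": "total_time",
        "ProfilerName": "cProfile",
    },
    dataloader={"StartStep": "5", "NumSteps": "1"},
)
print(json.dumps(profiler_config, indent=2))
```

Sections that are left as None are omitted from the map, so the same helper covers jobs that enable only one kind of profiling.
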
The following example code shows how to configure the ProfilerReport rule.

"ProfilerRuleConfigurations": [
    {
        "RuleConfigurationName": "ProfilerReport",
        "RuleEvaluatorImage": "503895931360.dkr.ecr.us-east-1.amazonaws.com/sagemaker-debugger-rules:latest",
        "RuleParameters": {
            "rule_to_invoke": "ProfilerReport",
            "CPUBottleneck_cpu_threshold": "90",
            "IOBottleneck_threshold": "90"
        }
    }
]

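In the ProfilerReport sample, the thresholds are written as strings ("90"), consistent with the string-to-string RuleParameters map in the request syntax. The sketch below shows one way to build the same entry from numeric thresholds in Python. The helper name build_profiler_report_rule is hypothetical, and the two threshold names are the ones from the sample, not a complete list.

```python
def build_profiler_report_rule(rule_image_uri, **thresholds):
    """Build one ProfilerRuleConfigurations entry for the ProfilerReport rule."""
    params = {"rule_to_invoke": "ProfilerReport"}
    # Convert every threshold to a string, as in the sample configuration.
    params.update({name: str(value) for name, value in thresholds.items()})
    return {
        "RuleConfigurationName": "ProfilerReport",
        "RuleEvaluatorImage": rule_image_uri,
        "RuleParameters": params,
    }


profiler_rule = build_profiler_report_rule(
    "503895931360.dkr.ecr.us-east-1.amazonaws.com/sagemaker-debugger-rules:latest",
    CPUBottleneck_cpu_threshold=90,
    IOBottleneck_threshold=90,
)
profiler_rule_configurations = [profiler_rule]
print(profiler_rule_configurations)
```
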
Update Debugger Profiling Configuration Using the UpdateTrainingJob API Operation

You can update the Debugger profiling configuration while your training job is running by using the UpdateTrainingJob API operation. Configure new ProfilerConfig and ProfilerRuleConfigurations objects, and specify the training job name in the TrainingJobName parameter.

{
    "ProfilerConfig": {
        "DisableProfiler": boolean,
        "ProfilingIntervalInMilliseconds": number,
        "ProfilingParameters": {
            "string" : "string"
        }
    },
    "ProfilerRuleConfigurations": [
        {
            "RuleConfigurationName": "string",
            "RuleEvaluatorImage": "string",
            "RuleParameters": {
                "string" : "string"
            }
        }
    ],
    "TrainingJobName": "your-training-job-name-YYYY-MM-DD-HH-MM-SS-SSS"
}
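
As an illustration of the request syntax above, the following Python sketch builds an UpdateTrainingJob request that turns profiling off for a running job. The helper name build_profiler_update is hypothetical. The boto3 call appears only in a comment, so the sketch runs without AWS credentials.

```python
def build_profiler_update(training_job_name, profiler_config=None, profiler_rules=None):
    """Build an UpdateTrainingJob request body that changes profiling settings."""
    request = {"TrainingJobName": training_job_name}
    # Include only the fields that the caller wants to change.
    if profiler_config is not None:
        request["ProfilerConfig"] = profiler_config
    if profiler_rules:
        request["ProfilerRuleConfigurations"] = profiler_rules
    return request


disable_request = build_profiler_update(
    "your-training-job-name-YYYY-MM-DD-HH-MM-SS-SSS",
    profiler_config={"DisableProfiler": True},
)
print(disable_request)

# With boto3 installed and credentials configured, the request could be sent as:
#   boto3.client("sagemaker").update_training_job(**disable_request)
```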

Add Debugger Custom Rule Configuration to the CreateTrainingJob API Operation

A custom rule can be configured for a training job using the DebugHookConfig and DebugRuleConfigurations objects in the CreateTrainingJob API operation. The following code sample shows how to configure a custom ImproperActivation rule written with the smdebug library using this SageMaker API operation. This example assumes that you’ve written the custom rule in a custom_rules.py file and uploaded it to an Amazon S3 bucket. The example uses one of the pre-built Docker images that you can use to run your custom rules. These images are listed at Amazon SageMaker Debugger Registry URLs for Custom Rule Evaluators. Specify the registry URL of the pre-built Docker image in the RuleEvaluatorImage parameter.

"DebugHookConfig": {
    "S3OutputPath": "s3://bucket/",
    "CollectionConfigurations": [
        {
            "CollectionName": "relu_activations",
            "CollectionParameters": {
                "include_regex": "relu",
                "save_interval": "500",
                "end_step": "5000"
            }
        }
    ]
},
"DebugRuleConfigurations": [
    {
        "RuleConfigurationName": "improper_activation_job",
        "RuleEvaluatorImage": "552407032007.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-debugger-rule-evaluator:latest",
        "InstanceType": "ml.c4.xlarge",
        "VolumeSizeInGB": 400,
        "RuleParameters": {
            "source_s3_uri": "s3://bucket/custom_rules.py",
            "rule_to_invoke": "ImproperActivation",
            "collection_names": "relu_activations"
        }
    }
]
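
A custom rule can only evaluate tensors that the hook actually saves, so it can help to cross-check the two objects before submitting the request. The following Python sketch is one possible pre-flight check. The function check_custom_rule is hypothetical, the checks are this sketch's own and not validation that SageMaker is known to perform, and treating collection_names as a comma-separated list is an assumption.

```python
def check_custom_rule(hook_config, rule_config):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    params = rule_config.get("RuleParameters", {})
    # The custom rule sample sets both of these parameters.
    for key in ("source_s3_uri", "rule_to_invoke"):
        if key not in params:
            problems.append(f"RuleParameters is missing {key}")
    if not params.get("source_s3_uri", "s3://").startswith("s3://"):
        problems.append("source_s3_uri should be an s3:// URI")
    # Assumption: collection_names is a comma-separated list of collection names.
    saved = {c["CollectionName"] for c in hook_config.get("CollectionConfigurations", [])}
    wanted = {n.strip() for n in params.get("collection_names", "").split(",") if n.strip()}
    for name in sorted(wanted - saved):
        problems.append(f"collection {name} is not saved by the hook")
    return problems


hook_config = {
    "S3OutputPath": "s3://bucket/",
    "CollectionConfigurations": [
        {
            "CollectionName": "relu_activations",
            "CollectionParameters": {
                "include_regex": "relu",
                "save_interval": "500",
                "end_step": "5000",
            },
        }
    ],
}
rule_config = {
    "RuleConfigurationName": "improper_activation_job",
    "RuleEvaluatorImage": "552407032007.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-debugger-rule-evaluator:latest",
    "InstanceType": "ml.c4.xlarge",
    "VolumeSizeInGB": 400,
    "RuleParameters": {
        "source_s3_uri": "s3://bucket/custom_rules.py",
        "rule_to_invoke": "ImproperActivation",
        "collection_names": "relu_activations",
    },
}
print(check_custom_rule(hook_config, rule_config))
```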