To configure a Debugger rule for debugging model parameters To configure a Debugger built-in rule for profiling system and framework metrics Update Debugger profiling configuration using the UpdateTrainingJob API Add Debugger custom rule configuration to the CreateTrainingJob API

JSON (AWS CLI)

Amazon SageMaker Debugger built-in rules can be configured for a training job using the DebugHookConfig, DebugRuleConfiguration, ProfilerConfig, and ProfilerRuleConfiguration objects through the SageMaker AI CreateTrainingJob API operation. You need to specify the right image URI in the RuleEvaluatorImage parameter, and the following examples walk you through how to set up the JSON strings to request CreateTrainingJob.

The following code shows a complete JSON template to run a training job with required settings and Debugger configurations. Save the template as a JSON file in your working directory and run the training job using AWS CLI. For example, save the following code as debugger-training-job-cli.json.

Note

Ensure that you use the correct Docker container images. To find AWS Deep Learning Container images, see Available Deep Learning Containers Images. To find a complete list of available Docker images for using the Debugger rules, see Docker images for Debugger rules.


{
   "TrainingJobName": "debugger-aws-cli-test",
   "RoleArn": "arn:aws:iam::111122223333:role/service-role/AmazonSageMaker-ExecutionRole-YYYYMMDDT123456",
   "AlgorithmSpecification": {
      // Specify a training Docker container image URI (Deep Learning Container or your own training container) to TrainingImage.
      "TrainingImage": "763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:2.4.1-gpu-py37-cu110-ubuntu18.04",
      "TrainingInputMode": "File",
      "EnableSageMakerMetricsTimeSeries": false
   },
   "HyperParameters": {
      "sagemaker_program": "entry_point/tf-hvd-train.py",
      "sagemaker_submit_directory": "s3://sagemaker-us-west-2-111122223333/debugger-boto3-profiling-test/source.tar.gz"
   },
   "OutputDataConfig": { 
      "S3OutputPath": "s3://sagemaker-us-west-2-111122223333/debugger-aws-cli-test/output"
   },
   "DebugHookConfig": { 
      "S3OutputPath": "s3://sagemaker-us-west-2-111122223333/debugger-aws-cli-test/debug-output",
      "CollectionConfigurations": [
         {
            "CollectionName": "losses",
            "CollectionParameters" : {
                "train.save_interval": "50"
            }
         }
      ]
   },
   "DebugRuleConfigurations": [ 
      { 
         "RuleConfigurationName": "LossNotDecreasing",
         "RuleEvaluatorImage": "895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest",
         "RuleParameters": {"rule_to_invoke": "LossNotDecreasing"}
      }
   ],
   "ProfilerConfig": { 
      "S3OutputPath": "s3://sagemaker-us-west-2-111122223333/debugger-aws-cli-test/profiler-output",
      "ProfilingIntervalInMilliseconds": 500,
      "ProfilingParameters": {
          "DataloaderProfilingConfig": "{\"StartStep\": 5, \"NumSteps\": 3, \"MetricsRegex\": \".*\", }",
          "DetailedProfilingConfig": "{\"StartStep\": 5, \"NumSteps\": 3, }",
          "PythonProfilingConfig": "{\"StartStep\": 5, \"NumSteps\": 3, \"ProfilerName\": \"cprofile\", \"cProfileTimer\": \"total_time\"}",
          "LocalPath": "/opt/ml/output/profiler/" 
      }
   },
   "ProfilerRuleConfigurations": [ 
      { 
         "RuleConfigurationName": "ProfilerReport",
         "RuleEvaluatorImage": "895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest",
         "RuleParameters": {"rule_to_invoke": "ProfilerReport"}
      }
   ],
   "ResourceConfig": { 
      "InstanceType": "ml.p3.8xlarge",
      "InstanceCount": 1,
      "VolumeSizeInGB": 30
   },
   
   "StoppingCondition": { 
      "MaxRuntimeInSeconds": 86400
   }
}

After saving the JSON file, run the following command in your terminal. (Use ! at the beginning of the line if you use a Jupyter notebook.)


aws sagemaker create-training-job --cli-input-json file://debugger-training-job-cli.json

To configure a Debugger rule for debugging model parameters

The following code samples show how to configure a built-in VanishingGradient rule using this SageMaker API.

To enable Debugger to collect output tensors

Specify the Debugger hook configuration as follows:


"DebugHookConfig": {
    "S3OutputPath": "s3://<default-bucket>/<training-job-name>/debug-output",
    "CollectionConfigurations": [
        {
            "CollectionName": "gradients",
            "CollectionParameters" : {
                "save_interval": "500"
            }
        }
    ]
}

This will make the training job save the tensor collection, gradients, every save_interval of 500 steps. To find available CollectionName values, see Debugger Built-in Collections in the SMDebug client library documentation. To find available CollectionParameters parameter keys and values, see the sagemaker.debugger.CollectionConfig class in the SageMaker Python SDK documentation.

To enable Debugger rules for debugging the output tensors

The following DebugRuleConfigurations API example shows how to run the built-in VanishingGradient rule on the saved gradients collection.


"DebugRuleConfigurations": [
    {
        "RuleConfigurationName": "VanishingGradient",
        "RuleEvaluatorImage": "503895931360.dkr.ecr.us-east-1.amazonaws.com/sagemaker-debugger-rules:latest",
        "RuleParameters": {
            "rule_to_invoke": "VanishingGradient",
            "threshold": "20.0"
        }
    }
]

With a configuration like the one in this sample, Debugger starts a rule evaluation job for your training job using the VanishingGradient rule on the collection of gradients tensor. To find a complete list of available Docker images for using the Debugger rules, see Docker images for Debugger rules. To find the key-value pairs for RuleParameters, see List of Debugger built-in rules.

To configure a Debugger built-in rule for profiling system and framework metrics

The following example code shows how to specify the ProfilerConfig API operation to enable collecting system and framework metrics.

To enable Debugger profiling to collect system and framework metrics

Target Step


"ProfilerConfig": { 
    // Optional. Path to an S3 bucket to save profiling outputs
    "S3OutputPath": "s3://<default-bucket>/<training-job-name>/profiler-output", 
    // Available values for ProfilingIntervalInMilliseconds: 100, 200, 500, 1000 (1 second), 5000 (5 seconds), and 60000 (1 minute) milliseconds.
    "ProfilingIntervalInMilliseconds": 500, 
    "ProfilingParameters": {
        "DataloaderProfilingConfig": "{ \"StartStep\": 5, \"NumSteps\": 3, \"MetricsRegex\": \".*\" }",
        "DetailedProfilingConfig": "{ \"StartStep\": 5, \"NumSteps\": 3 }",
        // For PythonProfilingConfig,
        // available ProfilerName options: cProfile, Pyinstrument
        // available cProfileTimer options only when using cProfile: cpu, off_cpu, total_time
        "PythonProfilingConfig": "{ \"StartStep\": 5, \"NumSteps\": 3, \"ProfilerName\": \"cProfile\", \"cProfileTimer\": \"total_time\" }",
        // Optional. Local path for profiling outputs
        "LocalPath": "/opt/ml/output/profiler/" 
    }
}

Target Time Duration


"ProfilerConfig": { 
    // Optional. Path to an S3 bucket to save profiling outputs
    "S3OutputPath": "s3://<default-bucket>/<training-job-name>/profiler-output", 
    // Available values for ProfilingIntervalInMilliseconds: 100, 200, 500, 1000 (1 second), 5000 (5 seconds), and 60000 (1 minute) milliseconds.
    "ProfilingIntervalInMilliseconds": 500,
    "ProfilingParameters": {
        "DataloaderProfilingConfig": "{ \"StartTimeInSecSinceEpoch\": 12345567789, \"DurationInSeconds\": 10, \"MetricsRegex\": \".*\" }",
        "DetailedProfilingConfig": "{ \"StartTimeInSecSinceEpoch\": 12345567789, \"DurationInSeconds\": 10 }",
        // For PythonProfilingConfig,
        // available ProfilerName options: cProfile, Pyinstrument
        // available cProfileTimer options only when using cProfile: cpu, off_cpu, total_time
        "PythonProfilingConfig": "{ \"StartTimeInSecSinceEpoch\": 12345567789, \"DurationInSeconds\": 10, \"ProfilerName\": \"cProfile\", \"cProfileTimer\": \"total_time\" }",
        // Optional. Local path for profiling outputs
        "LocalPath": "/opt/ml/output/profiler/"  
    }
}

To enable Debugger rules for profiling the metrics

The following example code shows how to configure the ProfilerReport rule.


"ProfilerRuleConfigurations": [ 
    {
        "RuleConfigurationName": "ProfilerReport",
        "RuleEvaluatorImage": "895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest",
        "RuleParameters": {
            "rule_to_invoke": "ProfilerReport",
            "CPUBottleneck_cpu_threshold": "90",
            "IOBottleneck_threshold": "90"
        }
    }
]

To find a complete list of available Docker images for using the Debugger rules, see Docker images for Debugger rules. To find the key-value pairs for RuleParameters, see List of Debugger built-in rules.

Update Debugger profiling configuration using the `UpdateTrainingJob` API

Debugger profiling configuration can be updated while your training job is running by using the UpdateTrainingJob API operation. Configure new ProfilerConfig and ProfilerRuleConfiguration objects, and specify the training job name to the TrainingJobName parameter.


{
    "ProfilerConfig": { 
        "DisableProfiler": boolean,
        "ProfilingIntervalInMilliseconds": number,
        "ProfilingParameters": { 
            "string" : "string" 
        }
    },
    "ProfilerRuleConfigurations": [ 
        { 
            "RuleConfigurationName": "string",
            "RuleEvaluatorImage": "string",
            "RuleParameters": { 
                "string" : "string" 
            }
        }
    ],
    "TrainingJobName": "your-training-job-name-YYYY-MM-DD-HH-MM-SS-SSS"
}

Add Debugger custom rule configuration to the `CreateTrainingJob` API

A custom rule can be configured for a training job using the DebugHookConfig and DebugRuleConfiguration objects in the CreateTrainingJob API operation. The following code sample shows how to configure a custom ImproperActivation rule written with the smdebug library using this SageMaker API operation. This example assumes that you’ve written the custom rule in custom_rules.py file and uploaded it to an Amazon S3 bucket. The example provides pre-built Docker images that you can use to run your custom rules. These are listed at Amazon SageMaker Debugger image URIs for custom rule evaluators. You specify the URL registry address for the pre-built Docker image in the RuleEvaluatorImage parameter.


"DebugHookConfig": {
    "S3OutputPath": "s3://<default-bucket>/<training-job-name>/debug-output",
    "CollectionConfigurations": [
        {
            "CollectionName": "relu_activations",
            "CollectionParameters": {
                "include_regex": "relu",
                "save_interval": "500",
                "end_step": "5000"
            }
        }
    ]
},
"DebugRulesConfigurations": [
    {
        "RuleConfigurationName": "improper_activation_job",
        "RuleEvaluatorImage": "552407032007.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-debugger-rule-evaluator:latest",
        "InstanceType": "ml.c4.xlarge",
        "VolumeSizeInGB": 400,
        "RuleParameters": {
           "source_s3_uri": "s3://bucket/custom_rules.py",
           "rule_to_invoke": "ImproperActivation",
           "collection_names": "relu_activations"
        }
    }
]

Warning Javascript is disabled or is unavailable in your browser.

To use the Amazon Web Services Documentation, Javascript must be enabled. Please refer to your browser's Help pages for instructions.

Document Conventions

Configure Debugger using SageMaker API

SDK for Python (Boto3)