Create Actions on Rules Using Amazon CloudWatch and AWS Lambda - Amazon SageMaker

Amazon CloudWatch collects Amazon SageMaker model training job logs and Amazon SageMaker Debugger rule processing job logs. Configure Debugger with Amazon CloudWatch Events and AWS Lambda to take action based on Debugger rule evaluation status.

CloudWatch Logs for Debugger Rules and Training Jobs

To find training job logs and Debugger rule job logs
  1. Open the CloudWatch console at https://console.aws.amazon.com/cloudwatch/.

  2. In the left navigation pane under the Log node, choose Log Groups.

  3. In the log groups list, do the following:

    • Choose /aws/sagemaker/TrainingJobs for training job logs.

    • Choose /aws/sagemaker/ProcessingJobs for Debugger rule job logs.

You can use the training and Debugger rule job status in the CloudWatch logs to take further actions when there are training issues.

For more information about monitoring training jobs using CloudWatch, see Monitor Amazon SageMaker.
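The same log groups can also be queried from code. The following is a minimal sketch, assuming the AWS SDK for Python (Boto3) is installed and credentials are configured. The helper names log_group_for and list_job_log_streams are hypothetical, and the sketch assumes that log stream names begin with the job name, which you should verify in your own account. Only the first page of results is returned.

```python
# Log groups named in the procedure above.
LOG_GROUPS = {
    "training": "/aws/sagemaker/TrainingJobs",
    "debugger_rule": "/aws/sagemaker/ProcessingJobs",
}


def log_group_for(job_kind):
    """Return the CloudWatch log group for a job kind (hypothetical helper)."""
    try:
        return LOG_GROUPS[job_kind]
    except KeyError:
        raise ValueError(f"unknown job kind: {job_kind!r}") from None


def list_job_log_streams(job_kind, job_name):
    """List log stream names for one job (first page of results only).

    Assumption: stream names start with the job name.
    """
    import boto3  # imported here so log_group_for works without the SDK

    logs = boto3.client("logs")
    response = logs.describe_log_streams(
        logGroupName=log_group_for(job_kind),
        logStreamNamePrefix=job_name,
    )
    return [stream["logStreamName"] for stream in response["logStreams"]]
```

For example, list_job_log_streams("debugger_rule", "my-rule-job") would look in /aws/sagemaker/ProcessingJobs, where my-rule-job is a placeholder name.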

Set Up Debugger for Automated Training Job Termination Using CloudWatch and Lambda

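Debugger rules monitor the training job, and a CloudWatch Events rule watches the Debugger rule evaluation status of that training job. When a rule reports an issue, the CloudWatch Events rule invokes a Lambda function that stops the training job.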

Step 1: Create a Lambda Function

To create a Lambda function
  1. Open the AWS Lambda console at https://console.aws.amazon.com/lambda/.

  2. In the left navigation pane, choose Functions and then choose Create function.

  3. On the Create function page, choose the Author from scratch option.

  4. In the Basic information section, enter a Function name (for example, debugger-rule-stop-training-job).

  5. For Runtime, choose Python 3.7.

  6. For Permissions, expand the drop-down option, and choose Change default execution role.

  7. For Execution role, choose Use an existing role and choose the IAM role that you use for training jobs on SageMaker.

    Note

    Make sure you use the execution role with AmazonSageMakerFullAccess and AWSLambdaBasicExecutionRole attached. Otherwise, the Lambda function won't properly react to the Debugger rule status changes of the training job. If you are unsure which execution role is being used, run the following code in a Jupyter notebook cell to retrieve the execution role output:

    import sagemaker
    sagemaker.get_execution_role()

  8. At the bottom of the page, choose Create function.

The following figure shows an example of the Create function page with the input fields and selections completed.


                        Create Function page.
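The console steps above could also be scripted. The following is a rough sketch, assuming Boto3 is installed and credentials are configured. The helper names build_deployment_zip and create_stop_training_function are hypothetical, and the runtime identifier and role ARN are supplied by the caller rather than assumed here.

```python
import io
import zipfile


def build_deployment_zip(handler_source, filename="lambda_function.py"):
    """Package a single Python source string as an in-memory zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(filename, handler_source)
    return buffer.getvalue()


def create_stop_training_function(
    role_arn,
    handler_source,
    runtime,
    function_name="debugger-rule-stop-training-job",
):
    """Create the Lambda function from source code held in a string.

    role_arn should be the execution role described in the note above.
    """
    import boto3  # imported here so the zip helper works without the SDK

    lambda_client = boto3.client("lambda")
    return lambda_client.create_function(
        FunctionName=function_name,
        Runtime=runtime,
        Role=role_arn,
        # Matches the default file name used by build_deployment_zip.
        Handler="lambda_function.lambda_handler",
        Code={"ZipFile": build_deployment_zip(handler_source)},
    )
```

Passing the handler source as a string keeps the sketch self-contained. For larger functions, uploading a packaged archive is likely the better choice.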

Step 2: Configure the Lambda function

To configure the Lambda function
  1. In the Function code section of the configuration page, paste the following Python script in the Lambda code editor pane. The lambda_handler function monitors the Debugger rule evaluation status collected by CloudWatch and triggers the StopTrainingJob API operation. The AWS SDK for Python (Boto3) client for SageMaker provides a high-level method, stop_training_job, which triggers the StopTrainingJob API operation.

    import json
    import boto3
    import logging

    logger = logging.getLogger()
    logger.setLevel(logging.INFO)

    def lambda_handler(event, context):
        training_job_name = event.get("detail").get("TrainingJobName")
        logging.info(f'Evaluating Debugger rules for training job: {training_job_name}')
        eval_statuses = event.get("detail").get("DebugRuleEvaluationStatuses", None)

        if eval_statuses is None or len(eval_statuses) == 0:
            logging.info("Couldn't find any debug rule statuses, skipping...")
            return {
                'statusCode': 200,
                'body': json.dumps('Nothing to do')
            }

        # should only attempt stopping jobs with InProgress status
        training_job_status = event.get("detail").get("TrainingJobStatus", None)
        if training_job_status != 'InProgress':
            logging.debug(f"Current Training job status({training_job_status}) is not 'InProgress'. Exiting")
            return {
                'statusCode': 200,
                'body': json.dumps('Nothing to do')
            }

        client = boto3.client('sagemaker')

        for status in eval_statuses:
            logging.info(status.get("RuleEvaluationStatus") + ', RuleEvaluationStatus=' + str(status))
            if status.get("RuleEvaluationStatus") == "IssuesFound":
                secondary_status = event.get("detail").get("SecondaryStatus", None)
                logging.info(
                    f'About to stop training job, since evaluation of rule configuration {status.get("RuleConfigurationName")} resulted in "IssuesFound". ' +
                    f'\ntraining job "{training_job_name}" status is "{training_job_status}", secondary status is "{secondary_status}"' +
                    f'\nAttempting to stop training job "{training_job_name}"'
                )
                try:
                    client.stop_training_job(
                        TrainingJobName=training_job_name
                    )
                except Exception as e:
                    logging.error(
                        "Encountered error while trying to "
                        "stop training job {}: {}".format(
                            training_job_name, str(e)
                        )
                    )
                    raise e
        return None

    For more information about the Lambda code editor interface, see Creating functions using the AWS Lambda console editor.

  2. Skip all other settings and choose Save at the top of the configuration page.
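Before wiring the function to live events, it can help to check the stop decision locally. The following sketch mirrors the conditions described in step 1 (the job must be InProgress and at least one rule must report IssuesFound) without calling any AWS API. The function name rules_with_issues, the sample event, and the job and rule names in it are hypothetical; the event is reduced to the fields the handler reads.

```python
def rules_with_issues(event):
    """Return the rule configuration names that would trigger a stop."""
    detail = event.get("detail", {})
    # Only jobs that are still running are candidates for stopping.
    if detail.get("TrainingJobStatus") != "InProgress":
        return []
    statuses = detail.get("DebugRuleEvaluationStatuses") or []
    return [
        status.get("RuleConfigurationName")
        for status in statuses
        if status.get("RuleEvaluationStatus") == "IssuesFound"
    ]


# Hypothetical event containing only the fields used above.
sample_event = {
    "detail": {
        "TrainingJobName": "example-training-job",
        "TrainingJobStatus": "InProgress",
        "SecondaryStatus": "Training",
        "DebugRuleEvaluationStatuses": [
            {"RuleConfigurationName": "LossNotDecreasing",
             "RuleEvaluationStatus": "IssuesFound"},
            {"RuleConfigurationName": "Overfit",
             "RuleEvaluationStatus": "InProgress"},
        ],
    }
}

print(rules_with_issues(sample_event))  # prints ['LossNotDecreasing']
```

An empty list means the handler would return without stopping the job.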

Step 3: Create a CloudWatch Events Rule and Link to the Lambda Function for Debugger

To create a CloudWatch Events rule and link to the Lambda function for Debugger
  1. Open the CloudWatch console at https://console.aws.amazon.com/cloudwatch/.

  2. In the left navigation pane, choose Rules under the Events node.

  3. Choose Create rule.

  4. In the Event Source section of the Step 1: Create rule page, choose SageMaker for Service Name, and choose SageMaker Training Job State Change for Event Type. The Event Pattern Preview should look like the following example JSON:

    {
        "source": [
            "aws.sagemaker"
        ],
        "detail-type": [
            "SageMaker Training Job State Change"
        ]
    }

  5. In the Targets section, choose Add target, and choose the debugger-rule-stop-training-job Lambda function that you created. This step links the CloudWatch Events rule with the Lambda function.

  6. Choose Configure details and go to the Step 2: Configure rule details page.

  7. Specify the CloudWatch rule definition name. For example, debugger-cw-event-rule.

  8. Choose Create rule to finish.

  9. Go back to the Lambda function configuration page and refresh the page. Confirm that it's configured correctly in the Designer panel. The CloudWatch Events rule should be registered as a trigger for the Lambda function. The configuration design should look like the following example:

    
                                Designer panel for the CloudWatch configuration.
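The rule and its target could also be created with Boto3 instead of the console. The following is a sketch under stated assumptions: Boto3 is installed, credentials are configured, and the caller has permission to manage rules and Lambda permissions. The helper names, the statement ID, and the target ID are hypothetical. Unlike the console, the API does not grant the rule permission to invoke the function automatically, so the sketch adds that permission explicitly.

```python
import json


def build_event_pattern():
    """Return the event pattern from step 4 as a JSON string."""
    return json.dumps({
        "source": ["aws.sagemaker"],
        "detail-type": ["SageMaker Training Job State Change"],
    })


def link_rule_to_function(
    function_arn,
    rule_name="debugger-cw-event-rule",
    statement_id="debugger-cw-event-rule-invoke",  # hypothetical ID
):
    """Create the rule, allow it to invoke the function, and add the target."""
    import boto3  # imported here so build_event_pattern works without the SDK

    events = boto3.client("events")
    lambda_client = boto3.client("lambda")

    rule_arn = events.put_rule(
        Name=rule_name,
        EventPattern=build_event_pattern(),
        State="ENABLED",
    )["RuleArn"]

    # Let CloudWatch Events invoke the function on behalf of this rule.
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId=statement_id,
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule_arn,
    )

    events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "stop-training-job", "Arn": function_arn}],
    )
    return rule_arn
```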

Run Example Notebooks to Test Automated Training Job Termination

You can run the following example notebooks, which are prepared for experimenting with stopping a training job using Debugger's built-in rules.

Disable the CloudWatch Events Rule to Stop Using the Automated Training Job Termination

To disable automated training job termination, disable the CloudWatch Events rule. In the Lambda Designer panel, choose the EventBridge (CloudWatch Events) block linked to the Lambda function. This shows an EventBridge panel below the Designer panel (for example, see the previous screenshot). Select the check box next to EventBridge (CloudWatch Events): debugger-cw-event-rule, and then choose Disable. If you want to use the automated termination functionality later, you can enable the CloudWatch Events rule again.
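The rule can also be toggled from code. The following is a minimal sketch, assuming a Boto3 events client is passed in; the helper name set_rule_enabled is hypothetical. Taking the client as a parameter keeps the helper easy to exercise with a stub.

```python
def set_rule_enabled(events_client, rule_name, enabled):
    """Enable or disable a CloudWatch Events rule by name."""
    if enabled:
        events_client.enable_rule(Name=rule_name)
    else:
        events_client.disable_rule(Name=rule_name)


# Usage (requires AWS credentials and the Boto3 package):
#   import boto3
#   set_rule_enabled(boto3.client("events"), "debugger-cw-event-rule", enabled=False)
```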