Logs for Built-in Algorithms - Amazon SageMaker

Logs for Built-in Algorithms

Amazon SageMaker algorithms produce Amazon CloudWatch logs, which provide detailed information on the training process. To see the logs, in the AWS management console, choose CloudWatch, choose Logs, and then choose the /aws/sagemaker/TrainingJobs log group. Each training job has one log stream per node on which it was trained. The log stream’s name begins with the value specified in the TrainingJobName parameter when the job was created.

Note

If a job fails and logs do not appear in CloudWatch, it's likely that an error occurred before the start of training. Reasons include specifying the wrong training image or S3 location.

The contents of logs vary by algorithms. However, you can typically find the following information:

  • Confirmation of arguments provided at the beginning of the log

  • Errors that occurred during training

  • Measurement of an algorithm's accuracy or numerical performance

  • Timings for the algorithm and any major stages within the algorithm

Common Errors

If a training job fails, some details about the failure are provided by the FailureReason return value in the training job description, as follows:

sage = boto3.client('sagemaker') sage.describe_training_job(TrainingJobName=job_name)['FailureReason']

Others are reported only in the CloudWatch logs. Common errors include the following:

  1. Misspecifying a hyperparameter or specifying a hyperparameter that is invalid for the algorithm.

    From the CloudWatch Log

    [10/16/2017 23:45:17 ERROR 139623806805824 train.py:48] Additional properties are not allowed (u'mini_batch_siz' was unexpected)
  2. Specifying an invalid value for a hyperparameter.

    FailureReason

    AlgorithmError: u'abc' is not valid under any of the given schemas\n\nFailed validating u'oneOf' in schema[u'properties'][u'feature_dim']:\n {u'oneOf': [{u'pattern': u'^([1-9][0-9]*)$', u'type': u'string'},\n {u'minimum': 1, u'type': u'integer'}]}\

    FailureReason

    [10/16/2017 23:57:17 ERROR 140373086025536 train.py:48] u'abc' is not valid under any of the given schemas
  3. Inaccurate protobuf file format.

    From the CloudWatch log

    [10/17/2017 18:01:04 ERROR 140234860816192 train.py:48] cannot copy sequence with size 785 to array axis with dimension 784