Logging and monitoring

To help you debug your compilation jobs, processing jobs, training jobs, endpoints, transform jobs, notebook instances, and notebook instance lifecycle configurations, anything that an algorithm container, a model container, or a notebook instance lifecycle configuration sends to stdout or stderr is also sent to Amazon CloudWatch Logs. You can monitor SageMaker Studio using Amazon CloudWatch, which collects raw data and processes it into readable, near real-time metrics. These statistics are kept for 15 months, so you can access historical information and gain a better perspective on how your application or service is performing.

Logging with CloudWatch

Because the data science process is inherently experimental and iterative, it is essential to log activity such as notebook usage, training and processing job runtimes, training metrics, and endpoint serving metrics such as invocation latency. By default, SageMaker publishes metrics to CloudWatch Logs, and these logs can be encrypted with customer-managed keys using AWS KMS.

You can use VPC endpoints to send logs to CloudWatch without traversing the public internet. You can also set alarms that watch for certain thresholds, and send notifications or take actions when those thresholds are met. For more information, refer to the Amazon CloudWatch User Guide.
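For example, to keep log traffic off the public internet, you can create an interface VPC endpoint for CloudWatch Logs. The following is a minimal boto3 sketch; the region, VPC, subnet, and security group IDs are placeholders to replace with your own:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create an interface endpoint so traffic to CloudWatch Logs stays
# on the AWS network instead of traversing the public internet.
response = ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",              # placeholder VPC ID
    ServiceName="com.amazonaws.us-east-1.logs",
    SubnetIds=["subnet-0123456789abcdef0"],     # placeholder subnet ID
    SecurityGroupIds=["sg-0123456789abcdef0"],  # placeholder security group ID
    PrivateDnsEnabled=True,  # resolve the default logs endpoint to the private IP
)
print(response["VpcEndpoint"]["VpcEndpointId"])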

SageMaker creates a single log group for Studio, under /aws/sagemaker/studio. Each user profile and app has its own log stream under this log group, and lifecycle configuration scripts have their own log streams as well. For example, a user profile named studio-user with a JupyterServer app (with an attached lifecycle configuration script) and a Data Science KernelGateway app has the following log streams:

/aws/sagemaker/studio/<domain-id>/studio-user/JupyterServer/default

/aws/sagemaker/studio/<domain-id>/studio-user/JupyterServer/default/LifecycleConfigOnStart

/aws/sagemaker/studio/<domain-id>/studio-user/KernelGateway/datascience-app
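To illustrate, here is a short boto3 sketch that lists the log streams for a given user profile under the Studio log group; the domain ID and profile name are placeholders:

import boto3

logs = boto3.client("logs")

domain_id = "d-xxxxxxxxxxxx"  # placeholder Studio domain ID
user_profile = "studio-user"  # placeholder user profile name

# Log streams for a user profile live under
# /aws/sagemaker/studio/<domain-id>/<user-profile>/...
paginator = logs.get_paginator("describe_log_streams")
for page in paginator.paginate(
    logGroupName="/aws/sagemaker/studio",
    logStreamNamePrefix=f"{domain_id}/{user_profile}/",
):
    for stream in page["logStreams"]:
        print(stream["logStreamName"])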

For SageMaker to send logs to CloudWatch on your behalf, the caller of the training, processing, and transform job APIs needs the following permissions:

{ "Version": "2012-10-17", "Statement": [ { "Action": [ "logs:CreateLogDelivery", "logs:CreateLogGroup", "logs:CreateLogStream", "logs:DeleteLogDelivery", "logs:Describe*", "logs:GetLogEvents", "logs:GetLogDelivery", "logs:ListLogDeliveries", "logs:PutLogEvents", "logs:PutResourcePolicy", "logs:UpdateLogDelivery" ], "Resource": "*", "Effect": "Allow" } ] }

To encrypt those logs with a customer-managed AWS KMS key, you first need to modify the key policy to allow the CloudWatch Logs service to encrypt and decrypt with the key. After you create a log encryption AWS KMS key, modify its key policy to include the following:

{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": "logs.region.amazonaws.com" }, "Action": [ "kms:Encrypt*", "kms:Decrypt*", "kms:ReEncrypt*", "kms:GenerateDataKey*", "kms:Describe*" ], "Resource": "*", "Condition": { "ArnLike": { "kms:EncryptionContext:aws:logs:arn": "arn:aws:logs:region:account-id:*" } } } ] }

Note that in the key policy you can always use ArnEquals instead and provide the specific Amazon Resource Name (ARN) of the CloudWatch log group you want to encrypt; for simplicity, the policy shown here lets the key encrypt all logs in the account.

Additionally, training jobs, processing jobs, and model endpoints publish metrics such as instance CPU and memory utilization and hosting invocation latency. You can further configure Amazon SNS to notify administrators when certain thresholds are crossed. The caller of the training and processing APIs needs the following permissions:

{ "Version": "2012-10-17", "Statement": [ { "Action": [ "cloudwatch:DeleteAlarms", "cloudwatch:DescribeAlarms", "cloudwatch:GetMetricData", "cloudwatch:GetMetricStatistics", "cloudwatch:ListMetrics", "cloudwatch:PutMetricAlarm", "cloudwatch:PutMetricData", "sns:ListTopics" ], "Resource": "*", "Effect": "Allow", "Condition": { "StringLike": { "cloudwatch:namespace": "aws/sagemaker/*" } } }, { "Action": [ "sns:Subscribe", "sns:CreateTopic" ], "Resource": [ "arn:aws:sns:*:*:*SageMaker*", "arn:aws:sns:*:*:*Sagemaker*", "arn:aws:sns:*:*:*sagemaker*" ], "Effect": "Allow" } ] }

Audit with AWS CloudTrail

To improve your compliance posture, audit all your API calls with AWS CloudTrail. By default, all SageMaker API calls are logged with AWS CloudTrail. You do not need any additional IAM permissions to enable CloudTrail.

All SageMaker actions, with the exception of InvokeEndpoint and InvokeEndpointAsync, are logged by CloudTrail and are documented in the Amazon SageMaker API Reference. For example, calls to the CreateTrainingJob, CreateEndpoint, and CreateNotebookInstance actions generate entries in the CloudTrail log files.
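As an example of consuming those entries, the following boto3 sketch looks up recent CloudTrail events for one SageMaker action; the event name is a placeholder for whichever action you want to audit:

import boto3
import json

cloudtrail = boto3.client("cloudtrail")

# Look up recent management events for a specific SageMaker action.
response = cloudtrail.lookup_events(
    LookupAttributes=[
        {"AttributeKey": "EventName", "AttributeValue": "CreateTrainingJob"}  # placeholder action
    ],
    MaxResults=10,
)

for event in response["Events"]:
    detail = json.loads(event["CloudTrailEvent"])
    # userIdentity records who made the call: root, IAM user, or assumed role.
    print(event["EventTime"], detail["userIdentity"].get("arn"))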

Every CloudTrail event entry contains information about who generated the request. The identity information helps you determine the following:

  • Whether the request was made with root or AWS IAM user credentials.

  • Whether the request was made with temporary security credentials for a role or federated user.

  • Whether the request was made by another AWS service.

For an example event, refer to the Log SageMaker API Calls with CloudTrail documentation.

By default, CloudTrail logs the Studio execution role name of the user profile as the identifier for each event. This works if each user has their own execution role. If multiple users share the same execution role, you can use the sourceIdentity configuration to propagate the Studio user profile name to CloudTrail. Refer to Monitoring user resource access from Amazon SageMaker Studio to enable the sourceIdentity feature. In a shared space, all actions refer to the space ARN as the source, and you cannot audit through sourceIdentity.
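As a sketch of enabling this, the following boto3 call turns on sourceIdentity propagation for a domain. It assumes a placeholder domain ID, that all apps in the domain are stopped (AWS requires this when changing the identity configuration), and that the execution roles' trust policies allow the sts:SetSourceIdentity action:

import boto3

sagemaker = boto3.client("sagemaker")

# Propagate the Studio user profile name as the sourceIdentity
# on credentials vended to apps in this domain.
sagemaker.update_domain(
    DomainId="d-xxxxxxxxxxxx",  # placeholder domain ID
    DomainSettingsForUpdate={
        "ExecutionRoleIdentityConfig": "USER_PROFILE_NAME"
    },
)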