SageMaker Python SDK Troubleshooting Guide
You can use the SageMaker Python SDK to interact with Amazon SageMaker AI from your Python scripts or Jupyter notebooks. Although the SDK simplifies the workflow, you might still encounter exceptions or errors. This troubleshooting guide helps you understand and resolve common issues that arise when working with the SageMaker Python SDK. It covers creating and updating training jobs, processing jobs, and endpoints, as well as general exception handling practices.
The SageMaker Python SDK acts as a wrapper for the low-level SageMaker API operations. The IAM role that you use with the SDK must be able to access the underlying operations. Adding the SageMaker AI Full Access Policy to your IAM role is the most straightforward way to make sure you have permissions to use the SageMaker Python SDK. For more information about the SageMaker AI Full Access Policy, see Amazon SageMaker AI Full Access.
While less convenient, granting more granular permissions is a more secure approach to using the SDK. Each of the following sections lists the permissions required.
Create a Training Job
Important
If you're not adding the SageMaker AI Full Access policy to your IAM role, the role must have permissions to call the CreateTrainingJob and DescribeTrainingJob operations.
The role also requires permissions to:
- Access input/output data in Amazon S3
- Run Amazon EC2 instances
- Log CloudWatch metrics
If your SageMaker training job needs to access resources in an Amazon Virtual Private Cloud (Amazon VPC), make sure that you configure the necessary VPC settings and security groups when you create the training job.
When you're creating a training job, you might run into botocore.exceptions.ClientError or ValueError exceptions.
Update a Training Job
Important
If you're not adding the SageMaker AI Managed Policy to your IAM role, you must give the role access to the following permissions:
- s3:GetObject: Provides permissions to read the model artifacts from Amazon S3 buckets
- s3:PutObject: If applicable, provides permissions to write updates to the model artifacts
- iam:GetRole: Provides permissions to get information about the IAM role needed to run the training job
- sagemaker:UpdateTrainingJob: Provides permissions to modify the training jobs using the UpdateTrainingJob operation
- logs:PutLogEvents: Provides permissions to write logs to Amazon CloudWatch Logs during the update process
When you update a training job, you might run into a botocore.exceptions.ParamValidationError or a botocore.exceptions.ClientError.
Create a Processing Job
Important
If you're not adding the SageMaker AI Managed Policy to your IAM role, you must give the role access to the following permissions:
- sagemaker:CreateProcessingJob: Provides permissions to create a processing job
- sagemaker:DescribeProcessingJob: Provides permissions to get information about a processing job
- s3:GetObject: Provides permissions to read the model artifacts from Amazon S3 buckets
- s3:PutObject: If applicable, provides permissions to write updates to the model artifacts
- logs:PutLogEvents: Provides permissions to write logs to Amazon CloudWatch Logs while the processing job runs
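The permissions listed above can be collected into an IAM policy document. The following is a minimal sketch, not an official policy: the helper name build_policy is hypothetical, and the "*" resource is a placeholder that you would normally narrow to specific bucket and job ARNs.

```python
import json

# Actions taken from the permission list above.
PROCESSING_JOB_ACTIONS = [
    "sagemaker:CreateProcessingJob",
    "sagemaker:DescribeProcessingJob",
    "s3:GetObject",
    "s3:PutObject",
    "logs:PutLogEvents",
]


def build_policy(actions, resources=("*",)):
    """Return an IAM policy document that allows `actions` on `resources`.

    Hypothetical helper for illustration; "*" is a placeholder resource.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": sorted(actions),
                "Resource": list(resources),
            }
        ],
    }


policy = build_policy(PROCESSING_JOB_ACTIONS)
print(json.dumps(policy, indent=2))
```

You can pass a different action list to the same helper to draft policies for the other operations in this guide.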
If your processing job needs to access resources within an Amazon Virtual Private Cloud (Amazon VPC), you must specify its security_group_ids and subnets in the NetworkConfig object that you pass to the processor that you create. (For training jobs, you specify security_group_ids and subnets directly on the estimator.) For an example of how you can access resources within an Amazon VPC, see Secure Training and Inference with VPC.
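Malformed or empty VPC identifiers are a common source of ValueError and ClientError exceptions. The following is a minimal pre-flight check that you could run before creating the job. The function name validate_vpc_settings is hypothetical, and the sketch assumes the usual ID format of a prefix followed by 8 or 17 hexadecimal characters.

```python
import re

# Assumption: IDs are a prefix plus 8 or 17 lowercase hex characters.
_SG_PATTERN = re.compile(r"^sg-(?:[0-9a-f]{8}|[0-9a-f]{17})$")
_SUBNET_PATTERN = re.compile(r"^subnet-(?:[0-9a-f]{8}|[0-9a-f]{17})$")


def validate_vpc_settings(security_group_ids, subnets):
    """Raise ValueError if either list is empty or contains a malformed ID."""
    if not security_group_ids or not subnets:
        raise ValueError("Both security_group_ids and subnets must be non-empty lists.")
    bad = [s for s in security_group_ids if not _SG_PATTERN.match(s)]
    bad += [s for s in subnets if not _SUBNET_PATTERN.match(s)]
    if bad:
        raise ValueError("Malformed VPC identifiers: {}".format(", ".join(bad)))
    return {"security_group_ids": list(security_group_ids), "subnets": list(subnets)}


# Placeholder IDs for illustration only.
settings = validate_vpc_settings(
    security_group_ids=["sg-0123456789abcdef0"],
    subnets=["subnet-0123456789abcdef0", "subnet-89abcdef"],
)
print(settings)
```

The returned dictionary uses the same keyword names as the SDK, so it could be unpacked into sagemaker.network.NetworkConfig(**settings) and passed to a processor through its network_config argument.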
When you're creating a processing job, you might run into a ValueError, an UnexpectedStatusException, or a botocore.exceptions.ClientError.
Create an Endpoint
Important
If you're not adding the SageMaker AI Managed Policy to your IAM role, you must give the role access to the following permissions:
- sagemaker:CreateModel: Provides permissions to create the model that you're deploying to the endpoint
- sagemaker:CreateEndpointConfig: Provides permissions to create an endpoint configuration that defines the endpoint's behavior, such as the instance type and count
- sagemaker:CreateEndpoint: Provides permissions to create the endpoint using the endpoint configuration that you've specified
Additionally, you need permissions to describe and list the models, endpoints, and endpoint configurations.
When you're creating an endpoint, you might run into an UnexpectedStatusException or a botocore.exceptions.ClientError.
The following is an example of an UnexpectedStatusException:
UnexpectedStatusException: Error hosting endpoint gpt2-large-2024-07-03-15-28-20-448: Failed. Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint.. Try changing the instance type or reference the troubleshooting page https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference-troubleshooting.html
The error message tells you to check the Amazon CloudWatch logs. Use the following procedure to check the logs.
To check the CloudWatch logs
- Navigate to the Amazon SageMaker AI console.
- In the left-hand navigation pane, choose Endpoints.
- Select the endpoint that has failed.
- On the Endpoint details page, choose View logs in CloudWatch.
After you've found the logs, look for the specific issue. The following is an example of a CloudWatch log:
NotImplementedError: gptq quantization is not supported for AutoModel, you can try to quantize it with text-generation-server quantize ORIGINAL_MODEL_ID NEW_MODEL_ID
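As an alternative to the console procedure above, you can search the endpoint's logs from Python. The following is a minimal sketch: find_error_lines and ERROR_MARKERS are hypothetical names, logs_client is assumed to behave like boto3.client("logs"), and the stand-in client exists only so that the example runs without AWS credentials. It reads a single page of events and filters them locally.

```python
# Substrings that usually mark a failure in container logs (an assumption).
ERROR_MARKERS = ("Error", "Exception", "Traceback")


def find_error_lines(logs_client, endpoint_name, limit=200):
    """Return log messages for the endpoint that contain an error marker.

    `logs_client` is expected to behave like boto3.client("logs").
    Only the first page of results (up to `limit` events) is inspected.
    """
    # SageMaker AI writes endpoint container logs to this log group.
    log_group = "/aws/sagemaker/Endpoints/{}".format(endpoint_name)
    response = logs_client.filter_log_events(logGroupName=log_group, limit=limit)
    return [
        event["message"].rstrip()
        for event in response.get("events", [])
        if any(marker in event["message"] for marker in ERROR_MARKERS)
    ]


class FakeLogsClient:
    """Stand-in for the real client so the sketch runs without AWS access."""

    def filter_log_events(self, logGroupName, limit):
        return {
            "events": [
                {"message": "INFO: loading model\n"},
                {"message": "NotImplementedError: gptq quantization is not supported for AutoModel\n"},
            ]
        }


for line in find_error_lines(FakeLogsClient(), "gpt2-large-2024-07-03-15-28-20-448"):
    print(line)
```

With real credentials, you would pass boto3.client("logs") instead of FakeLogsClient(); the IAM role then also needs permission to read the log group.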
For information about resolving the botocore.exceptions.ClientError, see Guidance on exception handling.
Update an Endpoint
Important
If you're not adding the SageMaker AI Managed Policy to your IAM role, you must give the role access to the following permissions:
- sagemaker:UpdateEndpoint: Provides permissions to update an existing endpoint, such as changing the endpoint's instance type or count
- sagemaker:UpdateEndpointWeightsAndCapacities: Provides permissions to update the traffic weights and instance counts of the production variants on an existing endpoint
- sagemaker:DescribeEndpoint: Provides permissions to describe the current configuration of the endpoint, which is often required before the update
Additionally, you might need permissions to describe and list the endpoints and endpoint configurations.
You can run into a ValueError, such as the following:
ValueError: Endpoint with name 'abc' does not exist; please use an existing endpoint name
The error indicates that the specified endpoint name does not match any existing endpoints in your AWS account. Use the following procedure to troubleshoot the error:
To troubleshoot a ValueError
- Use the following code to list all of your endpoints:

    import sagemaker

    sagemaker_session = sagemaker.Session()

    # List all endpoints
    endpoints = sagemaker_session.sagemaker_client.list_endpoints()
    print(endpoints)

- Verify that the endpoint that you've specified to the update_endpoint function is in the list.
- Make sure that you're operating in the correct AWS Region. SageMaker AI endpoints are Region-specific.
- Make sure that the IAM role that you're using has permissions to list, describe, and update the endpoints.
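The first two steps of the procedure can be combined into one check. The following is a minimal sketch: endpoint_exists is a hypothetical helper, sm_client is assumed to behave like sagemaker_session.sagemaker_client, and the stand-in client only simulates a two-page list_endpoints response so that the example runs offline.

```python
def endpoint_exists(sm_client, endpoint_name):
    """Return True if an endpoint with exactly this name exists.

    Only the Region that `sm_client` is configured for is searched.
    """
    kwargs = {"NameContains": endpoint_name}
    while True:
        page = sm_client.list_endpoints(**kwargs)
        if any(e["EndpointName"] == endpoint_name for e in page.get("Endpoints", [])):
            return True
        token = page.get("NextToken")
        if not token:
            return False
        # Results can span several pages, so keep following the token.
        kwargs["NextToken"] = token


class FakeSageMakerClient:
    """Two-page stand-in for sagemaker_session.sagemaker_client."""

    def list_endpoints(self, NameContains, NextToken=None):
        if NextToken is None:
            return {"Endpoints": [{"EndpointName": "abc-staging"}], "NextToken": "page-2"}
        return {"Endpoints": [{"EndpointName": "abc-prod"}]}


client = FakeSageMakerClient()
print(endpoint_exists(client, "abc-prod"))
print(endpoint_exists(client, "abc"))
```

Because NameContains matches substrings, the helper compares names exactly before it reports a match. To confirm which Region the session targets, you can inspect sagemaker_session.boto_region_name.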
Guidance on exception handling
If you can't find information to help you fix your specific issue, you can use the following code examples as a starting point for handling exceptions.
The following is a generic example that you can use to catch most exceptions.
    import sagemaker
    from botocore.exceptions import ParamValidationError, ClientError

    try:
        # some_api_call is a placeholder for the SDK call that you're making
        sagemaker.some_api_call(SomeParam='some_param')
    except ClientError as error:
        # Put your error handling logic here
        raise error
    except ParamValidationError as error:
        raise ValueError('The parameters you provided are incorrect: {}'.format(error))
    except ValueError as error:
        # Catch generic ValueError exceptions
        raise error
There are two main categories of errors:
- Errors specific to the SageMaker Python SDK
- Errors specific to the underlying AWS service
Errors specific to the underlying AWS service are always botocore.exceptions.ClientError exceptions. The response attribute of a botocore.exceptions.ClientError is a dictionary that contains an Error object and a ResponseMetadata object. The following shows the template of a client error:
    {
        'Error': {
            'Code': 'SomeServiceException',
            'Message': 'Details/context around the exception or error'
        },
        'ResponseMetadata': {
            'RequestId': '1234567890ABCDEF',
            'HostId': 'host ID data will appear here as a hash',
            'HTTPStatusCode': 400,
            'HTTPHeaders': {'header metadata key/values will appear here'},
            'RetryAttempts': 0
        }
    }
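To make the template concrete, the following minimal sketch flattens such a response dictionary into the fields that are usually worth logging. The function name summarize_error_response is hypothetical, and the sample values are copied from the template above rather than from a real service call.

```python
def summarize_error_response(response):
    """Flatten a ClientError-style response dict into the fields worth logging."""
    error = response.get("Error", {})
    metadata = response.get("ResponseMetadata", {})
    return {
        "code": error.get("Code", "Unknown"),
        "message": error.get("Message", ""),
        "request_id": metadata.get("RequestId"),
        "http_status": metadata.get("HTTPStatusCode"),
        "retry_attempts": metadata.get("RetryAttempts", 0),
    }


# Sample values taken from the template; not a real service response.
sample_response = {
    "Error": {
        "Code": "SomeServiceException",
        "Message": "Details/context around the exception or error",
    },
    "ResponseMetadata": {
        "RequestId": "1234567890ABCDEF",
        "HTTPStatusCode": 400,
        "RetryAttempts": 0,
    },
}

summary = summarize_error_response(sample_response)
print(summary)
```

Inside an except block, you would pass err.response from the caught botocore.exceptions.ClientError to the same function.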
The following is an example of the specific error handling that you can do with the botocore.exceptions.ClientError:

    import botocore.exceptions
    import sagemaker

    try:
        # some_api_call is a placeholder for the SDK call that you're making
        sagemaker.some_api_call(SomeParam='some_param')
    except botocore.exceptions.ClientError as err:
        if err.response['Error']['Code'] == 'InternalError':  # Generic error
            # We grab the message, request ID, and HTTP code to give to customer support
            print('Error Message: {}'.format(err.response['Error']['Message']))
            print('Request ID: {}'.format(err.response['ResponseMetadata']['RequestId']))
            print('Http code: {}'.format(err.response['ResponseMetadata']['HTTPStatusCode']))
            raise err
        elif err.response['Error']['Code'] == 'ValidationException':
            raise ValueError(err.response['Error']['Message'])
        else:
            # Re-raise any error code that isn't handled above
            raise err
For more information about how you can handle ClientError exceptions, see Parsing error responses and catching exceptions from AWS services.
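Some error codes indicate a transient condition, and a short retry can be enough to get past them. The following is a minimal sketch of retrying with exponential backoff based on the error code. The names call_with_retries, RETRYABLE_CODES, and FakeClientError are hypothetical, the choice of ThrottlingException as the only retryable code is an assumption, and botocore already retries some errors on its own, so treat this as an illustration rather than a recommended default.

```python
import time

# Assumption: only this code is treated as transient in this sketch.
RETRYABLE_CODES = {"ThrottlingException"}


def call_with_retries(operation, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `operation()` and retry when it raises an error with a retryable code."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as err:
            # botocore's ClientError exposes the parsed error as `err.response`.
            response = getattr(err, "response", None) or {}
            code = response.get("Error", {}).get("Code")
            if code not in RETRYABLE_CODES or attempt == max_attempts:
                raise
            # Wait 1x, 2x, 4x ... the base delay before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))


class FakeClientError(Exception):
    """Stand-in that mimics the `response` attribute of a ClientError."""

    def __init__(self, code):
        super().__init__(code)
        self.response = {"Error": {"Code": code, "Message": "simulated"}}


attempts = []


def flaky_call():
    """Fail twice with a throttling error, then succeed."""
    attempts.append(1)
    if len(attempts) < 3:
        raise FakeClientError("ThrottlingException")
    return "ok"


result = call_with_retries(flaky_call, sleep=lambda seconds: None)
print(result, len(attempts))
```

Errors with any other code, and exceptions without a response attribute such as ValueError, are re-raised immediately so that they surface to the caller.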