You can use the SageMaker Python SDK to interact with Amazon SageMaker AI within your Python scripts or Jupyter notebooks. Although the SDK provides a simplified workflow, you might still encounter exceptions or errors. This troubleshooting guide helps you understand and resolve common issues that arise when you work with the SageMaker Python SDK. It covers creating and updating training jobs, creating processing jobs, creating and updating endpoints, and general exception handling practices.
The SageMaker Python SDK acts as a wrapper for the low-level SageMaker API operations. The IAM role that you're using to access the SDK must be able to access the underlying operations. Adding the SageMaker AI Full Access Policy to your IAM role is the most straightforward way to make sure you have permissions to use the SageMaker Python SDK. For more information about the SageMaker AI Full Access Policy, see Amazon SageMaker AI Full Access.
While less convenient, granting more granular permissions is a more secure approach to using the SDK. Each of the following sections lists the permissions that it requires.
Create a Training Job
Important
If you're not adding the SageMaker AI Full Access policy to your IAM role, it must have permissions to call the CreateTrainingJob and DescribeTrainingJob operations.
It also requires permissions to:
- Access input/output data in S3
- Run Amazon EC2 instances
- Log CloudWatch metrics
If your SageMaker training job needs to access resources in an Amazon Virtual Private Cloud (Amazon VPC), make sure that you configure the necessary VPC settings and security groups when you create the training job.
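As a rough illustration of the granular approach, the following sketch builds an IAM policy document for the permissions listed above. Only `sagemaker:CreateTrainingJob` and `sagemaker:DescribeTrainingJob` come from this section. The Amazon S3 and CloudWatch actions, the helper name `build_training_policy`, and the bucket name are assumptions chosen for illustration, and the Amazon EC2 and VPC permissions are left out because they depend on your network setup. Treat it as a starting point to review, not a verified minimum policy.

```python
import json


def build_training_policy(bucket_name):
    """Return an IAM policy document (as a dict) for running training jobs.

    The action lists are illustrative assumptions, not an official minimum.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # The two SageMaker operations named in this section
                "Effect": "Allow",
                "Action": [
                    "sagemaker:CreateTrainingJob",
                    "sagemaker:DescribeTrainingJob",
                ],
                "Resource": "*",
            },
            {
                # Read input data and write output artifacts (assumed bucket)
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
            {
                # Publish metrics and logs (assumed set of CloudWatch actions)
                "Effect": "Allow",
                "Action": [
                    "cloudwatch:PutMetricData",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                ],
                "Resource": "*",
            },
        ],
    }


policy = build_training_policy("amzn-s3-demo-bucket")
print(json.dumps(policy, indent=2))
```

You can paste the printed JSON into the IAM console as a starting point, then narrow the `Resource` entries to the specific jobs and buckets that you use.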
When you're creating a training job, you might run into botocore.exceptions.ClientError or ValueError exceptions.
ValueError exceptions occur when there's an issue with the values or parameters that you're passing to a function. Use the following list to see examples of ValueError exceptions and how to fix them.
- ValueError: either image_uri or algorithm_arn is required. None was provided
  - If you're using the AlgorithmEstimator class, provide the algorithm_arn.
  - If you're using the Estimator class, provide the image_uri.
- ValueError: Unknown input channel: train is not supported by: scikit-decision-trees-15423055-57b73412d2e93e9239e4e16f83298b8f
  You get this error when you provide an invalid input channel. An input channel is a data source or parameter that the model expects.
  On the Types of Algorithms page, you can navigate to the model to find information about the model's input channels.
  You can also find information about the input channels within the Usage section on the AWS Marketplace page of the algorithm.
  Use the following procedure to get information about an algorithm's input channels.
  To get information about an algorithm's input channels
  - Navigate to the SageMaker AI console.
  - On the left-hand navigation, choose Training.
  - Select Algorithms.
  - Choose Find algorithm.
  - Find your algorithm in the resulting list.
  - Select the Usage tab.
  - Navigate to the Channel specification heading.
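If the algorithm is available as an algorithm resource in your account, you might also read its channel specification programmatically. The following is a minimal sketch, assuming a Boto3 SageMaker client and its `describe_algorithm` operation; the helper name `list_training_channels` is hypothetical. The client is passed in as an argument, so the block itself makes no AWS calls.

```python
def list_training_channels(sagemaker_client, algorithm_name):
    """Return (channel name, is_required) pairs for an algorithm resource.

    sagemaker_client is assumed to be a Boto3 SageMaker client, for example
    boto3.client("sagemaker"). algorithm_name can be a name or an ARN.
    """
    response = sagemaker_client.describe_algorithm(AlgorithmName=algorithm_name)
    channels = response["TrainingSpecification"]["TrainingChannels"]
    return [(channel["Name"], channel.get("IsRequired", False)) for channel in channels]
```

Compare the returned names with the keys of the dictionary that you pass to `fit()`. A key that isn't in the list, such as `train` when the algorithm expects `training`, would explain the error.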
Update a Training Job
Important
If you're not adding the SageMaker AI Managed Policy to your IAM role, you must give the role access to the following permissions:
- s3:GetObject – Provides permissions to read the model artifacts from Amazon S3 buckets
- s3:PutObject – If applicable, provides permissions to write updates to the model artifacts
- iam:GetRole – Provides permissions to get information about the IAM role needed to run the training job
- sagemaker:UpdateTrainingJob – Provides permissions to modify the training jobs using the UpdateTrainingJob operation
- logs:PutLogEvents – Provides permissions to write logs to Amazon CloudWatch Logs during the update process
When you update a training job, you might run into a botocore.exceptions.ParamValidationError or a botocore.exceptions.ClientError.
The ClientError has the following message:
botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the UpdateTrainingJob operation: Invalid UpdateTrainingJobRequest, the request cannot be empty
If you're running into this error, you must include at least one of the following parameters along with the name of the training job:
- profiler_rule_configs (list) – A list of profiler rule configurations. By default, there are no profiler rule configurations.
- profiler_config (dict) – The configuration for how SageMaker AI Profiler collects metrics and sends them out. By default, there is no profiler configuration.
- resource_config (dict) – The configuration for the training job resources. You can update the keep-alive period if the warm pool status is Available. No other fields can be updated.
- remote_debug_config (dict) – Configuration for RemoteDebug. The dictionary can contain EnableRemoteDebug (bool).
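One way to avoid sending an empty request is to check the arguments before calling the SDK. The following is a minimal sketch, assuming that `session` is a `sagemaker.Session` object whose `update_training_job` method accepts the job name followed by the keyword arguments listed above. The wrapper `update_training_job_safely` is a hypothetical helper, not part of the SDK.

```python
ALLOWED_UPDATES = {
    "profiler_rule_configs",
    "profiler_config",
    "resource_config",
    "remote_debug_config",
}


def update_training_job_safely(session, job_name, **updates):
    """Call session.update_training_job only when there is something to update."""
    unknown = set(updates) - ALLOWED_UPDATES
    if unknown:
        raise ValueError("Unsupported update parameters: {}".format(sorted(unknown)))

    # Drop parameters that were passed as None so that the request isn't empty
    provided = {name: value for name, value in updates.items() if value is not None}
    if not provided:
        raise ValueError(
            "Provide at least one of: {}".format(", ".join(sorted(ALLOWED_UPDATES)))
        )
    return session.update_training_job(job_name, **provided)
```

For example, `update_training_job_safely(session, "my-training-job", resource_config={"KeepAlivePeriodInSeconds": 600})` would forward only the keep-alive update, where the job name is a placeholder.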
Create a Processing Job
Important
If you're not adding the SageMaker AI Managed Policy to your IAM role, you must give the role access to the following permissions:
- sagemaker:CreateProcessingJob – Provides permissions to create a processing job
- sagemaker:DescribeProcessingJob – Provides permissions to get information about a processing job
- s3:GetObject – Provides permissions to read the job's input data and scripts from Amazon S3 buckets
- s3:PutObject – If applicable, provides permissions to write the job's output to Amazon S3 buckets
- logs:PutLogEvents – Provides permissions to write logs to Amazon CloudWatch Logs while the processing job runs
If your processing job needs to access resources within an Amazon Virtual Private Cloud (Amazon VPC), you must specify its security_group_ids and subnets within the network configuration of the processor that you create. For an example of how you can access resources within an Amazon VPC, see Secure Training and Inference with VPC.
When you're creating a processing job, you might run into a ValueError, an UnexpectedStatusException, or a botocore.exceptions.ClientError.
The following is an example of a ValueError:
ValueError: code preprocess.py wasn't found. Please make sure that the file exists.
This error means that the path that you specified isn't correct. You can specify either a relative path or an absolute path to your script file. A relative path is resolved against the directory that your script or notebook is running in. For more information about specifying paths to your files, see sagemaker.processing.RunArgs.
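A quick local check can show where a relative path actually points before you submit the job. The following sketch uses only the Python standard library; `resolve_script` is a hypothetical helper, and the assumption is that your script is a local file rather than an Amazon S3 URI.

```python
from pathlib import Path


def resolve_script(code, base_dir=None):
    """Return the absolute path of a local script, or raise if it's missing.

    Relative paths are resolved against base_dir, which defaults to the
    current working directory.
    """
    base = Path(base_dir) if base_dir else Path.cwd()
    path = Path(code)
    if not path.is_absolute():
        path = base / path
    path = path.resolve()
    if not path.is_file():
        raise FileNotFoundError(
            "Script not found: {} (resolved from {!r} relative to {})".format(path, code, base)
        )
    return str(path)
```

Passing the returned absolute path as the `code` argument removes any dependence on the directory that the notebook kernel happens to be running in.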
Create an Endpoint
Important
If you're not adding the SageMaker AI Managed Policy to your IAM role, you must give the role access to the following permissions:
- sagemaker:CreateModel – Provides permissions to create the model that you're deploying to the endpoint
- sagemaker:CreateEndpointConfig – Provides permissions to create an endpoint configuration that defines the endpoint's behavior, such as the instance type and count
- sagemaker:CreateEndpoint – Provides permissions to create the endpoint using the endpoint configuration that you've specified
Additionally, you need permissions to describe and list the models, endpoints, and endpoint configurations.
When you're creating an endpoint, you might run into an UnexpectedStatusException or a botocore.exceptions.ClientError.
The following is an example of an UnexpectedStatusException:
UnexpectedStatusException: Error hosting endpoint gpt2-large-2024-07-03-15-28-20-448: Failed. Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint.. Try changing the instance type or reference the troubleshooting page https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference-troubleshooting.html
The error message tells you to check the Amazon CloudWatch logs. Use the following procedure to check the logs.
To check the CloudWatch logs
- Navigate to the Amazon SageMaker AI console.
- On the left-hand navigation, choose Endpoints.
- Select the endpoint that has failed.
- On the Endpoint details page, choose View logs in CloudWatch.
After you've found the logs, look for the specific issue. The following is an example of a CloudWatch log:
NotImplementedError: gptq quantization is not supported for AutoModel, you can try to quantize it with text-generation-server quantize ORIGINAL_MODEL_ID NEW_MODEL_ID
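You can also pull recent log messages from code instead of the console. The following is a minimal sketch, assuming a Boto3 CloudWatch Logs client and assuming that the endpoint writes to the log group `/aws/sagemaker/Endpoints/<endpoint-name>`. The helper name `fetch_endpoint_log_messages` is hypothetical, and the client is passed in so that the block makes no AWS calls by itself.

```python
def fetch_endpoint_log_messages(logs_client, endpoint_name, pattern=None, limit=50):
    """Return up to `limit` log messages for an endpoint.

    logs_client is assumed to be a Boto3 CloudWatch Logs client, for example
    boto3.client("logs"). pattern is an optional CloudWatch Logs filter pattern.
    """
    request = {
        # Assumed log group naming convention for SageMaker AI endpoints
        "logGroupName": "/aws/sagemaker/Endpoints/{}".format(endpoint_name),
        "limit": limit,
    }
    if pattern:
        request["filterPattern"] = pattern
    response = logs_client.filter_log_events(**request)
    return [event["message"] for event in response.get("events", [])]
```

Scanning the returned messages for exception names, such as the `NotImplementedError` in the preceding log example, usually points to the reason that the container failed its health check.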
For information about resolving the botocore.exceptions.ClientError, see Guidance on exception handling.
Update an Endpoint
Important
If you're not adding the SageMaker AI Managed Policy to your IAM role, you must give the role access to the following permissions:
- sagemaker:UpdateEndpoint – Provides permissions to update an existing endpoint, such as changing the endpoint's instance type or count
- sagemaker:UpdateEndpointWeightsAndCapacities – Provides permissions to update the variant weights and instance counts of the endpoint's production variants
- sagemaker:DescribeEndpoint – Provides permissions to describe the current configuration of the endpoint, which is often required before the update
Additionally, you might need permissions to describe and list the endpoints and endpoint configurations.
You can run into a ValueError, such as the following:
ValueError: Endpoint with name 'abc' does not exist; please use an existing endpoint name
The error indicates that the specified endpoint name does not match any existing endpoints in your AWS account. Use the following procedure to troubleshoot the error:
To troubleshoot a ValueError
- Use the following code to list all of your endpoints:
  import sagemaker
  sagemaker_session = sagemaker.Session()
  # List all endpoints
  endpoints = sagemaker_session.sagemaker_client.list_endpoints()
  print(endpoints)
- Verify that the endpoint you've specified to the update_endpoint function is in the list.
- Make sure that you're operating in the correct AWS Region. SageMaker AI endpoints are region-specific.
- Make sure that the IAM role that you're using has permissions to list, describe, or update the endpoints.
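A single `list_endpoints` call returns only one page of results, so an endpoint can appear to be missing in an account that has many endpoints. The following sketch follows the `NextToken` value across pages and checks for an exact name match. It assumes a Boto3 SageMaker client, such as `sagemaker_session.sagemaker_client`; the helper names are hypothetical.

```python
def find_endpoint_names(sagemaker_client, name_contains=None):
    """Return the names of all endpoints, following pagination tokens."""
    request = {"MaxResults": 100}
    if name_contains:
        request["NameContains"] = name_contains

    names = []
    while True:
        page = sagemaker_client.list_endpoints(**request)
        names.extend(endpoint["EndpointName"] for endpoint in page.get("Endpoints", []))
        token = page.get("NextToken")
        if not token:
            return names
        # Ask for the next page on the following iteration
        request["NextToken"] = token


def endpoint_exists(sagemaker_client, endpoint_name):
    """Return True if an endpoint with exactly this name is listed."""
    return endpoint_name in find_endpoint_names(sagemaker_client, endpoint_name)
```

If `endpoint_exists` returns `False` for a name that you expect to exist, check the Region that the client was created with before you investigate permissions.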
Guidance on exception handling
If you can't find information to help you fix your specific issue, the following code examples show patterns that you can adapt to handle exceptions.
The following is a generic example that you can use to catch most exceptions.
import sagemaker
from botocore.exceptions import ParamValidationError, ClientError

try:
    # some_api_call is a placeholder for the SDK call that you're making
    sagemaker.some_api_call(SomeParam='some_param')
except ClientError as error:
    # Put your error handling logic here
    raise error
except ParamValidationError as error:
    raise ValueError('The parameters you provided are incorrect: {}'.format(error))
except ValueError as error:
    # Catch generic ValueError exceptions
    print('Invalid value: {}'.format(error))
    raise
There are two main categories of errors:
- Errors specific to the SageMaker Python SDK
- Errors specific to the underlying AWS service
Errors specific to the underlying AWS service are always botocore.exceptions.ClientError exceptions. The botocore.exceptions.ClientError has an Error object and a ResponseMetadata object. The following shows the template of a client error:
{
    'Error': {
        'Code': 'SomeServiceException',
        'Message': 'Details/context around the exception or error'
    },
    'ResponseMetadata': {
        'RequestId': '1234567890ABCDEF',
        'HostId': 'host ID data will appear here as a hash',
        'HTTPStatusCode': 400,
        'HTTPHeaders': {'header metadata key/values will appear here'},
        'RetryAttempts': 0
    }
}
The following is an example of the specific error handling that you can do with the botocore.exceptions.ClientError:
import botocore.exceptions
import sagemaker

try:
    # some_api_call is a placeholder for the SDK call that you're making
    sagemaker.some_api_call(SomeParam='some_param')
except botocore.exceptions.ClientError as err:
    if err.response['Error']['Code'] == 'InternalError':  # Generic error
        # We grab the message, request ID, and HTTP code to give to customer support
        print('Error Message: {}'.format(err.response['Error']['Message']))
        print('Request ID: {}'.format(err.response['ResponseMetadata']['RequestId']))
        print('Http code: {}'.format(err.response['ResponseMetadata']['HTTPStatusCode']))
        raise err
    elif err.response['Error']['Code'] == 'ValidationException':
        raise ValueError(err.response['Error']['Message'])
    else:
        # Re-raise any other error code so that it isn't silently swallowed
        raise
For more information about how you can handle ClientError exceptions, see Parsing error responses and catching exceptions from AWS services.
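If you handle ClientError exceptions in several places, it can help to extract the fields of the error template once and reuse them for logging or support requests. The following is a minimal sketch; `summarize_client_error` is a hypothetical helper that only assumes the exception has a `response` dictionary shaped like the template shown earlier in this section.

```python
def summarize_client_error(error):
    """Extract the most useful diagnostic fields from a ClientError-like exception.

    Missing keys fall back to defaults, so the helper doesn't raise a
    KeyError while you're already handling another error.
    """
    response = getattr(error, "response", None) or {}
    details = response.get("Error", {})
    metadata = response.get("ResponseMetadata", {})
    return {
        "code": details.get("Code", "Unknown"),
        "message": details.get("Message", ""),
        "request_id": metadata.get("RequestId"),
        "http_status": metadata.get("HTTPStatusCode"),
    }
```

Inside an `except botocore.exceptions.ClientError as err:` block, you could log `summarize_client_error(err)` before deciding whether to re-raise, retry, or convert the error into a ValueError.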