Configure a SageMaker Clarify Processing Job
To analyze your data and models for bias and explainability using SageMaker Clarify, you must
configure a SageMaker Clarify processing job. This guide shows how to specify the input dataset name,
analysis configuration file name, and output location for a processing job. To configure the
processing container, job inputs, outputs, resources, and other parameters, you have two
options. You can use either the SageMaker CreateProcessingJob API or the SageMaker
Python SDK API SageMaker ClarifyProcessor.
For information about parameters that are common to all processing jobs, see Amazon SageMaker API Reference.
The following instructions show how to provide each portion of the SageMaker Clarify specific
configuration using the CreateProcessingJob API.
-
Input the uniform resource identifier (URI) of a SageMaker Clarify container image inside the AppSpecification parameter, as shown in the following code example.
{ "ImageUri": "the-clarify-container-image-uri" }
Note
The URI must identify a pre-built SageMaker Clarify container image. ContainerEntrypoint and ContainerArguments are not supported. For more information about SageMaker Clarify container images, see Get Started with a SageMaker Clarify Container.
-
Specify both the configuration for your analysis and parameters for your input dataset inside the ProcessingInputs parameter.
-
Specify the location of the JSON analysis configuration file, which includes the parameters for bias analysis and explainability analysis. The InputName parameter of the ProcessingInput object must be analysis_config, as shown in the following code example.
{
  "InputName": "analysis_config",
  "S3Input": {
    "S3Uri": "s3://your-bucket/analysis_config.json",
    "S3DataType": "S3Prefix",
    "S3InputMode": "File",
    "LocalPath": "/opt/ml/processing/input/config"
  }
}
For more information about the schema of the analysis configuration file, see Configure the Analysis.
-
Specify the location of the input dataset. The InputName parameter of the ProcessingInput object must be dataset. This ProcessingInput object is optional if you have provided the "dataset_uri" in the analysis configuration file. The following values are required in the S3Input configuration.
-
S3Uri can be either an Amazon S3 object or an S3 prefix.
-
S3InputMode must be of type File.
-
S3CompressionType must be of type None (the default value).
-
S3DataDistributionType must be of type FullyReplicated (the default value).
-
S3DataType can be either S3Prefix or ManifestFile. To use ManifestFile, the S3Uri parameter should specify the location of a manifest file that follows the schema from the SageMaker API Reference section S3Uri. This manifest file must list the S3 objects that contain the input data for the job.
The following code shows an example of an input configuration.
{
  "InputName": "dataset",
  "S3Input": {
    "S3Uri": "s3://your-bucket/your-dataset.csv",
    "S3DataType": "S3Prefix",
    "S3InputMode": "File",
    "LocalPath": "/opt/ml/processing/input/data"
  }
}
-
Specify the configuration for the output of the processing job inside the ProcessingOutputConfig parameter. A single ProcessingOutput object is required in the Outputs configuration. The following are required of the output configuration:
-
OutputName must be analysis_result.
-
S3Uri must be an S3 prefix to the output location.
-
S3UploadMode must be set to EndOfJob.
The following code shows an example of an output configuration.
{
  "Outputs": [{
    "OutputName": "analysis_result",
    "S3Output": {
      "S3Uri": "s3://your-bucket/result/",
      "S3UploadMode": "EndOfJob",
      "LocalPath": "/opt/ml/processing/output"
    }
  }]
}
-
Specify the ClusterConfig configuration for the resources that you use in your processing job inside the ProcessingResources parameter. The following parameters are required inside the ClusterConfig object.
-
InstanceCount specifies the number of compute instances in the cluster that runs the processing job. Specify a value greater than 1 to activate distributed processing.
-
InstanceType refers to the resources that run your processing job. Because SageMaker SHAP analysis is compute-intensive, using an instance type that is optimized for compute should improve runtime for analysis. The SageMaker Clarify processing job doesn't use GPUs.
The following code shows an example of resource configuration.
{
  "ClusterConfig": {
    "InstanceCount": 1,
    "InstanceType": "ml.m5.xlarge",
    "VolumeSizeInGB": 20
  }
}
-
Specify the configuration of the network that you use in your processing job inside the NetworkConfig object. The following values are required in the configuration.
-
EnableNetworkIsolation must be set to False (the default) so that SageMaker Clarify can invoke an endpoint, if necessary, for predictions.
-
If the model or endpoint that you provided to the SageMaker Clarify job is within an Amazon Virtual Private Cloud (Amazon VPC), then the SageMaker Clarify job must also be in the same VPC. Specify the VPC using VpcConfig. Additionally, the VPC must have endpoints to an Amazon S3 bucket, the SageMaker service, and the SageMaker Runtime service.
If distributed processing is activated, you must also allow communication between different instances in the same processing job. Configure a rule for your security group that allows inbound connections between members of the same security group. For more information, see Give Amazon SageMaker Clarify Jobs Access to Resources in Your Amazon VPC.
The following code gives an example of a network configuration.
{ "EnableNetworkIsolation": False, "VpcConfig": { ... } }
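The network settings described above can be assembled with a small helper. This is only a sketch: the function name clarify_network_config is hypothetical, and the subnet and security group IDs are placeholders. The Subnets and SecurityGroupIds keys are the two fields of VpcConfig in the CreateProcessingJob request.

```python
from typing import Optional, Sequence


def clarify_network_config(
    subnets: Optional[Sequence[str]] = None,
    security_group_ids: Optional[Sequence[str]] = None,
) -> dict:
    """Build a NetworkConfig dictionary for a SageMaker Clarify job.

    Hypothetical helper for illustration; it is not part of any AWS SDK.
    """
    # Network isolation stays off so that the job can invoke an endpoint.
    config = {"EnableNetworkIsolation": False}
    if subnets or security_group_ids:
        # VpcConfig needs both lists, so reject a half-specified VPC.
        if not (subnets and security_group_ids):
            raise ValueError("VpcConfig needs both subnets and security groups")
        config["VpcConfig"] = {
            "Subnets": list(subnets),
            "SecurityGroupIds": list(security_group_ids),
        }
    return config


# Placeholder IDs; use the VPC resources of your own model or endpoint.
network_config = clarify_network_config(
    subnets=["subnet-0123456789abcdef0"],
    security_group_ids=["sg-0123456789abcdef0"],
)
```

Calling the helper with no arguments returns only the isolation flag, which matches a job that runs outside a VPC.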
-
-
Set the maximum time that the job will run using the StoppingCondition parameter. The longest that a SageMaker Clarify job can run is 7 days, or 604800 seconds. If the job cannot be completed within this time limit, it is stopped and no analysis results are provided. As an example, the following configuration limits the maximum time that the job can run to 3600 seconds.
{ "MaxRuntimeInSeconds": 3600 }
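Because the limit is expressed in seconds, it can be convenient to derive it from a duration and check it against the 7-day cap before submitting the job. The following sketch assumes a hypothetical helper named stopping_condition; only the MaxRuntimeInSeconds key and the 604800-second limit come from the text above.

```python
from datetime import timedelta

# Longest runtime allowed for a SageMaker Clarify job: 7 days (604800 seconds).
MAX_CLARIFY_RUNTIME = timedelta(days=7)


def stopping_condition(max_runtime: timedelta) -> dict:
    """Convert a duration into a StoppingCondition dictionary."""
    seconds = int(max_runtime.total_seconds())
    if seconds <= 0:
        raise ValueError("max_runtime must be positive")
    if max_runtime > MAX_CLARIFY_RUNTIME:
        raise ValueError("a SageMaker Clarify job can run for at most 7 days")
    return {"MaxRuntimeInSeconds": seconds}


# One hour corresponds to the 3600-second limit used in the example above.
one_hour = stopping_condition(timedelta(hours=1))
```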
-
Specify an IAM role for the RoleArn parameter. The role must have a trust relationship with Amazon SageMaker. It can be used to perform the SageMaker API operations listed in the following table. We recommend using the AmazonSageMakerFullAccess managed policy, which grants full access to SageMaker. For more information on this policy, see AWS managed policy: AmazonSageMakerFullAccess. If you have concerns about granting full access, the minimal permissions required depend on whether you provide a model or an endpoint name. Using an endpoint name allows for granting fewer permissions to SageMaker.
The following table contains API operations used by the SageMaker Clarify processing job. An X under Model name and Endpoint name notes the API operation that is required for each input.
| API Operation | Model name | Endpoint name | What it is used for |
|---|---|---|---|
| | X | | Tags of the job are applied to the shadow endpoint. |
| | X | | Create endpoint config using the model name that you provided. |
| | X | | Create shadow endpoint using the endpoint config. |
| | X | X | Describe the endpoint to check its status. The endpoint must be InService to serve requests. |
| | X | X | Invoke the endpoint for predictions. |
For more information about required permissions, see Amazon SageMaker API Permissions: Actions, Permissions, and Resources Reference.
For more information about passing roles to SageMaker, see Passing Roles.
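As a rough illustration of the role requirements, the sketch below builds a trust policy that lets the SageMaker service assume the role, plus a policy statement that covers the operations in the table. The IAM action names are my assumed mapping of the table rows and should be checked against the permissions reference. The helper name clarify_api_statement is hypothetical, "Resource": "*" is a placeholder to narrow in practice, and access to Amazon S3 is not covered here.

```python
# Trust policy that allows the SageMaker service to assume the role.
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Assumed IAM actions for the table rows that apply to both kinds of input.
ENDPOINT_ACTIONS = ["sagemaker:DescribeEndpoint", "sagemaker:InvokeEndpoint"]
# Assumed IAM actions for the rows that apply only when a model name is given.
MODEL_ONLY_ACTIONS = [
    "sagemaker:AddTags",
    "sagemaker:CreateEndpointConfig",
    "sagemaker:CreateEndpoint",
]


def clarify_api_statement(uses_model_name: bool) -> dict:
    """Return a policy statement for the SageMaker operations a job needs."""
    actions = list(ENDPOINT_ACTIONS)
    if uses_model_name:
        # A model name requires a shadow endpoint, so more actions are needed.
        actions = MODEL_ONLY_ACTIONS + actions
    # "*" is a placeholder; restrict the resources for real use.
    return {"Effect": "Allow", "Action": actions, "Resource": "*"}


permissions_policy = {
    "Version": "2012-10-17",
    "Statement": [clarify_api_statement(uses_model_name=False)],
}
```

Passing uses_model_name=False yields the smaller set of actions, which reflects the point above that an endpoint name needs fewer permissions.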
After you have the individual pieces of the processing job configuration, combine them to configure the job.
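Before submitting the combined request, it can help to check it against the SageMaker Clarify specific rules from the steps above. The following is a minimal sketch of such a check, not an official validator: the function name check_clarify_request is hypothetical, and it reads "a single ProcessingOutput object is required" as meaning exactly one output.

```python
MAX_RUNTIME_SECONDS = 7 * 24 * 60 * 60  # 7 days, or 604800 seconds


def check_clarify_request(request: dict) -> list:
    """Return a list of problems found in a CreateProcessingJob request."""
    problems = []

    app_spec = request.get("AppSpecification", {})
    if not app_spec.get("ImageUri"):
        problems.append("AppSpecification.ImageUri is missing")
    for key in ("ContainerEntrypoint", "ContainerArguments"):
        if key in app_spec:
            problems.append(f"AppSpecification.{key} is not supported")

    inputs = {i.get("InputName"): i for i in request.get("ProcessingInputs", [])}
    if "analysis_config" not in inputs:
        problems.append("ProcessingInputs has no input named analysis_config")
    if "dataset" in inputs:  # optional when dataset_uri is in the analysis config
        s3_input = inputs["dataset"].get("S3Input", {})
        if s3_input.get("S3InputMode") != "File":
            problems.append("dataset S3InputMode must be File")
        if s3_input.get("S3DataType") not in ("S3Prefix", "ManifestFile"):
            problems.append("dataset S3DataType must be S3Prefix or ManifestFile")
        if s3_input.get("S3CompressionType", "None") != "None":
            problems.append("dataset S3CompressionType must be None")
        distribution = s3_input.get("S3DataDistributionType", "FullyReplicated")
        if distribution != "FullyReplicated":
            problems.append("dataset S3DataDistributionType must be FullyReplicated")

    outputs = request.get("ProcessingOutputConfig", {}).get("Outputs", [])
    if len(outputs) != 1:
        problems.append("ProcessingOutputConfig needs exactly one output")
    else:
        if outputs[0].get("OutputName") != "analysis_result":
            problems.append("OutputName must be analysis_result")
        if outputs[0].get("S3Output", {}).get("S3UploadMode") != "EndOfJob":
            problems.append("S3UploadMode must be EndOfJob")

    if request.get("NetworkConfig", {}).get("EnableNetworkIsolation", False):
        problems.append("EnableNetworkIsolation must be False")

    runtime = request.get("StoppingCondition", {}).get("MaxRuntimeInSeconds")
    if runtime is not None and runtime > MAX_RUNTIME_SECONDS:
        problems.append("MaxRuntimeInSeconds exceeds 604800")

    return problems


# A deliberately incomplete draft to show the kind of feedback returned.
draft = {
    "AppSpecification": {"ImageUri": "the-clarify-container-image-uri"},
    "ProcessingInputs": [],
    "ProcessingOutputConfig": {"Outputs": []},
    "NetworkConfig": {"EnableNetworkIsolation": True},
}
problems = check_clarify_request(draft)
```

An empty list means that none of these particular rules were violated. It does not guarantee that the service accepts the request.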
The following code example shows how to launch a SageMaker Clarify processing job using the AWS SDK for Python.
import boto3

# Create a SageMaker client that uses your default credentials and Region.
sagemaker_client = boto3.client("sagemaker")

sagemaker_client.create_processing_job(
    ProcessingJobName="your-clarify-job-name",
    AppSpecification={
        "ImageUri": "the-clarify-container-image-uri",
    },
    ProcessingInputs=[
        {
            "InputName": "analysis_config",
            "S3Input": {
                "S3Uri": "s3://your-bucket/analysis_config.json",
                "S3DataType": "S3Prefix",
                "S3InputMode": "File",
                "LocalPath": "/opt/ml/processing/input/config",
            },
        },
        {
            "InputName": "dataset",
            "S3Input": {
                "S3Uri": "s3://your-bucket/your-dataset.csv",
                "S3DataType": "S3Prefix",
                "S3InputMode": "File",
                "LocalPath": "/opt/ml/processing/input/data",
            },
        },
    ],
    ProcessingOutputConfig={
        "Outputs": [{
            "OutputName": "analysis_result",
            "S3Output": {
                "S3Uri": "s3://your-bucket/result/",
                "S3UploadMode": "EndOfJob",
                "LocalPath": "/opt/ml/processing/output",
            },
        }],
    },
    ProcessingResources={
        "ClusterConfig": {
            "InstanceCount": 1,
            "InstanceType": "ml.m5.xlarge",
            "VolumeSizeInGB": 20,
        },
    },
    NetworkConfig={
        "EnableNetworkIsolation": False,
        "VpcConfig": { ... },
    },
    StoppingCondition={
        "MaxRuntimeInSeconds": 3600,
    },
    RoleArn="arn:aws:iam::<your-account-id>:role/service-role/AmazonSageMaker-ExecutionRole",
)
For an example notebook with instructions for running a SageMaker Clarify processing job using the
AWS SDK for Python, see Fairness and Explainability with SageMaker Clarify using AWS SDK for Python.
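After a job has been launched with create_processing_job, its progress can be followed with the describe_processing_job operation of the same client. The following is a minimal polling sketch. The function name wait_for_processing_job and the 30-second interval are illustrative choices, and the function expects a boto3 SageMaker client to be passed in.

```python
import time


def wait_for_processing_job(sagemaker_client, job_name, poll_seconds=30):
    """Poll a processing job until it reaches a terminal status.

    Returns a (status, failure_reason) tuple; failure_reason is None
    when the response does not include one.
    """
    while True:
        description = sagemaker_client.describe_processing_job(
            ProcessingJobName=job_name
        )
        status = description["ProcessingJobStatus"]
        # Completed, Failed, and Stopped are the terminal job statuses.
        if status in ("Completed", "Failed", "Stopped"):
            return status, description.get("FailureReason")
        time.sleep(poll_seconds)


# Example call, assuming sagemaker_client = boto3.client("sagemaker"):
# status, reason = wait_for_processing_job(sagemaker_client, "your-clarify-job-name")
```

The analysis results are uploaded at the end of the job, so the output location is only worth reading after the status is Completed.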
You can also configure a SageMaker Clarify processing job using the SageMaker ClarifyProcessor