Configure a SageMaker Clarify Processing Job - Amazon SageMaker

To analyze your data and models for bias and explainability using SageMaker Clarify, you must configure a SageMaker Clarify processing job. This guide shows how to specify the input dataset name, analysis configuration file name, and output location for a processing job. To configure the processing container, job inputs, outputs, resources, and other parameters, you have two options: use the SageMaker CreateProcessingJob API, or use the SageMakerClarifyProcessor class in the SageMaker Python SDK.

For information about parameters that are common to all processing jobs, see Amazon SageMaker API Reference.

The following instructions show how to provide each portion of the SageMaker Clarify specific configuration using the CreateProcessingJob API.

  1. Input the uniform resource identifier (URI) of a SageMaker Clarify container image inside the AppSpecification parameter, as shown in the following code example.

    { "ImageUri": "the-clarify-container-image-uri" }

    Note: The URI must identify a pre-built SageMaker Clarify container image. ContainerEntrypoint and ContainerArguments are not supported. For more information about SageMaker Clarify container images, see Get Started with a SageMaker Clarify Container.
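
The constraint in the note can be checked before the job is submitted. The following is a minimal sketch, assuming a hypothetical helper named `check_app_specification` (it is not part of any AWS SDK) and the placeholder image URI from the example above.

```python
# Keys that the note above says SageMaker Clarify does not support.
UNSUPPORTED_KEYS = ("ContainerEntrypoint", "ContainerArguments")


def check_app_specification(app_specification):
    """Return the dict unchanged if it is usable for a SageMaker Clarify job.

    Hypothetical helper for illustration only.
    """
    if not app_specification.get("ImageUri"):
        raise ValueError("AppSpecification requires an ImageUri")
    present = [key for key in UNSUPPORTED_KEYS if key in app_specification]
    if present:
        raise ValueError(
            "Not supported by SageMaker Clarify: " + ", ".join(present)
        )
    return app_specification


# Placeholder URI; replace it with the URI of a pre-built Clarify image.
app_specification = check_app_specification(
    {"ImageUri": "the-clarify-container-image-uri"}
)
```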

  2. Specify both the configuration for your analysis and parameters for your input dataset inside the ProcessingInputs parameter.

    1. Specify the location of the JSON analysis configuration file, which includes the parameters for bias analysis and explainability analysis. The InputName parameter of the ProcessingInput object must be analysis_config as shown in the following code example.

      {
        "InputName": "analysis_config",
        "S3Input": {
          "S3Uri": "s3://your-bucket/analysis_config.json",
          "S3DataType": "S3Prefix",
          "S3InputMode": "File",
          "LocalPath": "/opt/ml/processing/input/config"
        }
      }

      For more information about the schema of the analysis configuration file, see Configure the Analysis.

    2. Specify the location of the input dataset. The InputName parameter of the ProcessingInput object must be dataset. This input is optional if you have provided dataset_uri in the analysis configuration file. The following values are required in the S3Input configuration.

      1. S3Uri can be either an Amazon S3 object or an S3 prefix.

      2. S3InputMode must be of type File.

      3. S3CompressionType must be of type None (the default value).

      4. S3DataDistributionType must be of type FullyReplicated (the default value).

      5. S3DataType can be either S3Prefix or ManifestFile. To use ManifestFile, the S3Uri parameter should specify the location of a manifest file that follows the schema from the SageMaker API Reference section S3Uri. This manifest file must list the S3 objects that contain the input data for the job.

      The following code shows an example of an input configuration.

      {
        "InputName": "dataset",
        "S3Input": {
          "S3Uri": "s3://your-bucket/your-dataset.csv",
          "S3DataType": "S3Prefix",
          "S3InputMode": "File",
          "LocalPath": "/opt/ml/processing/input/data"
        }
      }
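
The two inputs in this step can be assembled in one place. The following is a minimal sketch, assuming a hypothetical helper named `build_clarify_inputs`; the bucket names and local paths are the placeholders used in the examples above. The dataset input is left out when no dataset URI is passed, which matches the case where dataset_uri is set in the analysis configuration file.

```python
def build_clarify_inputs(analysis_config_uri, dataset_uri=None,
                         dataset_data_type="S3Prefix"):
    """Build the ProcessingInputs list for a SageMaker Clarify job.

    Hypothetical helper for illustration only.
    """
    if dataset_data_type not in ("S3Prefix", "ManifestFile"):
        raise ValueError("S3DataType must be S3Prefix or ManifestFile")

    inputs = [{
        "InputName": "analysis_config",  # required name
        "S3Input": {
            "S3Uri": analysis_config_uri,
            "S3DataType": "S3Prefix",
            "S3InputMode": "File",
            "LocalPath": "/opt/ml/processing/input/config",
        },
    }]

    if dataset_uri is not None:
        inputs.append({
            "InputName": "dataset",  # required name
            "S3Input": {
                "S3Uri": dataset_uri,
                "S3DataType": dataset_data_type,
                "S3InputMode": "File",
                # The next two values are the defaults, set explicitly here.
                "S3CompressionType": "None",
                "S3DataDistributionType": "FullyReplicated",
                "LocalPath": "/opt/ml/processing/input/data",
            },
        })
    return inputs


processing_inputs = build_clarify_inputs(
    "s3://your-bucket/analysis_config.json",
    dataset_uri="s3://your-bucket/your-dataset.csv",
)
```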
  3. Specify the configuration for the output of the processing job inside the ProcessingOutputConfig parameter. A single ProcessingOutput object is required in the Outputs configuration. The following are required of the output configuration:

    1. OutputName must be analysis_result.

    2. S3Uri must be an S3 prefix to the output location.

    3. S3UploadMode must be set to EndOfJob.

    The following code shows an example of an output configuration.

    {
      "Outputs": [{
        "OutputName": "analysis_result",
        "S3Output": {
          "S3Uri": "s3://your-bucket/result/",
          "S3UploadMode": "EndOfJob",
          "LocalPath": "/opt/ml/processing/output"
        }
      }]
    }
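
Because the output name and upload mode are fixed, only the S3 prefix varies between jobs. The following is a minimal sketch, assuming a hypothetical helper named `build_clarify_output_config` and the placeholder bucket from the example above.

```python
def build_clarify_output_config(result_prefix):
    """Build ProcessingOutputConfig with the single required output.

    Hypothetical helper for illustration only.
    """
    if not result_prefix.startswith("s3://"):
        raise ValueError("S3Uri must be an S3 prefix such as s3://your-bucket/result/")
    return {
        "Outputs": [{
            "OutputName": "analysis_result",  # required name
            "S3Output": {
                "S3Uri": result_prefix,
                "S3UploadMode": "EndOfJob",  # required mode
                "LocalPath": "/opt/ml/processing/output",
            },
        }]
    }


output_config = build_clarify_output_config("s3://your-bucket/result/")
```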
  4. Specify the ClusterConfig object for the resources that you use in your processing job inside the ProcessingResources parameter. The following parameters are required inside the ClusterConfig object.

    1. InstanceCount specifies the number of compute instances in the cluster that runs the processing job. Specify a value greater than 1 to activate distributed processing.

    2. InstanceType specifies the type of compute instance that runs your processing job. Because SHAP analysis in SageMaker Clarify is compute-intensive, a compute-optimized instance type should reduce the analysis runtime. The SageMaker Clarify processing job doesn't use GPUs.

    The following code shows an example of resource configuration.

    { "ClusterConfig": { "InstanceCount": 1, "InstanceType": "ml.m5.xlarge", "VolumeSizeInGB": 20 } }
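
The resource choices in this step can be sketched as follows, assuming hypothetical helpers named `build_processing_resources` and `is_distributed`. The GPU check is only a naming heuristic introduced for this sketch (it assumes that instance types starting with ml.p or ml.g carry GPUs); it is not an AWS rule.

```python
import warnings

# Assumption for this sketch: these instance-family prefixes denote GPU instances.
GPU_FAMILY_PREFIXES = ("ml.p", "ml.g")


def build_processing_resources(instance_count=1, instance_type="ml.m5.xlarge",
                               volume_size_gb=20):
    """Build ProcessingResources for a SageMaker Clarify job (illustrative helper)."""
    if instance_count < 1:
        raise ValueError("InstanceCount must be at least 1")
    if instance_type.startswith(GPU_FAMILY_PREFIXES):
        # The job does not use GPUs, so a GPU instance adds cost without benefit.
        warnings.warn("SageMaker Clarify processing jobs do not use GPUs")
    return {
        "ClusterConfig": {
            "InstanceCount": instance_count,
            "InstanceType": instance_type,
            "VolumeSizeInGB": volume_size_gb,
        }
    }


def is_distributed(resources):
    """Distributed processing is active when more than one instance is requested."""
    return resources["ClusterConfig"]["InstanceCount"] > 1


single = build_processing_resources()
# A compute-optimized type with several instances, for a larger SHAP analysis.
cluster = build_processing_resources(instance_count=4, instance_type="ml.c5.xlarge")
```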
  5. Specify the configuration of the network that you use in your processing job inside the NetworkConfig object. The following values are required in the configuration.

    1. EnableNetworkIsolation must be set to False (default) so that SageMaker Clarify can invoke an endpoint, if necessary, for predictions.

    2. If the model or endpoint that you provided to the SageMaker Clarify job is within an Amazon Virtual Private Cloud (Amazon VPC), then the SageMaker Clarify job must also be in the same VPC. Specify the VPC using VpcConfig. Additionally, the VPC must have endpoints to Amazon S3, the SageMaker service, and the SageMaker Runtime service.

      If distributed processing is activated, you must also allow communication between different instances in the same processing job. Configure a rule for your security group that allows inbound connections between members of the same security group. For more information, see Give Amazon SageMaker Clarify Jobs Access to Resources in Your Amazon VPC.

    The following code gives an example of a network configuration.

    { "EnableNetworkIsolation": False, "VpcConfig": { ... } }
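
The network settings can be sketched as follows, assuming a hypothetical helper named `build_network_config`. The SecurityGroupIds and Subnets keys are the fields of the VpcConfig type in the CreateProcessingJob API; the IDs in the usage example are made-up placeholders, not values from this guide.

```python
def build_network_config(security_group_ids=None, subnets=None):
    """Build NetworkConfig for a SageMaker Clarify job (illustrative helper).

    Network isolation stays off so that the job can invoke an endpoint.
    VpcConfig is added only when VPC details are supplied.
    """
    config = {"EnableNetworkIsolation": False}
    if security_group_ids or subnets:
        if not (security_group_ids and subnets):
            raise ValueError("VpcConfig needs both SecurityGroupIds and Subnets")
        config["VpcConfig"] = {
            "SecurityGroupIds": list(security_group_ids),
            "Subnets": list(subnets),
        }
    return config


# Without a VPC.
network_config = build_network_config()

# With a VPC; both IDs below are placeholders.
vpc_network_config = build_network_config(
    security_group_ids=["sg-0123456789abcdef0"],
    subnets=["subnet-0123456789abcdef0"],
)
```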
  6. Set the maximum time that the job will run using the StoppingCondition parameter. The longest that a SageMaker Clarify job can run is 7 days or 604800 seconds. If the job cannot be completed within this time limit, it will be stopped and no analysis results will be provided. As an example, the following configuration limits the maximum time that the job can run to 3600 seconds.

    { "MaxRuntimeInSeconds": 3600 }
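
The 7-day ceiling is easy to exceed when a limit is written in hours or days. The following is a minimal sketch, assuming a hypothetical helper named `build_stopping_condition` that converts a duration to seconds and rejects values above 604800.

```python
from datetime import timedelta

# Longest run time stated above: 7 days, or 604800 seconds.
MAX_CLARIFY_RUNTIME_SECONDS = 7 * 24 * 60 * 60


def build_stopping_condition(days=0, hours=0, minutes=0, seconds=0):
    """Build StoppingCondition from a duration (illustrative helper)."""
    total = int(timedelta(days=days, hours=hours, minutes=minutes,
                          seconds=seconds).total_seconds())
    if not 1 <= total <= MAX_CLARIFY_RUNTIME_SECONDS:
        raise ValueError(
            f"MaxRuntimeInSeconds must be between 1 and {MAX_CLARIFY_RUNTIME_SECONDS}"
        )
    return {"MaxRuntimeInSeconds": total}


stopping_condition = build_stopping_condition(hours=1)  # one hour is 3600 seconds
```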
  7. Specify an IAM role for the RoleArn parameter. The role must have a trust relationship with Amazon SageMaker and permission to perform the SageMaker API operations listed in the following table. We recommend using the AmazonSageMakerFullAccess managed policy, which grants full access to SageMaker. For more information on this policy, see AWS managed policy: AmazonSageMakerFullAccess. If you have concerns about granting full access, the minimal permissions required depend on whether you provide a model name or an endpoint name. Using an endpoint name requires fewer permissions.

    The following table contains API operations used by the SageMaker Clarify processing job. An X in the Model name or Endpoint name column indicates that the API operation is required for that input.

    API operation          Model name   Endpoint name   What it is used for
    --------------------   ----------   -------------   -------------------
    ListTags               X                            Tags of the job are applied to the shadow endpoint.
    CreateEndpointConfig   X                            Creates the endpoint configuration using the model name that you provided.
    CreateEndpoint         X                            Creates the shadow endpoint using the endpoint configuration.
    DescribeEndpoint       X            X               Describes the endpoint to check its status. The endpoint must be InService to serve requests.
    InvokeEndpoint         X            X               Invokes the endpoint for predictions.

    For more information about required permissions, see Amazon SageMaker API Permissions: Actions, Permissions, and Resources Reference.

    For more information about passing roles to SageMaker, see Passing Roles.
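
The trust relationship and the table of API operations can be written as IAM policy documents. The following is a minimal sketch, assuming a hypothetical helper named `build_clarify_policy`. It covers only the SageMaker operations in the table; a working role will need further permissions, such as access to the S3 locations used by the job, and the wildcard resource should be narrowed for production use.

```python
import json

# Trust policy that lets the SageMaker service assume the role.
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "sagemaker.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# Operations needed when the job is given an endpoint name.
ENDPOINT_NAME_ACTIONS = [
    "sagemaker:DescribeEndpoint",
    "sagemaker:InvokeEndpoint",
]

# Operations needed when the job is given a model name (shadow endpoint).
MODEL_NAME_ACTIONS = [
    "sagemaker:ListTags",
    "sagemaker:CreateEndpointConfig",
    "sagemaker:CreateEndpoint",
    *ENDPOINT_NAME_ACTIONS,
]


def build_clarify_policy(uses_model_name, resources="*"):
    """Build a permissions policy for the operations in the table (illustrative)."""
    actions = MODEL_NAME_ACTIONS if uses_model_name else ENDPOINT_NAME_ACTIONS
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": list(actions),
            "Resource": resources,
        }],
    }


print(json.dumps(build_clarify_policy(uses_model_name=False), indent=2))
```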

    After you have the individual pieces of the processing job configuration, combine them to configure the job.

The following code example shows how to launch a SageMaker Clarify processing job using the AWS SDK for Python.

    import boto3

    # Low-level SageMaker client from the AWS SDK for Python (Boto3).
    sagemaker_client = boto3.client("sagemaker")

    sagemaker_client.create_processing_job(
        ProcessingJobName="your-clarify-job-name",
        AppSpecification={
            "ImageUri": "the-clarify-container-image-uri",
        },
        ProcessingInputs=[
            {
                "InputName": "analysis_config",
                "S3Input": {
                    "S3Uri": "s3://your-bucket/analysis_config.json",
                    "S3DataType": "S3Prefix",
                    "S3InputMode": "File",
                    "LocalPath": "/opt/ml/processing/input/config",
                },
            },
            {
                "InputName": "dataset",
                "S3Input": {
                    "S3Uri": "s3://your-bucket/your-dataset.csv",
                    "S3DataType": "S3Prefix",
                    "S3InputMode": "File",
                    "LocalPath": "/opt/ml/processing/input/data",
                },
            },
        ],
        ProcessingOutputConfig={
            "Outputs": [{
                "OutputName": "analysis_result",
                "S3Output": {
                    "S3Uri": "s3://your-bucket/result/",
                    "S3UploadMode": "EndOfJob",
                    "LocalPath": "/opt/ml/processing/output",
                },
            }],
        },
        ProcessingResources={
            "ClusterConfig": {
                "InstanceCount": 1,
                "InstanceType": "ml.m5.xlarge",
                "VolumeSizeInGB": 20,
            },
        },
        NetworkConfig={
            "EnableNetworkIsolation": False,
            # Supply your VPC settings here, or omit VpcConfig if you don't use a VPC.
            "VpcConfig": { ... },
        },
        StoppingCondition={
            "MaxRuntimeInSeconds": 3600,
        },
        RoleArn="arn:aws:iam::<your-account-id>:role/service-role/AmazonSageMaker-ExecutionRole",
    )
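
After a job is created, its progress can be followed with the DescribeProcessingJob operation. The following is a minimal sketch, assuming a hypothetical helper named `wait_for_processing_job`. The client is passed in so that the function works with any object that provides `describe_processing_job`, such as a Boto3 SageMaker client; the polling interval and the `max_polls` limit are choices made for this sketch.

```python
import time

# Statuses after which the job no longer changes.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_processing_job(client, job_name, poll_seconds=30,
                            max_polls=None, sleep=time.sleep):
    """Poll a processing job until it reaches a terminal status.

    Returns the last DescribeProcessingJob response. Raises TimeoutError
    if max_polls is set and the job is still running after that many polls.
    """
    polls = 0
    while True:
        description = client.describe_processing_job(ProcessingJobName=job_name)
        if description["ProcessingJobStatus"] in TERMINAL_STATUSES:
            return description
        polls += 1
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError(f"{job_name} still running after {polls} polls")
        sleep(poll_seconds)


# Usage with a real client (requires AWS credentials):
#   import boto3
#   result = wait_for_processing_job(boto3.client("sagemaker"), "your-clarify-job-name")
#   print(result["ProcessingJobStatus"], result.get("FailureReason", ""))
```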

For an example notebook with instructions for running a SageMaker Clarify processing job using AWS SDK for Python, see Fairness and Explainability with SageMaker Clarify using AWS SDK for Python. Any S3 bucket used in the notebook must be in the same AWS Region as the notebook instance that accesses it.

You can also configure a SageMaker Clarify processing job using the SageMakerClarifyProcessor class in the SageMaker Python SDK. For more information, see Run SageMaker Clarify Processing Jobs for Bias Analysis and Explainability.