Partial Dependence Plots: Analysis Configuration and Output - Amazon SageMaker

Partial dependence plots (PDPs) show the dependence of the predicted target response on a set of input features of interest. The dependence is marginalized over the values of all other input features, which are referred to as the complement features. Intuitively, you can interpret the partial dependence as the expected target response as a function of each input feature of interest.
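The marginalization described above can be sketched in a few lines of Python. This is a minimal illustration of the general technique, not SageMaker Clarify's implementation: the names partial_dependence and toy_predict and the toy dataset are hypothetical. For each grid value, the feature of interest is fixed to that value in every row, and the model predictions are averaged over the dataset, which averages out the complement features.

```python
import numpy as np

def partial_dependence(predict, X, feature_idx, grid):
    """Average prediction over the dataset with one feature fixed to each grid value."""
    averages = []
    for value in grid:
        X_fixed = X.copy()
        X_fixed[:, feature_idx] = value  # fix the feature of interest in every row
        # Averaging over all rows marginalizes over the complement features.
        averages.append(float(predict(X_fixed).mean()))
    return averages

# Hypothetical toy model: the prediction depends linearly on both columns.
def toy_predict(X):
    return 2.0 * X[:, 0] + X[:, 1]

# Hypothetical dataset: column 0 is the feature of interest, column 1 is a complement feature.
X = np.array([[0.0, 1.0], [1.0, 3.0], [2.0, 5.0]])
pd_values = partial_dependence(toy_predict, X, feature_idx=0, grid=[0.0, 1.0])
print(pd_values)
```

For this toy model, each value is twice the grid value plus the mean of the complement column.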

Partial dependence plots analysis configuration

To create a partial dependence plot (PDP), Amazon SageMaker Clarify first looks for the feature columns specified in a JSON array in the analysis_config.json file. The other parameters that configure the analysis of a processing job must also be provided in this JSON file. For more information about configuring PDPs and other aspects of an analysis, see Configure the Analysis.

The following code contains an example of a JSON "pdp" object in the "methods" object of an analysis_config.json configuration file.

{
    "dataset_type": ...
    "baseline": [[..]]
    .
    .
    "methods": {
        "shap": {
            "baseline": "..",
            "num_samples": 100
        },
        "pdp": {
            "features": ["Age", "MaturityMonths"],  // The features for which to plot PDPs.
            "grid_resolution": 20,  // Required for numerical columns only. The number of buckets into which the range of values is divided.
            "top_k_features": 10  // Specifies how many of the top features must be used for PDP plots. The default is 10.
        },
        .
        .
    }
    .
    .
}

If "features" is not specified in the "pdp" object but a "shap" configuration is provided, SageMaker Clarify uses the top ten features from the global SHAP results to plot the PDP visualizations.
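The fallback described above can be sketched as follows. This is an illustration under stated assumptions, not Clarify's code: the function resolve_pdp_features is hypothetical, the global SHAP values and the "Income" feature are made up for the example, and ranking features by absolute global SHAP value is an assumption about what "top" means.

```python
def resolve_pdp_features(methods, global_shap=None):
    """Return the features to plot, given the "methods" object of the configuration."""
    pdp = methods.get("pdp", {})
    if "features" in pdp:
        # An explicit feature list takes precedence.
        return list(pdp["features"])
    if "shap" in methods and global_shap:
        # Assumption: "top" means largest absolute global SHAP value.
        top_k = pdp.get("top_k_features", 10)
        ranked = sorted(global_shap, key=lambda name: abs(global_shap[name]), reverse=True)
        return ranked[:top_k]
    return []

# Hypothetical configuration without "features", and made-up global SHAP values.
methods = {
    "shap": {"num_samples": 100},
    "pdp": {"grid_resolution": 20, "top_k_features": 2},
}
global_shap = {"Income": 0.05, "Age": 0.30, "MaturityMonths": 0.12}
selected = resolve_pdp_features(methods, global_shap)
print(selected)
```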

Partial dependence plots analysis output

The following code shows an example of the partial dependence plot (PDP) schema returned in the analysis.json result file. The "pdp" section in this analysis output file contains the information required to generate the PDPs. Each dictionary in the list contains the specification for the PDP of the feature named by feature_name.

The data_type field indicates whether the data is numerical or categorical. The feature_values field contains the values present in the feature. If the data_type inferred by Clarify is categorical, feature_values contains all the unique values that the feature can assume. If the data_type inferred by Clarify is numerical, feature_values contains the central value of each of the buckets generated by Clarify, where grid_resolution sets the number of buckets.
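As a rough illustration of the two cases, the sketch below computes bucket centers for a numerical feature and unique values for a categorical one. It assumes equal-width buckets between a minimum and a maximum, which the text above does not state. The bounds 19 and 75 are hypothetical, chosen only because they give centers of 20.4 and 73.6 at a grid_resolution of 20, matching the first and last feature_values in the example output. The function names are also illustrative.

```python
import numpy as np

def bucket_centers(low, high, grid_resolution):
    """Central value of each of grid_resolution equal-width buckets (assumed layout)."""
    edges = np.linspace(low, high, grid_resolution + 1)  # bucket boundaries
    return (edges[:-1] + edges[1:]) / 2.0                # midpoint of each bucket

def categorical_values(column):
    """All unique values that a categorical feature assumes in the data."""
    return sorted(set(column))

# Hypothetical bounds for Age; grid_resolution of 20 as in the example configuration.
centers = bucket_centers(19.0, 75.0, 20)
# Hypothetical categorical column.
categories = categorical_values(["b", "a", "b", "c"])
print(len(centers), centers[0], centers[-1], categories)
```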

If the partial dependence plots cannot be computed for a particular feature, the feature_values, model_predictions, and data_distribution fields are replaced by the error field, which contains an error message.

{
    "version": "1.0",
    "explanations": {
        "kernel_shap": {
            .
            .
            .
        },
        "pdp": [
            {
                "feature_name": "Age",
                "data_type": "numerical",
                "feature_values": [
                    20.4,
                    23.2,
                    26.0,
                    28.799999999999997,
                    31.599999999999998,
                    34.4,
                    70.8,
                    73.6
                ],
                "model_predictions": [
                    [
                        0.6830344458296895,
                        0.6812452118471265,
                        0.6908621763065458,
                        0.7008252082392573,
                        0.733054383918643,
                        0.7352442337572574,
                        0.7337257475033403,
                        0.7395857129991055
                    ]
                ],
                "data_distribution": [
                    0.13,
                    0.25,
                    0.15,
                    0.35,
                    0.17
                ]
            },
            {
                "feature_name": "text_column",
                "data_type": "free_text",
                "error": "Detected data type is not supported for PDP. PDP can only be computed for numerical or categorical columns"
            }
        ]
    }
}

This PDP schema generates the following partial dependence plot for the Age feature. The plot shows the feature_values along the x-axis and the values in the model_predictions field along the y-axis. Each list in the model_predictions field corresponds to one class in the output from the model.

Figure: Partial dependence plot for Age.

You can view the plot in the report.pdf file in the analysis output path that you provided.
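If you want to work with the "pdp" section directly rather than through the report, the following sketch shows one way to read it. It is not part of SageMaker Clarify: the function pdp_curves is hypothetical, and the sample document is a trimmed copy of the example output above, keeping only the first three points. Entries that carry an error field are collected separately, because they have no feature_values or model_predictions to plot.

```python
import json

def pdp_curves(analysis):
    """Split the "pdp" section into plottable curves and per-feature errors."""
    curves, errors = [], {}
    for entry in analysis["explanations"]["pdp"]:
        name = entry["feature_name"]
        if "error" in entry:
            # No PDP was computed for this feature; keep the message instead.
            errors[name] = entry["error"]
            continue
        # Each list in model_predictions corresponds to one class of the model output.
        for class_index, predictions in enumerate(entry["model_predictions"]):
            curves.append((name, class_index, entry["feature_values"], predictions))
    return curves, errors

# Trimmed sample; in practice, load analysis.json with json.load on the opened file.
sample = json.loads("""
{
  "version": "1.0",
  "explanations": {
    "pdp": [
      {
        "feature_name": "Age",
        "data_type": "numerical",
        "feature_values": [20.4, 23.2, 26.0],
        "model_predictions": [[0.6830344458296895, 0.6812452118471265, 0.6908621763065458]]
      },
      {
        "feature_name": "text_column",
        "data_type": "free_text",
        "error": "Detected data type is not supported for PDP. PDP can only be computed for numerical or categorical columns"
      }
    ]
  }
}
""")

curves, errors = pdp_curves(sample)
for name, class_index, x_values, y_values in curves:
    print(name, class_index, list(zip(x_values, y_values)))
print(errors)
```

Each tuple in curves holds the x-axis values (feature_values) and the y-axis values (one list from model_predictions) for a single class, so it can be passed to any plotting library.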