Amazon SageMaker
Developer Guide

Associate Prediction Results with Input Records

When making predictions on a large dataset, you can exclude attributes that aren't needed for prediction. After the predictions have been made, you can associate some of the excluded attributes with those predictions or with other input data in your report. By using batch transform to perform these data processing steps, you can often eliminate additional preprocessing or postprocessing. You can use input files in JSON and CSV format only.

Workflow for Associating Inferences with Input Records

The following diagram shows the workflow for associating inferences with input records.

To associate inferences with input data, there are three main steps:

  1. Filter out the input data that is not needed for inference before passing the data to the batch transform job. Use the InputFilter parameter to determine which attributes to use as input for the model.

  2. Associate the input data with the inference results. Use the JoinSource parameter to combine the input data with the inference.

  3. Filter the joined data to retain the inputs that are needed to provide context for interpreting the predictions in the reports. Use OutputFilter to store the specified portion of the joined dataset in the output file.
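The three steps above can be sketched in plain Python. This is only an illustration of the data flow, not how Amazon SageMaker implements it: the function associate, the stub model, and the sample record are all hypothetical, and the filters are passed in as ordinary Python callables instead of JSONPath expressions.

```python
from typing import Callable, List, Sequence


def associate(
    record: Sequence[float],
    input_filter: Callable[[Sequence[float]], Sequence[float]],
    model: Callable[[Sequence[float]], List[float]],
    output_filter: Callable[[List[float]], List[float]],
) -> List[float]:
    """Hypothetical stand-in for the filter, join, filter workflow."""
    model_input = input_filter(record)       # Step 1: keep only what the model needs
    inference = model(model_input)           # The model sees the filtered input only
    joined = list(record) + list(inference)  # Step 2: join original input and inference
    return output_filter(joined)             # Step 3: keep only what the report needs


def stub_model(features: Sequence[float]) -> List[float]:
    """Placeholder model (an assumption): returns the sum of the features."""
    return [sum(features)]


record = [101, 0.5, 1.5, 2.0]  # made-up record: an ID followed by three features
result = associate(
    record,
    input_filter=lambda r: r[1:],            # drop the ID before inference
    model=stub_model,
    output_filter=lambda j: [j[0], j[-1]],   # keep the ID and the prediction
)
print(result)  # [101, 4.0]
```

The ID never reaches the model, yet it appears next to the prediction in the final output.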

Use Data Processing in Batch Transform Jobs

When creating a batch transform job with CreateTransformJob to process data:

  1. Specify the portion of the input to pass to the model with the InputFilter parameter in the DataProcessing data structure.

  2. Join the raw input data with the transformed data with the JoinSource parameter.

  3. Specify which portion of the joined input and transformed data from the batch transform job to include in the output file with the OutputFilter parameter.

  4. Choose either JSON- or CSV-formatted files for input:

    • For JSON- or JSON Lines-formatted input files, Amazon SageMaker either adds the SageMakerOutput attribute to the input file or creates a new JSON output file with the SageMakerInput and SageMakerOutput attributes. For more information, see DataProcessing.

    • For CSV-formatted input files, the joined input data is followed by the transformed data and the output is a CSV file.
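As a rough sketch of how these parameters fit into a CreateTransformJob request, the following helper builds the request as a Python dictionary. The helper name build_transform_request, the job and model names, the S3 locations, and the instance type are placeholders chosen for illustration. The defaults for the three DataProcessing values are "$", "None", and "$".

```python
def build_transform_request(
    job_name: str,
    model_name: str,
    input_s3_uri: str,
    output_s3_path: str,
    input_filter: str = "$",
    join_source: str = "None",
    output_filter: str = "$",
) -> dict:
    """Build a CreateTransformJob request for CSV input (illustrative helper)."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_s3_uri}
            },
            "ContentType": "text/csv",
            "SplitType": "Line",
        },
        "TransformOutput": {
            "S3OutputPath": output_s3_path,
            "Accept": "text/csv",
            "AssembleWith": "Line",
        },
        # Placeholder resources; choose values that suit your workload.
        "TransformResources": {"InstanceType": "ml.m5.xlarge", "InstanceCount": 1},
        "DataProcessing": {
            "InputFilter": input_filter,
            "JoinSource": join_source,
            "OutputFilter": output_filter,
        },
    }


request = build_transform_request(
    job_name="example-transform-job",               # hypothetical name
    model_name="example-model",                     # hypothetical name
    input_s3_uri="s3://example-bucket/input/",      # hypothetical location
    output_s3_path="s3://example-bucket/output/",   # hypothetical location
    input_filter="$[1:]",
    join_source="Input",
)
print(request["DataProcessing"])
```

With valid AWS credentials and an existing model, a dictionary like this could be passed to boto3.client("sagemaker").create_transform_job(**request).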

If you use an algorithm with the DataProcessing structure, it must support your chosen format for both input and output files. For example, in the CreateTransformJob API, you must set both the ContentType parameter (in the TransformInput field) and the Accept parameter (in the TransformOutput field) to one of the following values: text/csv, application/json, or application/jsonlines. The syntax for specifying columns in a CSV file differs from the syntax for specifying attributes in a JSON file. Using the wrong syntax causes an error. For more information, see Batch Transform Examples. For more information about input and output file formats for built-in algorithms, see Use Amazon SageMaker Built-in Algorithms.

The record delimiters for the input and output must also be consistent with your chosen input format. The SplitType parameter indicates how to split the records in the input dataset. The AssembleWith parameter indicates how to reassemble the records for the output. If you set input and output formats to text/csv, you must also set the SplitType and AssembleWith parameters to line. If you set the input and output formats to application/jsonlines, you can set both SplitType and AssembleWith to either none or line.
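The delimiter rules above can be checked before submitting a job. The following is a minimal sketch: the function delimiter_problems and the ALLOWED table are hypothetical helpers, and the capitalized values "Line" and "None" assume the spelling used by the CreateTransformJob API. Formats that the rules above don't mention, such as application/json, are not checked.

```python
from typing import List

# Allowed SplitType / AssembleWith values per format, taken from the rules above.
ALLOWED = {
    "text/csv": {"Line"},
    "application/jsonlines": {"None", "Line"},
}


def delimiter_problems(
    content_type: str, accept: str, split_type: str, assemble_with: str
) -> List[str]:
    """Return human-readable problems; an empty list means no rule was violated."""
    problems = []
    allowed_split = ALLOWED.get(content_type)
    if allowed_split is not None and split_type not in allowed_split:
        problems.append(f"SplitType {split_type!r} is not valid for {content_type}")
    allowed_assemble = ALLOWED.get(accept)
    if allowed_assemble is not None and assemble_with not in allowed_assemble:
        problems.append(f"AssembleWith {assemble_with!r} is not valid for {accept}")
    return problems


print(delimiter_problems("text/csv", "text/csv", "Line", "Line"))  # []
print(delimiter_problems("text/csv", "text/csv", "None", "Line"))
```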

For JSON files, the attribute name SageMakerOutput is reserved for output. The JSON input file can't have an attribute with this name. If it does, the data in the input file might be overwritten.
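One way to guard against this is to scan the input records before starting the job. The sketch below assumes JSON Lines input and checks only top-level attributes. The function name uses_reserved_name is hypothetical.

```python
import json

RESERVED_ATTRIBUTE = "SageMakerOutput"


def uses_reserved_name(json_line: str) -> bool:
    """Return True if a JSON record has a top-level attribute named SageMakerOutput."""
    record = json.loads(json_line)
    return isinstance(record, dict) and RESERVED_ATTRIBUTE in record


lines = [
    '{"id": 1, "features": [0.1, 0.2]}',
    '{"id": 2, "SageMakerOutput": "stale value"}',
]
# Report 1-based line numbers of records that would collide with the output.
conflicts = [n for n, line in enumerate(lines, start=1) if uses_reserved_name(line)]
print(conflicts)  # [2]
```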

Supported JSONPath Operators

To filter and join the input data and inference, use a JSONPath subexpression. The following table lists the supported JSONPath operators.

| JSONPath Operator | Description | Example |
|---|---|---|
| $ | The root element to a query. This operator is required at the beginning of all path expressions. | "$" |
| .<name> | A dot-notated child element. | "$.id" |
| * | A wildcard. Use in place of an attribute name or numeric value. | "$.id.*" |
| ['<name>' (,'<name>')] | A bracket-notated element or multiple child elements. | "$['id','SageMakerOutput']" |
| [<number> (,<number>)] | An index or array of indexes. Negative index values are also supported. A -1 index refers to the last element in an array. | $[1], $[1,3,5] |
| [<start>:<end>] | An array slice operator. The array slice() method extracts a section of an array and returns a new array. If you omit <start>, Amazon SageMaker uses the first element of the array. If you omit <end>, Amazon SageMaker uses the last element of the array. | $[2:5], $[:5], $[2:] |

Note

Amazon SageMaker supports only a subset of the defined JSONPath operators. For more information about JSONPath operators, see JsonPath on GitHub.
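For readers who think in Python, the operators in the table have close analogues in list indexing and dictionary access. The comparison below is an analogy only. It assumes that the slice operator excludes the <end> index, as Python slices do, and the sample row and record are made up.

```python
# A CSV record is addressed as an array of columns.
row = [10, 20, 30, 40, 50, 60]

single = row[1]                         # $[1]      one index
several = [row[i] for i in (1, 3, 5)]   # $[1,3,5]  several indexes
last = row[-1]                          # $[-1]     last element
middle = row[2:5]                       # $[2:5]    slice
head = row[:5]                          # $[:5]     slice from the first element
tail = row[2:]                          # $[2:]     slice through the last element

# A JSON record is addressed by attribute name.
record = {"id": 7, "features": [1, 2], "SageMakerOutput": {"label": 1}}

root = record                           # $         the whole record
by_dot = record["id"]                   # $.id      dot-notated child
by_bracket = {k: record[k] for k in ("id", "SageMakerOutput")}  # $['id','SageMakerOutput']

print(several, middle, by_bracket)
```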

Batch Transform Examples

The following examples show some common ways to join input data with prediction results.

Example: Output Only Inferences

By default, the DataProcessing parameter doesn't join inference results with input. It outputs only the inference results.

To explicitly specify that results should not be joined with input, use the Amazon SageMaker Python SDK and specify the following settings in a transformer call.

    sm_transformer = sagemaker.transformer.Transformer(…)
    sm_transformer.transform(…, input_filter="$", join_source="None", output_filter="$")

The following code shows the default behavior. To output only inferences using the AWS SDK for Python (Boto 3), add it to your CreateTransformJob request.

    {
        "DataProcessing": {
            "InputFilter": "$",
            "JoinSource": "None",
            "OutputFilter": "$"
        }
    }

Example: Output Input Data and Inferences

If you're using the Amazon SageMaker Python SDK, to combine the input data with the inferences in the output file, specify "Input" for the JoinSource parameter in a transformer call.

    sm_transformer = sagemaker.transformer.Transformer(…)
    sm_transformer.transform(…, join_source="Input")

If you're using the AWS SDK for Python (Boto 3), join all input data with the inference by adding the following code to your CreateTransformJob request.

    {
        "DataProcessing": {
            "JoinSource": "Input"
        }
    }

For JSON or JSON Lines input files, the results are stored in the SageMakerOutput key, alongside the input attributes. For example, if the input is a JSON file that contains the key-value pair {"key":1}, the data transform result might be {"label":1}.

Amazon SageMaker stores both in the output file, with the transform result in the SageMakerOutput key.

{ "key":1, "SageMakerOutput":{"label":1} }

Note

The joined result for JSON must be a key-value pair object. If the input isn't a key-value pair object, Amazon SageMaker creates a new JSON file. In the new JSON file, the input data is stored in the SageMakerInput key and the results are stored as the SageMakerOutput value.

For a CSV file, for example, if the record is [1,2,3], and the label result is [1], then the output file would contain [1,2,3,1].
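The join rules described above for JSON and CSV records can be mimicked in a few lines of Python. This is a sketch of the described behavior, not Amazon SageMaker's implementation. The helpers join_json and join_csv are hypothetical names.

```python
from typing import Any, List, Sequence


def join_json(input_record: Any, inference: Any) -> dict:
    """Join a JSON input record with its inference.

    A key-value object gains a SageMakerOutput key. Any other input is
    wrapped in a new object with SageMakerInput and SageMakerOutput keys.
    """
    if isinstance(input_record, dict):
        joined = dict(input_record)  # copy so the caller's record is untouched
        joined["SageMakerOutput"] = inference
        return joined
    return {"SageMakerInput": input_record, "SageMakerOutput": inference}


def join_csv(input_row: Sequence[Any], inference_row: Sequence[Any]) -> List[Any]:
    """Join a CSV record with its inference by appending the inference columns."""
    return list(input_row) + list(inference_row)


print(join_json({"key": 1}, {"label": 1}))   # {'key': 1, 'SageMakerOutput': {'label': 1}}
print(join_json([1, 2, 3], {"label": 1}))    # wrapped in SageMakerInput / SageMakerOutput
print(join_csv([1, 2, 3], [1]))              # [1, 2, 3, 1]
```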

Example: Output an ID Column with Results and Exclude the ID Column from the Input (CSV)

If you are using the Amazon SageMaker Python SDK, to include results or an ID column in the output, specify indexes of the joined dataset in a transformer call. For example, if your data includes five columns and the first one is the ID column, use the following transformer request.

    sm_transformer = sagemaker.transformer.Transformer(…)
    sm_transformer.transform(…, input_filter="$[1:]", join_source="Input", output_filter="$[0,5:]")

If you are using the AWS SDK for Python (Boto 3), add the following code to your CreateTransformJob request.

    {
        "DataProcessing": {
            "InputFilter": "$[1:]",
            "JoinSource": "Input",
            "OutputFilter": "$[0,5:]"
        }
    }

To specify columns in Amazon SageMaker, index the array elements. The first column is 0, the second column is 1, and the sixth column is 5. To exclude the first column from the input, set InputFilter to "$[1:]".
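After a job like this finishes, each output line carries the ID column followed by the inference columns, so the predictions can be looked up by ID. The following sketch parses such output with Python's csv module. The sample text and the helper name predictions_by_id are invented for illustration, and the sketch assumes the inference columns are numeric.

```python
import csv
import io
from typing import Dict, List

# Hypothetical output content: an ID column followed by one prediction column.
sample_output = "id-001,0.91\nid-002,0.13\n"


def predictions_by_id(csv_text: str) -> Dict[str, List[float]]:
    """Map each record ID (first column) to its numeric inference columns."""
    reader = csv.reader(io.StringIO(csv_text))
    return {row[0]: [float(value) for value in row[1:]] for row in reader if row}


lookup = predictions_by_id(sample_output)
print(lookup["id-002"])  # [0.13]
```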

Example: Output an ID Attribute with Results and Exclude the ID Attribute from the Input (JSON)

If you are using the Amazon SageMaker Python SDK, include results or an ID attribute in the output by specifying it in a transformer call. For example, if you store data in the features attribute and the record ID in the id attribute, you would use the following transformer request.

    sm_transformer = sagemaker.transformer.Transformer(…)
    sm_transformer.transform(…, input_filter="$.features", join_source="Input", output_filter="$['id','SageMakerOutput']")

If you are using the AWS SDK for Python (Boto 3), add the following code to your CreateTransformJob request.

    {
        "DataProcessing": {
            "InputFilter": "$.features",
            "JoinSource": "Input",
            "OutputFilter": "$['id','SageMakerOutput']"
        }
    }
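To make the effect of these three settings concrete, the sketch below walks one made-up JSON record through them in plain Python. The record, the stub model, and its label output are assumptions for illustration. A real job applies the JSONPath expressions on the service side.

```python
record = {"id": 42, "features": [3, 4]}  # hypothetical input record


def stub_model(features):
    """Placeholder model (an assumption): label is 1 when the features sum past 5."""
    return {"label": 1 if sum(features) > 5 else 0}


# InputFilter "$.features": only the features attribute reaches the model.
model_input = record["features"]
inference = stub_model(model_input)

# JoinSource "Input": the inference is added to the original record.
joined = {**record, "SageMakerOutput": inference}

# OutputFilter "$['id','SageMakerOutput']": keep the ID and the inference only.
output = {key: joined[key] for key in ("id", "SageMakerOutput")}

print(output)  # {'id': 42, 'SageMakerOutput': {'label': 1}}
```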

Warning

If you are using a JSON-formatted input file, the file can't contain the attribute name SageMakerOutput. This attribute name is reserved for the output file. If your JSON-formatted input file contains an attribute with this name, values in the input file might be overwritten with the inference.