Match input data using a matching workflow - AWS Entity Resolution

A matching workflow is a data processing job that combines and compares data from different input sources and uses the matching techniques you choose to determine which records match. It produces a data output table.

When you create a matching workflow, you first specify your data inputs and normalization steps, and then choose your matching techniques and data output. AWS Entity Resolution reads your data from the location or locations that you specify and finds matches between two or more records in your data. It then assigns a Match ID to the records in each matched set. AWS Entity Resolution then writes data output files to a location that you choose. If desired, you can have AWS Entity Resolution hash the output data, which helps you maintain control over your data.
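As a rough illustration of the pieces described above (inputs, normalization, matching technique, and output with optional hashing), the following Python sketch assembles a request for the `CreateMatchingWorkflow` API operation. The helper name `build_matching_workflow_request`, the rule name `email_exact`, and the `email` and `full_name` fields are hypothetical placeholders, and the sketch assumes a rule-based workflow; adjust it to your own schema mapping.

```python
from typing import Any, Dict


def build_matching_workflow_request(
    workflow_name: str,
    input_table_arn: str,
    schema_name: str,
    output_s3_path: str,
    role_arn: str,
) -> Dict[str, Any]:
    """Assemble keyword arguments for a rule-based matching workflow.

    Hypothetical helper: the field and rule names below are placeholders
    and must correspond to what your schema mapping defines.
    """
    return {
        "workflowName": workflow_name,
        "roleArn": role_arn,
        # Data input: where to read from and which schema mapping to use.
        "inputSourceConfig": [
            {
                "inputSourceARN": input_table_arn,
                "schemaName": schema_name,
                "applyNormalization": True,  # normalize values before matching
            }
        ],
        # Matching technique: rule-based in this sketch.
        "resolutionTechniques": {
            "resolutionType": "RULE_MATCHING",
            "ruleBasedProperties": {
                "attributeMatchingModel": "ONE_TO_ONE",
                "rules": [
                    {"ruleName": "email_exact", "matchingKeys": ["email"]},
                ],
            },
        },
        # Data output: destination and which fields to hash.
        "outputSourceConfig": [
            {
                "outputS3Path": output_s3_path,
                "output": [
                    {"name": "email", "hashed": True},
                    {"name": "full_name", "hashed": False},
                ],
            }
        ],
    }


# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   client = boto3.client("entityresolution")
#   client.create_matching_workflow(**request)
```

Building the request as a plain dictionary keeps the sketch testable without AWS credentials; the commented lines show where the actual service call would go.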

A matching workflow can have multiple runs. The results of each run (successes or errors) are written to a folder that has the jobId as its name.

The data output contains both successful matches and errors, and it can contain multiple fields. The successful results are written to a success folder that contains multiple files, and each file contains a subset of the successful records. Similarly, errors are written to an error folder that contains multiple files, and each file contains a subset of the error records. For more information about troubleshooting errors, see Troubleshooting matching workflows.
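The folder layout described above can be sketched as a small helper that sorts the object keys of one run into success and error files. The exact key prefix is an assumption here (the sample keys such as `out/wf/job-1/success/part-0.csv` are invented for illustration); the sketch relies only on the jobId folder and the success and error folders mentioned above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_output_keys(keys: Iterable[str], job_id: str) -> Dict[str, List[str]]:
    """Split the object keys of one job run into 'success' and 'error' lists.

    Hypothetical helper: assumes each key contains a path component equal
    to the jobId, followed by a 'success' or 'error' folder.
    """
    grouped: Dict[str, List[str]] = defaultdict(list)
    for key in keys:
        parts = key.split("/")
        if job_id not in parts:
            continue  # key belongs to a different run
        # Look only at the path components after the job folder.
        tail = parts[parts.index(job_id) + 1:]
        if "success" in tail:
            grouped["success"].append(key)
        elif "error" in tail:
            grouped["error"].append(key)
    return dict(grouped)


# Invented sample keys covering two runs of the same workflow.
sample_keys = [
    "out/wf/job-1/success/part-0.csv",
    "out/wf/job-1/success/part-1.csv",
    "out/wf/job-1/error/part-0.csv",
    "out/wf/job-2/success/part-0.csv",
]
job_1_files = group_output_keys(sample_keys, "job-1")
```

In practice the keys would come from listing the output location that you chose for the workflow.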

The following diagram summarizes how to create a matching workflow.

[Diagram: A summary of the four steps to create a matching workflow in AWS Entity Resolution.]

Before you create a matching workflow, you must first create a schema mapping. For more information, see Creating a schema mapping.

There are three ways to create a matching workflow, based on matching techniques: rule-based, machine learning-based, or provider service-based.

After you create and run a matching workflow, you can do the following:

For example, to save provider subscription costs, you can first run rule-based matching to find matches in your data. Then, you can send only a subset of the unmatched records to provider service-based matching.
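One way to pick out the unmatched records from a rule-based run is sketched below in Python. It assumes that the output rows carry `RecordId` and `MatchID` columns and that a record whose Match ID is shared with no other record can be treated as unmatched; both the column names and that interpretation are assumptions for illustration, and the sample data is invented.

```python
import csv
import io
from collections import Counter
from typing import Dict, List


def unmatched_records(
    rows: List[Dict[str, str]], match_id_field: str = "MatchID"
) -> List[Dict[str, str]]:
    """Return rows whose Match ID appears exactly once in the output.

    Hypothetical helper: treats a singleton Match ID as "no match found".
    """
    counts = Counter(row[match_id_field] for row in rows)
    return [row for row in rows if counts[row[match_id_field]] == 1]


# Invented sample of rule-based output; the column names are assumptions.
SAMPLE_OUTPUT = """RecordId,MatchID,email
r1,m-001,ana@example.com
r2,m-001,ana@example.com
r3,m-002,li@example.com
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_OUTPUT)))
leftover = unmatched_records(rows)  # candidates for provider service-based matching
```

Only the rows in `leftover` would then be passed to the second, provider service-based workflow, which is where the cost saving comes from.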