Record matching with AWS Lake Formation FindMatches

Note

Record matching is currently unavailable in the following Regions in the AWS Glue console: Middle East (UAE), Europe (Spain), Asia Pacific (Jakarta), and Europe (Zurich).

AWS Lake Formation provides machine learning capabilities to create custom transforms to cleanse your data. There is currently one available transform named FindMatches. The FindMatches transform enables you to identify duplicate or matching records in your dataset, even when the records do not have a common unique identifier and no fields match exactly. This will not require writing any code or knowing how machine learning works. FindMatches can be useful in many different problems, such as:

Matching customers: Linking customer records across different customer databases, even when many customer fields do not match exactly across the databases (e.g. different name spelling, address differences, missing or inaccurate data, etc).
Matching products: Matching products in your catalog against other product sources, such as product catalog against a competitor's catalog, where entries are structured differently.
Improving fraud detection: Identifying duplicate customer accounts, determining when a newly created account is (or might be) a match for a previously known fraudulent user.
Other matching problems: Match addresses, movies, parts lists, etc etc. In general, if a human being could look at your database rows and determine that they were a match, there is a really good chance that the FindMatches transform can help you.

You can create these transforms when you create a job. The transform that you create is based on a source data store schema and example data from the source data set that you label (we call this process "teaching" a transform). The records that you label must be present in the source dataset. In this process we generate a file which you label and then upload back which the transform would in a manner learn from. After you teach your transform, you can call it from your Spark-based AWS Glue job (PySpark or Scala Spark) and use it in other scripts with a compatible source data store.

After the transform is created, it is stored in AWS Glue. On the AWS Glue console, you can manage the transforms that you create. In the navigation pane under Data Integration and ETL, Data classification tools > Record Matching, you can edit and continue to teach your machine learning transform. For more information about managing transforms on the console, see Working with machine learning transforms.

Note

AWS Glue version 2.0 FindMatches jobs use the Amazon S3 bucket aws-glue-temp-<accountID>-<region> to store temporary files while the transform is processing data. You can delete this data after the run has completed, either manually or by setting an Amazon S3 Lifecycle rule.

Types of machine learning transforms

You can create machine learning transforms to cleanse your data. You can call these transforms from your ETL script. Your data passes from transform to transform in a data structure called a DynamicFrame, which is an extension to an Apache Spark SQL DataFrame. The DynamicFrame contains your data, and you reference its schema to process your data.

The following types of machine learning transforms are available:

Find matches

Finds duplicate records in the source data. You teach this machine learning transform by labeling example datasets to indicate which rows match. The machine learning transform learns which rows should be matches the more you teach it with example labeled data. Depending on how you configure the transform, the output is one of the following:

A copy of the input table plus a match_id column filled in with values that indicate matching sets of records. The match_id column is an arbitrary identifier. Any records which have the same match_id have been identified as matching to each other. Records with different match_id's do not match.
A copy of the input table with duplicate rows removed. If multiple duplicates are found, then the record with the lowest primary key is kept.

Find incremental matches

The Find matches transform can also be configured to find matches across the existing and incremental frames and return as output a column containing a unique ID per match group.

For more information, see: Finding incremental matches

Using the FindMatches transform

You can use the FindMatches transform to find duplicate records in the source data. A labeling file is generated or provided to help teach the transform.

Note

Currently, FindMatches transforms that use a custom encryption key aren't supported in the following Regions:

Asia Pacific (Osaka) - ap-northeast-3

To get started with the FindMatches transform, you can follow the steps below. For a more advanced and detailed example, see the AWS Big Data Blog: Harmonize data using AWS Glue and AWS Lake Formation FindMatches ML to build a customer 360 view .

Getting started using the Find Matches transform

Follow these steps to get started with the FindMatches transform:

Create a table in the AWS Glue Data Catalog for the source data that is to be cleaned. For information about how to create a crawler, see Working with Crawlers on the AWS Glue Console.

If your source data is a text-based file such as a comma-separated values (CSV) file, consider the following:
- Keep your input record CSV file and labeling file in separate folders. Otherwise, the AWS Glue crawler might consider them as multiple parts of the same table and create tables in the Data Catalog incorrectly.
- Unless your CSV file includes ASCII characters only, ensure that UTF-8 without BOM (byte order mark) encoding is used for the CSV files. Microsoft Excel often adds a BOM in the beginning of UTF-8 CSV files. To remove it, open the CSV file in a text editor, and resave the file as UTF-8 without BOM.
On the AWS Glue console, create a job, and choose the Find matches transform type.

Important
The data source table that you choose for the job can't have more than 100 columns.
Tell AWS Glue to generate a labeling file by choosing Generate labeling file. AWS Glue takes the first pass at grouping similar records for each labeling_set_id so that you can review those groupings. You label matches in the label column.
- If you already have a labeling file, that is, an example of records that indicate matching rows, upload the file to Amazon Simple Storage Service (Amazon S3). For information about the format of the labeling file, see Labeling file format. Proceed to step 4.
Download the labeling file and label the file as described in the Labeling section.
Upload the corrected labelled file. AWS Glue runs tasks to teach the transform how to find matches.

On the Machine learning transforms list page, choose the History tab. This page indicates when AWS Glue performs the following tasks:
- Import labels
- Export labels
- Generate labels
- Estimate quality
To create a better transform, you can iteratively download, label, and upload the labelled file. In the initial runs, a lot more records might be mismatched. But AWS Glue learns as you continue to teach it by verifying the labeling file.
Evaluate and tune your transform by evaluating performance and results of finding matches. For more information, see Tuning machine learning transforms in AWS Glue.

Labeling

When FindMatches generates a labeling file, records are selected from your source table. Based on previous training, FindMatches identifies the most valuable records to learn from.

The act of labeling is editing a labeling file (we suggest using a spreadsheet such as Microsoft Excel) and adding identifiers, or labels, into the label column that identifies matching and nonmatching records. It is important to have a clear and consistent definition of a match in your source data. FindMatches learns from which records you designate as matches (or not) and uses your decisions to learn how to find duplicate records.

When a labeling file is generated by FindMatches, approximately 100 records are generated. These 100 records are typically divided into 10 labeling sets, where each labeling set is identified by a unique labeling_set_id generated by FindMatches. Each labeling set should be viewed as a separate labeling task independent of the other labeling sets. Your task is to identify matching and non-matching records within each labeling set.

Tips for editing labeling files in a spreadsheet

When editing the labeling file in a spreadsheet application, consider the following:

The file might not open with column fields fully expanded. You might need to expand the labeling_set_id and label columns to see content in those cells.
If the primary key column is a number, such as a long data type, the spreadsheet might interpret it as a number and change the value. This key value must be treated as text. To correct this problem, format all the cells in the primary key column as Text data.

Labeling file format

The labeling file that is generated by AWS Glue to teach your FindMatches transform uses the following format. If you generate your own file for AWS Glue, it must follow this format as well:

It is a comma-separated values (CSV) file.
It must be encoded in UTF-8. If you edit the file using Microsoft Windows, it might be encoded with cp1252.
It must be in an Amazon S3 location to pass it to AWS Glue.
Use a moderate number of rows for each labeling task. 10–20 rows per task are recommended, although 2–30 rows per task are acceptable. Tasks larger than 50 rows are not recommended and may cause poor results or system failure.
If you have already-labeled data consisting of pairs of records labeled as a "match" or a "no-match", this is fine. These labeled pairs can be represented as labeling sets of size 2. In this case label both records with, for instance, a letter "A" if they match, but label one as "A" and one as "B" if they do not match.

Note
Because it has additional columns, the labeling file has a different schema from a file that contains your source data. Place the labeling file in a different folder from any transform input CSV file so that the AWS Glue crawler does not consider it when it creates tables in the Data Catalog. Otherwise, the tables created by the AWS Glue crawler might not correctly represent your data.
The first two columns (labeling_set_id, label) are required by AWS Glue. The remaining columns must match the schema of the data that is to be processed.
For each labeling_set_id, you identify all matching records by using the same label. A label is a unique string placed in the label column. We recommend using labels that contain simple characters, such as A, B, C, and so on. Labels are case sensitive and are entered in the label column.
Rows that contain the same labeling_set_id and the same label are understood to be labeled as a match.
Rows that contain the same labeling_set_id and a different label are understood to be labeled as not a match

Rows that contain a different labeling_set_id are understood to be conveying no information for or against matching.

The following is an example of labeling the data:

labeling_set_id	label	first_name	last_name	Birthday
ABC123	A	John	Doe	04/01/1980
ABC123	B	Jane	Smith	04/03/1980
ABC123	A	Johnny	Doe	04/01/1980
ABC123	A	Jon	Doe	04/01/1980
DEF345	A	Richard	Jones	12/11/1992
DEF345	A	Rich	Jones	11/12/1992
DEF345	B	Sarah	Jones	12/11/1992
DEF345	C	Richie	Jones Jr.	05/06/2017
DEF345	B	Sarah	Jones-Walker	12/11/1992
GHI678	A	Robert	Miller	1/3/1999
GHI678	A	Bob	Miller	1/3/1999
XYZABC	A	William	Robinson	2/5/2001
XYZABC	B	Andrew	Robinson	2/5/1971

In the above example we identify John/Johnny/Jon Doe as being a match and we teach the system that these records do not match Jane Smith. Separately, we teach the system that Richard and Rich Jones are the same person, but that these records are not a match to Sarah Jones/Jones-Walker and Richie Jones Jr.
As you can see, the scope of the labels is limited to the labeling_set_id. So labels do not cross labeling_set_id boundaries. For example, a label "A" in labeling_set_id 1 does not have any relation to label "A" in labeling_set_id 2.
If a record does not have any matches within a labeling set, then assign it a unique label. For instance, Jane Smith does not match any record in labeling set ABC123, so it is the only record in that labeling set with the label of B.
The labeling set "GHI678" shows that a labeling set can consist of just two records which are given the same label to show that they match. Similarly, "XYZABC" shows two records given different labels to show that they do not match.
Note that sometimes a labeling set may contain no matches (that is, you give every record in the labeling set a different label) or a labeling set might all be "the same" (you gave them all the same label). This is okay as long as your labeling sets collectively contain examples of records that are and are not "the same" by your criteria.

Important

Confirm that the IAM role that you pass to AWS Glue has access to the Amazon S3 bucket that contains the labeling file. By convention, AWS Glue policies grant permission to Amazon S3 buckets or folders whose names are prefixed with aws-glue-. If your labeling files are in a different location, add permission to that location in the IAM role.

Warning Javascript is disabled or is unavailable in your browser.

To use the Amazon Web Services Documentation, Javascript must be enabled. Please refer to your browser's Help pages for instructions.

Document Conventions

Streaming ETL jobs

Tuning machine learning transforms