Tuning Machine Learning Transforms in AWS Glue

You can tune your machine learning transforms in AWS Glue so that the results of your data-cleansing jobs meet your objectives. To improve a transform, you can teach it by generating a labeling set, adding labels, and repeating these steps until you get the results you want. You can also tune a transform by changing some of its machine learning parameters.

For more information about machine learning transforms, see Matching Records with AWS Lake Formation FindMatches.
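
The teach-and-repeat loop described above can also be driven from code. The following is a minimal sketch, assuming the boto3 Glue client and its start_ml_labeling_set_generation_task_run and start_import_labels_task_run operations. The function names, the transform ID, and the S3 paths are hypothetical placeholders, and the client is passed in as an argument so that the functions can be tried with a stub. Adding the labels to the generated file still happens outside this code.

```python
def start_labeling_set(glue, transform_id, output_s3_path):
    """Ask AWS Glue to generate a labeling set for a transform.

    `glue` is assumed to be a boto3 Glue client (or a compatible stub).
    Returns the ID of the task run that was started.
    """
    response = glue.start_ml_labeling_set_generation_task_run(
        TransformId=transform_id,
        OutputS3Path=output_s3_path,
    )
    return response["TaskRunId"]


def import_labels(glue, transform_id, labeled_s3_path, replace_all=False):
    """Upload a labeled file so the transform can learn from it.

    With replace_all=False the new labels are added to the existing
    ones, which suits repeating the labeling step several times.
    """
    response = glue.start_import_labels_task_run(
        TransformId=transform_id,
        InputS3Path=labeled_s3_path,
        ReplaceAllLabels=replace_all,
    )
    return response["TaskRunId"]


# Hypothetical usage (requires AWS credentials and an existing transform):
#   import boto3
#   glue = boto3.client("glue")
#   start_labeling_set(glue, "tfm-EXAMPLE", "s3://example-bucket/to-label/")
#   ... add labels to the generated file, then:
#   import_labels(glue, "tfm-EXAMPLE", "s3://example-bucket/labeled/labels.csv")
```

Both calls start asynchronous task runs, so a real script would wait for each run to finish before moving on to the next round.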

To understand the measurements that are used to tune your machine learning transform, you should be familiar with the following terminology:

True positive (TP)

A match in the data that the transform correctly found, sometimes called a hit.

True negative (TN)

A nonmatch in the data that the transform correctly rejected.

False positive (FP)

A nonmatch in the data that the transform incorrectly classified as a match, sometimes called a false alarm.

False negative (FN)

A match in the data that the transform didn't find, sometimes called a miss.

For more information about the terminology that is used in machine learning, see Confusion matrix in Wikipedia.
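
To make the four terms concrete, the following sketch counts them for a small set of record pairs. It assumes that the true answer for each pair is known (True means the pair is a match); the function name and the sample data are illustrative and are not part of AWS Glue.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP, and FN for parallel lists of booleans.

    actual[i]    -- True if pair i really is a match
    predicted[i] -- True if the transform classified pair i as a match
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")

    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for is_match, said_match in zip(actual, predicted):
        if is_match and said_match:
            counts["TP"] += 1   # hit: a match that was found
        elif not is_match and not said_match:
            counts["TN"] += 1   # a nonmatch that was correctly rejected
        elif said_match:
            counts["FP"] += 1   # false alarm: a nonmatch classified as a match
        else:
            counts["FN"] += 1   # miss: a match that was not found
    return counts


# Five illustrative pairs: two hits, one miss, one correct rejection,
# and one false alarm.
actual = [True, True, False, False, True]
predicted = [True, False, False, True, True]
print(confusion_counts(actual, predicted))
# prints {'TP': 2, 'TN': 1, 'FP': 1, 'FN': 1}
```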

To tune your machine learning transforms, you can change the value of the following measurements in the Advanced properties of the transform.

  • Precision measures how well the transform finds true positives among the total number of records that it identifies as positive (true positives and false positives). For more information, see Precision and recall in Wikipedia.

  • Recall measures how well the transform finds true positives among all of the actual matches in the source data (true positives and false negatives). For more information, see Precision and recall in Wikipedia.

  • Accuracy measures how well the transform finds true positives and true negatives. Increasing accuracy requires more machine resources and therefore costs more, but it also results in increased recall. For more information, see Accuracy and precision in Wikipedia.

  • Cost measures how many compute resources (and thus money) are consumed to run the transform.
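
The first three measurements can be computed from the four counts defined earlier. The sketch below uses the standard formulas, followed by a hedged example of setting the tuning values through the API instead of the console. It assumes that the Advanced properties correspond to the PrecisionRecallTradeoff and AccuracyCostTradeoff fields of FindMatchesParameters in the boto3 update_ml_transform operation, each a number from 0.0 to 1.0. The function names, the transform ID, and the column name are hypothetical.

```python
def precision(tp, fp):
    """TP / (TP + FP); returns 0.0 by convention when nothing was flagged."""
    flagged = tp + fp
    return tp / flagged if flagged else 0.0


def recall(tp, fn):
    """TP / (TP + FN); returns 0.0 by convention when there are no matches."""
    actual_matches = tp + fn
    return tp / actual_matches if actual_matches else 0.0


def accuracy(tp, tn, fp, fn):
    """(TP + TN) / all classified pairs; 0.0 by convention when empty."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def set_tradeoffs(glue, transform_id, primary_key,
                  precision_recall, accuracy_cost):
    """Update the two tuning values of a FindMatches transform.

    `glue` is assumed to be a boto3 Glue client (or a compatible stub).
    Values near 1.0 favor precision (or accuracy); values near 0.0
    favor recall (or lower cost).
    """
    for name, value in (("precision_recall", precision_recall),
                        ("accuracy_cost", accuracy_cost)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0")

    glue.update_ml_transform(
        TransformId=transform_id,
        Parameters={
            "TransformType": "FIND_MATCHES",
            "FindMatchesParameters": {
                "PrimaryKeyColumnName": primary_key,
                "PrecisionRecallTradeoff": precision_recall,
                "AccuracyCostTradeoff": accuracy_cost,
            },
        },
    )


# Illustrative counts: TP=2, TN=1, FP=1, FN=1
print(precision(2, 1), recall(2, 1), accuracy(2, 1, 1, 1))

# Hypothetical usage (requires AWS credentials and an existing transform):
#   import boto3
#   set_tradeoffs(boto3.client("glue"), "tfm-EXAMPLE", "record_id", 0.9, 0.5)
```

In this example, 0.9 would bias the transform toward precision (fewer false alarms at the risk of more misses), while 0.5 leaves accuracy and cost balanced.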