Configuring anomaly detection and generating insights

AWS Glue Data Quality (DQ) evaluates your data against the data quality rules that you write and provides insights and observations about your data over time so that you can take immediate action. As DQ scans your data, it computes statistical metrics such as row count and maximum or minimum values, and then compares them against the threshold expressions in your rules.
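For example, a threshold expression in a DQDL (Data Quality Definition Language) ruleset compares a computed statistic against a bound. A minimal sketch, with an illustrative column name:

    Rules = [
        RowCount > 0,
        Completeness "order_id" > 0.95
    ]

Here, RowCount checks the number of rows, and Completeness checks the ratio of non-null values in the "order_id" column against the 0.95 threshold.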

Some of the benefits of Data Quality anomaly detection include:

  • continuous automated scanning of data

  • detection of anomalies that can be indicative of an unintended event or statistical abnormality

  • rule recommendations that help you take action on observations found by Data Quality anomaly detection

This is useful if you:

  • want to detect anomalies in your data automatically, without the need to write data quality rules

  • want to profile your data and view visual representations of what the data looks like

  • want to track how your data changes over time

What observations can I view about my data?

DQ identifies outliers in the gathered data statistics, changes in data formats, data drift, and schema changes. Based on these observations, DQ recommends data quality rules that you can easily operationalize. Statistics include Completeness, Uniqueness, Mean, Sum, StandardDeviation, Entropy, DistinctValuesCount, and UniqueValueRatio.
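In DQDL, these statistics can be gathered without pass/fail thresholds by declaring analyzers. A minimal sketch, with illustrative column names:

    Analyzers = [
        Completeness "customer_id",
        Uniqueness "customer_id",
        Mean "order_total"
    ]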

Enabling anomaly detection in AWS Glue Studio

To enable anomaly detection, open an AWS Glue Studio job and toggle on Enable anomaly detection. Turning this on analyzes your data over time and provides data statistics and observations about your data that you can act on.

To enable anomaly detection in AWS Glue Studio:
  1. Choose the Data Quality node in your job, then choose the Anomaly detection tab. Toggle on Enable anomaly detection.

    The screenshot shows the toggle for "Enable anomaly detection" on.
  2. Define the data to monitor for anomalies by choosing Add analyzer. There are two fields you can populate: Statistics and Data.

    Statistics are information about your data’s shape and other properties. You can choose one or more statistics at a time, or choose All statistics. Statistics include: Completeness, Uniqueness, Mean, Sum, StandardDeviation, Entropy, DistinctValuesCount, and UniqueValueRatio.

    Data refers to the columns in your dataset. You can choose all columns or individual columns.

    The screenshot shows the fields for Statistics and Data. You can choose which statistics you want to apply to your dataset and on which columns.
  3. Choose Add anomaly detection scope to save your changes. When you’ve created analyzers, you can see them in the Anomaly detection scope section.

    You can also use the Actions menu to edit your analyzers, or choose the Ruleset editor tab and edit the analyzer directly in the ruleset editor notepad. You will see the analyzers you saved just below any rules you’ve created.

     Rules = [ ]
     Analyzers = [
         Completeness "id"
     ]

    With the updated ruleset and analyzers, Data Quality continuously monitors incoming data, signaling anomalies through alerts or job stops based on your settings.
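For example, a ruleset that combines an existing rule with anomaly detection analyzers might look like the following sketch (column names illustrative):

    Rules = [
        RowCount > 0
    ]
    Analyzers = [
        Completeness "id",
        Uniqueness "id"
    ]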

Note

Observations are generated when a minimum of three values per statistic have been observed in your dataset. If no observations are visible, Data Quality does not have enough data to generate one. After several job runs, Data Quality can provide insights into your data and displays them in the Observations section.

Analyzers generate observations by detecting anomalies in your data and provide recommendations to help you progressively build rules. You can view the observations by choosing the Data Quality tab. Observations are specific to each job run. The Data Quality node and job run appear at the top of the Observations section; choose a different node or job run to view the observations specific to it.

The screenshot shows the Data quality tab for a job and observations that are presented for the job run.

Observation – each insight is based on a specific job run, as configured by the rulesets and analyzers you specified.

Related metrics – When observations are generated, the Related metrics column shows you the rule, the actual and expected values, and the lower and upper limits.

Rule recommendations – AWS Glue also recommends rules to address the observation. You can copy a recommended rule by clicking the copy icon next to it, then apply all copied rules by clicking Apply copied rules.

Monitored data – The Monitored data column provides the column or row that was monitored and triggered the observation.
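Recommended rules typically bound a monitored statistic between the learned lower and upper limits. The values below are purely illustrative of the shape such a rule can take:

    Rules = [
        RowCount between 8500 and 9500,
        Completeness "id" between 0.9 and 1.0
    ]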

After an observation has been generated and a recommended rule is provided, you can apply that rule to your data quality node. To do this:

  1. Click the copy icon next to each rule recommendation. This will add the rule recommendation to a notepad that you can retrieve later.

  2. Click Apply rule recommendations. This opens the notepad where you can view the rules you previously copied.

  3. Choose Copy rules.

  4. Choose Apply to ruleset editor. This opens the ruleset editor where you can paste the copied rules.

  5. Paste the copied rules to the ruleset editor.
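After pasting, the recommended rules sit alongside the analyzers that continue to monitor your data. The resulting ruleset might look like this sketch (values illustrative):

    Rules = [
        RowCount between 8500 and 9500
    ]
    Analyzers = [
        Completeness "id"
    ]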