Data Quality Definition Language (DQDL) reference

Data Quality Definition Language (DQDL) is a domain-specific language that you use to define rules for AWS Glue Data Quality.

This guide introduces key DQDL concepts to help you understand the language. It also provides a reference for DQDL rule types with syntax and examples. Before you use this guide, we recommend that you have familiarity with AWS Glue Data Quality. For more information, see AWS Glue Data Quality.

Note

Dynamic rules are supported only in AWS Glue ETL.

DQDL syntax

A DQDL document is case sensitive and contains a ruleset, which groups individual data quality rules together. To construct a ruleset, you must create a list named Rules (capitalized), delimited by a pair of square brackets. The list should contain one or more comma-separated DQDL rules like the following example.

Rules = [ IsComplete "order-id", IsUnique "order-id" ]

Rule structure

The structure of a DQDL rule depends on the rule type. However, DQDL rules generally fit the following format.

<RuleType> <Parameter> <Parameter> <Expression>

RuleType is the case-sensitive name of the rule type that you want to configure. For example, IsComplete, IsUnique, or CustomSql. Rule parameters differ for each rule type. For a complete reference of DQDL rule types and their parameters, see DQDL rule type reference.
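For example, the following rule (taken from the rule type reference later in this guide) fits this format: Mean is the rule type, "Star_Rating" is a column parameter, and > 3 is the expression.

Mean "Star_Rating" > 3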

Composite rules

DQDL supports the following logical operators that you can use to combine rules. These rules are called Composite Rules.

and

The logical and operator results in true if and only if the rules that it connects are true. Otherwise, the combined rule results in false. Each rule that you connect with the and operator must be surrounded by parentheses.

The following example uses the and operator to combine two DQDL rules.

(IsComplete "id") and (IsUnique "id")
or

The logical or operator results in true if and only if one or more of the rules that it connects are true. Each rule that you connect with the or operator must be surrounded by parentheses.

The following example uses the or operator to combine two DQDL rules.

(RowCount "id" > 100) or (IsPrimaryKey "id")

You can use the same operator to connect multiple rules, so the following rule combination is allowed.

(Mean "Star_Rating" > 3) and (Mean "Order_Total" > 500) and (IsComplete "Order_Id")

However, you can't combine the logical operators into a single expression. For example, the following combination is not allowed.

(Mean "Star_Rating" > 3) and (Mean "Order_Total" > 500) or (IsComplete "Order_Id")

How Composite rules work

By default, Composite Rules are evaluated as individual rules across the entire dataset or table, and then the results are combined. In other words, each rule is first evaluated over the entire column, and then the operator is applied. The following example explains this default behavior:

# Dataset
+------+------+
|myCol1|myCol2|
+------+------+
|     2|     1|
|     0|     3|
+------+------+

# Overall outcome
+----------------------------------------------------------+-------+
|Rule                                                      |Outcome|
+----------------------------------------------------------+-------+
|(ColumnValues "myCol1" > 1) OR (ColumnValues "myCol2" > 2)|Failed |
+----------------------------------------------------------+-------+

In the above example, AWS Glue Data Quality first evaluates (ColumnValues "myCol1" > 1), which results in a failure. It then evaluates (ColumnValues "myCol2" > 2), which also fails. The combination of both results is noted as FAILED.

However, if you prefer SQL-like behavior, where the entire row must be evaluated, you have to explicitly set the compositeRuleEvaluation.method option, as shown in additionalOptions in the code snippet below.

object GlueApp {
    val datasource = glueContext.getCatalogSource(
        database = "<db>",
        tableName = "<table>",
        transformationContext = "datasource"
    ).getDynamicFrame()

    val ruleset = """
        Rules = [
            (ColumnValues "age" >= 26) OR (ColumnLength "name" >= 4)
        ]
    """

    val dq_results = EvaluateDataQuality.processRows(
        frame = datasource,
        ruleset = ruleset,
        additionalOptions = JsonOptions("""
            {
                "compositeRuleEvaluation.method": "ROW"
            }
        """)
    )
}

Once set, the composite rules will behave as a single rule evaluating the entire row. The following example illustrates this behavior.

# Row-level outcome
+------+------+------------------------------------------------------------+---------------------------+
|myCol1|myCol2|DataQualityRulesPass                                        |DataQualityEvaluationResult|
+------+------+------------------------------------------------------------+---------------------------+
|2     |1     |[(ColumnValues "myCol1" > 1) OR (ColumnValues "myCol2" > 2)]|Passed                     |
|0     |3     |[(ColumnValues "myCol1" > 1) OR (ColumnValues "myCol2" > 2)]|Passed                     |
+------+------+------------------------------------------------------------+---------------------------+

Some rules aren't supported by this feature because their overall outcome relies on thresholds or ratios. They are listed below.

Rules relying on ratios:

  • Completeness

  • DatasetMatch

  • ReferentialIntegrity

  • Uniqueness

Rules dependent on thresholds:

When the following rules include a with threshold expression, they are not supported. Rules that do not involve with threshold remain supported.

  • ColumnDataType

  • ColumnValues

  • CustomSQL

Expressions

Note

The not and != expressions are not supported in the AWS Glue Data Catalog.

If a rule type doesn't produce a Boolean response, you must provide an expression as a parameter in order to create a Boolean response. For example, the following rule checks the mean (average) of all the values in a column against an expression to return a true or false result.

Mean "colA" between 80 and 100

Some rule types such as IsUnique and IsComplete already return a Boolean response.
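For example, IsComplete takes only a column parameter, while the related Completeness rule type returns a completeness ratio and therefore needs an expression (the column name below is illustrative):

IsComplete "email"
Completeness "email" > 0.95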

The following list describes the expressions that you can use in DQDL rules.

Supported DQDL expressions

= x – Resolves to true if the rule type response is equal to x.
Examples: Completeness "colA" = "1.0", ColumnValues "colA" = "2022-06-30"

!= x – Resolves to true if the rule type response is not equal to x.
Examples: ColumnValues "colA" != "a", ColumnValues "colA" != "2022-06-30"

> x – Resolves to true if the rule type response is greater than x.
Example: ColumnValues "colA" > 10

< x – Resolves to true if the rule type response is less than x.
Examples: ColumnValues "colA" < 1000, ColumnValues "colA" < "2022-06-30"

>= x – Resolves to true if the rule type response is greater than or equal to x.
Example: ColumnValues "colA" >= 10

<= x – Resolves to true if the rule type response is less than or equal to x.
Example: ColumnValues "colA" <= 1000

between x and y – Resolves to true if the rule type response falls in the specified range (exclusive). Only use this expression type for numeric and date types.
Examples: Mean "colA" between 8 and 100, ColumnValues "colA" between "2022-05-31" and "2022-06-30"

not between x and y – Resolves to true if the rule type response does not fall in the specified range (inclusive). Only use this expression type for numeric and date types.
Example: ColumnValues "colA" not between "2022-05-31" and "2022-06-30"

in [a, b, c, ...] – Resolves to true if the rule type response is in the specified set.
Examples: ColumnValues "colA" in [ 1, 2, 3 ], ColumnValues "colA" in [ "a", "b", "c" ]

not in [a, b, c, ...] – Resolves to true if the rule type response is not in the specified set.
Examples: ColumnValues "colA" not in [ 1, 2, 3 ], ColumnValues "colA" not in [ "a", "b", "c" ]

matches /ab+c/i – Resolves to true if the rule type response matches a regular expression.
Example: ColumnValues "colA" matches "[a-zA-Z]*"

not matches /ab+c/i – Resolves to true if the rule type response does not match a regular expression.
Example: ColumnValues "colA" not matches "[a-zA-Z]*"

now() – Works only with the ColumnValues rule type to create a date expression.
Example: ColumnValues "load_date" > (now() - 3 days)

matches/in [...]/not matches/not in [...] with threshold – Specifies the percentage of values that must match the rule conditions. Works only with the ColumnValues, ColumnDataType, and CustomSQL rule types.
Examples: ColumnValues "colA" in ["A", "B"] with threshold > 0.8, ColumnValues "colA" matches "[a-zA-Z]*" with threshold between 0.2 and 0.9, ColumnDataType "colA" = "Timestamp" with threshold > 0.9

Keywords for NULL, EMPTY and WHITESPACES_ONLY

Note

Keywords for NULL, EMPTY and WHITESPACES_ONLY are not supported in AWS Glue Data Catalog.

If you want to validate whether a string column contains a null, an empty string, or a whitespace-only string, you can use the following keywords:

  • NULL / null – This keyword resolves to true for a null value in a string column.

    ColumnValues "colA" != NULL with threshold > 0.5 would return true if more than 50% of your data does not have null values.

    (ColumnValues "colA" = NULL) or (ColumnLength "colA" > 5) would return true for all rows that either have a null value or have a length greater than 5. Note that this requires the "compositeRuleEvaluation.method" = "ROW" option.

  • EMPTY / empty – This keyword resolves to true for an empty string ("") value in a string column. Some data formats transform nulls in a string column to empty strings. This keyword helps filter out empty strings in your data.

    (ColumnValues "colA" = EMPTY) or (ColumnValues "colA" in ["a", "b"]) would return true if a row is either empty, "a", or "b". Note that this requires the "compositeRuleEvaluation.method" = "ROW" option.

  • WHITESPACES_ONLY / whitespaces_only – This keyword resolves to true for a whitespace-only string (" ") value in a string column.

    ColumnValues "colA" not in ["a", "b", WHITESPACES_ONLY] would return true if a row is neither "a" nor "b" nor a whitespace-only string.


For a numeric or date-based expression, if you want to validate whether a column contains nulls, you can use the following keyword:

  • NULL / null – This keyword resolves to true for a null value in the column.

    ColumnValues "colA" in [NULL, "2023-01-01"] would return true if the dates in your column are either 2023-01-01 or null.

    (ColumnValues "colA" = NULL) or (ColumnValues "colA" between 1 and 9) would return true for all rows that either have a null value or have values between 1 and 9. Note that this requires the "compositeRuleEvaluation.method" = "ROW" option.


Dynamic rules

You can now author dynamic rules to compare current metrics produced by your rules with their historical values. These historical comparisons are enabled by using the last() operator in expressions. For example, the rule RowCount > last() will succeed when the number of rows in the current run is greater than the most recent prior row count for the same dataset. last() takes an optional natural number argument describing how many prior metrics to consider; last(k) where k >= 1 will reference the last k metrics.

  • If no data points are available, last(k) will return the default value 0.0 (see the example following this list).

  • If fewer than k metrics are available, last(k) will return all prior metrics.
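For example, on the very first run for a dataset, a rule such as RowCount > last() compares the current row count with the default value 0.0, so it passes whenever the dataset contains at least one row.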

To form a valid expression with last(k) where k > 1, you need an aggregation function to reduce the multiple historical results to a single number. For example, RowCount > avg(last(5)) checks whether the current dataset's row count is strictly greater than the average of the last five row counts for the same dataset. RowCount > last(5) produces an error, because the current dataset row count can't be meaningfully compared to a list.

Supported aggregation functions:

  • avg

  • median

  • max

  • min

  • sum

  • std (standard deviation)

  • abs (absolute value)

  • index(last(k), i) allows for selecting the ith most recent value out of the last k. i is zero-indexed, so index(last(3), 0) returns the most recent data point, while index(last(3), 3) results in an error, because there are only three data points and the expression attempts to index the fourth most recent one.

Sample expressions

ColumnCorrelation

  • ColumnCorrelation "colA" "colB" < avg(last(10))

DistinctValuesCount

  • DistinctValuesCount "colA" between min(last(10))-1 and max(last(10))+1

Most rule types with numeric conditions or thresholds support dynamic rules; see the provided table, Analyzers and Rules, to determine whether dynamic rules are supported for your rule type.

Analyzers

Note

Analyzers are not supported in AWS Glue Data Catalog.

DQDL rules use functions called analyzers to gather information about your data. This information is employed by a rule’s Boolean expression to determine whether the rule should succeed or fail. For example, the RowCount rule RowCount > 5 will use a row count analyzer to discover the number of rows in your dataset, and compare that count with the expression > 5 to check whether more than five rows exist in the current dataset.

Sometimes, instead of authoring rules, we recommend creating analyzers and having them generate statistics that can be used to detect anomalies. Analyzers differ from rules in the following ways.

Characteristic                                                     Analyzers  Rules
Part of ruleset                                                    Yes        Yes
Generates statistics                                               Yes        Yes
Generates observations                                             Yes        Yes
Can evaluate and assert a condition                                No         Yes
Can configure actions (stop the job on failure, continue the job)  No         Yes

Analyzers can exist independently of rules, so you can configure them quickly and progressively build data quality rules.
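For example, the following minimal ruleset (the column name colA is a placeholder) is a sketch that contains only analyzers and asserts no conditions; it gathers completeness and row count statistics for later use:

Analyzers = [
    Completeness "colA",
    RowCount
]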

Some rule types can be input in the Analyzers block of your ruleset to run the rules required for analyzers and gather information without applying checks for any condition. Some analyzers aren't associated with rules and can only be input in the Analyzers block. The following table indicates whether each item is supported as a rule or a standalone analyzer, along with additional details for each rule type.

Example Ruleset with Analyzer

The following ruleset uses:

  • a dynamic rule to check whether a dataset is growing over its trailing average for the last three job runs

  • a DistinctValuesCount analyzer to record the number of distinct values in the dataset's Name column

  • a ColumnLength analyzer to track minimum and maximum Name size over time

Analyzer metric results can be viewed in the Data Quality tab for your job run.

Rules = [
    RowCount > avg(last(3))
],
Analyzers = [
    DistinctValuesCount "Name",
    ColumnLength "Name"
]

Comments

You can use the '#' character to add a comment to your DQDL document. Anything after the '#' character and until the end of the line is ignored by DQDL.

Rules = [
    # More items should generally mean a higher price, so correlation should be positive
    ColumnCorrelation "price" "num_items" > 0
]

DQDL rule type reference

This section provides a reference for each rule type that AWS Glue Data Quality supports.

Note
  • DQDL doesn't currently support nested or list-type column data.

  • Bracketed values in the reference below will be replaced with the information provided in the rule arguments.

  • Rules typically require an additional expression argument.

The following reference describes each rule type: its description, arguments, and reported metrics, and whether it is supported as a rule, is supported as an analyzer, returns row-level results, supports dynamic rules, and generates observations.

AggregateMatch
    Description: Checks if two datasets match by comparing summary metrics like total sales amount. Useful for financial institutions to compare whether all data is ingested from source systems.
    Arguments: One or more aggregations.
    Reported metrics: Column.[Column].AggregateMatch when the first and second aggregation column names match; Column.[Column1,Column2].AggregateMatch when the first and second aggregation column names differ.
    Rule: Yes. Analyzer: No. Row-level results: No. Dynamic rules: No. Observations: No.

AllStatistics
    Description: Standalone analyzer to gather multiple metrics for the provided column, or for all columns in a dataset.
    Arguments: A single column name, or "AllColumns".
    Reported metrics: For columns of all types: Dataset.*.RowCount, Column.[Column].Completeness, Column.[Column].Uniqueness. Additional metrics for string-valued columns: ColumnLength metrics. Additional metrics for numeric-valued columns: ColumnValues metrics.
    Rule: No. Analyzer: Yes. Row-level results: No. Dynamic rules: No. Observations: No.

ColumnCorrelation
    Description: Checks how well two columns are correlated.
    Arguments: Exactly two column names.
    Reported metrics: Multicolumn.[Column1],[Column2].ColumnCorrelation.
    Rule: Yes. Analyzer: Yes. Row-level results: No. Dynamic rules: Yes. Observations: No.

ColumnCount
    Description: Checks if any columns are dropped.
    Arguments: None.
    Reported metrics: Dataset.*.ColumnCount.
    Rule: Yes. Analyzer: No. Row-level results: No. Dynamic rules: Yes. Observations: Yes.

ColumnDataType
    Description: Checks if a column is compliant with a datatype.
    Arguments: Exactly one column name.
    Reported metrics: Column.[Column].ColumnDataType.Compliance.
    Rule: Yes. Analyzer: No. Row-level results: No. Dynamic rules: Yes, in the row-level threshold expression. Observations: No.

ColumnExists
    Description: Checks if columns exist in a dataset. This allows customers building self-service data platforms to ensure certain columns are made available.
    Arguments: Exactly one column name.
    Reported metrics: N/A.
    Rule: Yes. Analyzer: No. Row-level results: No. Dynamic rules: No. Observations: No.

ColumnLength
    Description: Checks if the length of data is consistent.
    Arguments: Exactly one column name.
    Reported metrics: Column.[Column].MaximumLength, Column.[Column].MinimumLength. Additional metric when a row-level threshold is provided: Column.[Column].ColumnValues.Compliance.
    Rule: Yes. Analyzer: Yes. Row-level results: Yes, when a row-level threshold is provided. Dynamic rules: No. Observations: Yes, by analyzing minimum and maximum length only.

ColumnNamesMatchPattern
    Description: Checks if column names match defined patterns. Useful for governance teams to enforce column name consistency.
    Arguments: A regex for column names.
    Reported metrics: Dataset.*.ColumnNamesPatternMatchRatio.
    Rule: Yes. Analyzer: No. Row-level results: No. Dynamic rules: No. Observations: No.

ColumnValues
    Description: Checks if data is consistent per defined values. This rule supports regular expressions.
    Arguments: Exactly one column name.
    Reported metrics: Column.[Column].Maximum, Column.[Column].Minimum. Additional metric when a row-level threshold is provided: Column.[Column].ColumnValues.Compliance.
    Rule: Yes. Analyzer: No. Row-level results: Yes, when a row-level threshold is provided. Dynamic rules: No. Observations: Yes, by analyzing minimum and maximum values only.

Completeness
    Description: Checks for any blank or NULL values in data.
    Arguments: Exactly one column name.
    Reported metrics: Column.[Column].Completeness.
    Rule: Yes. Analyzer: Yes. Row-level results: Yes. Dynamic rules: Yes. Observations: Yes.

CustomSql
    Description: Customers can implement almost any type of data quality check in SQL.
    Arguments: A SQL statement; (optional) a row-level threshold.
    Reported metrics: Dataset.*.CustomSQL. Additional metric when a row-level threshold is provided: Dataset.*.CustomSQL.Compliance.
    Rule: Yes. Analyzer: Yes. Row-level results: Yes, when a row-level threshold is provided. Dynamic rules: Yes. Observations: No.

DataFreshness
    Description: Checks if data is fresh.
    Arguments: Exactly one column name.
    Reported metrics: Column.[Column].DataFreshness.Compliance.
    Rule: Yes. Analyzer: No. Row-level results: Yes. Dynamic rules: No. Observations: No.

DatasetMatch
    Description: Compares two datasets and identifies if they are in sync.
    Arguments: Name of a reference dataset; a column mapping; (optional) columns to check for matches.
    Reported metrics: Dataset.[ReferenceDatasetAlias].DatasetMatch.
    Rule: Yes. Analyzer: No. Row-level results: Yes. Dynamic rules: Yes. Observations: No.

DistinctValuesCount
    Description: Checks for duplicate values.
    Arguments: Exactly one column name.
    Reported metrics: Column.[Column].DistinctValuesCount.
    Rule: Yes. Analyzer: Yes. Row-level results: Yes. Dynamic rules: Yes. Observations: Yes.

DetectAnomalies
    Description: Checks for anomalies in another rule type's reported metrics.
    Arguments: A rule type.
    Reported metrics: Metric(s) reported by the rule type argument.
    Rule: Yes. Analyzer: No. Row-level results: No. Dynamic rules: No. Observations: No.

Entropy
    Description: Checks the entropy of the data.
    Arguments: Exactly one column name.
    Reported metrics: Column.[Column].Entropy.
    Rule: Yes. Analyzer: Yes. Row-level results: No. Dynamic rules: Yes. Observations: No.

IsComplete
    Description: Checks if 100% of the data is complete.
    Arguments: Exactly one column name.
    Reported metrics: Column.[Column].Completeness.
    Rule: Yes. Analyzer: No. Row-level results: Yes. Dynamic rules: No. Observations: No.

IsPrimaryKey
    Description: Checks if a column is a primary key (not NULL and unique).
    Arguments: One or more column names.
    Reported metrics: For a single column: Column.[Column].Uniqueness. For multiple columns: Multicolumn.[CommaDelimitedColumns].Uniqueness.
    Rule: Yes. Analyzer: No. Row-level results: Yes. Dynamic rules: No. Observations: No.

IsUnique
    Description: Checks if 100% of the data is unique.
    Arguments: Exactly one column name.
    Reported metrics: Column.[Column].Uniqueness.
    Rule: Yes. Analyzer: No. Row-level results: Yes. Dynamic rules: No. Observations: No.

Mean
    Description: Checks if the mean matches the set threshold.
    Arguments: Exactly one column name.
    Reported metrics: Column.[Column].Mean.
    Rule: Yes. Analyzer: Yes. Row-level results: Yes. Dynamic rules: Yes. Observations: No.

ReferentialIntegrity
    Description: Checks if two datasets have referential integrity.
    Arguments: One or more column names from the primary dataset; one or more column names from the reference dataset.
    Reported metrics: Column.[ReferenceDatasetAlias].ReferentialIntegrity.
    Rule: Yes. Analyzer: No. Row-level results: Yes. Dynamic rules: Yes. Observations: No.

RowCount
    Description: Checks if record counts match a threshold.
    Arguments: None.
    Reported metrics: Dataset.*.RowCount.
    Rule: Yes. Analyzer: Yes. Row-level results: No. Dynamic rules: Yes. Observations: Yes.

RowCountMatch
    Description: Checks if record counts between two datasets match.
    Arguments: Reference dataset alias.
    Reported metrics: Dataset.[ReferenceDatasetAlias].RowCountMatch.
    Rule: Yes. Analyzer: No. Row-level results: No. Dynamic rules: Yes. Observations: No.

StandardDeviation
    Description: Checks if the standard deviation matches the threshold.
    Arguments: Exactly one column name.
    Reported metrics: Column.[Column].StandardDeviation.
    Rule: Yes. Analyzer: Yes. Row-level results: Yes. Dynamic rules: Yes. Observations: No.

SchemaMatch
    Description: Checks if the schemas of two datasets match.
    Arguments: Reference dataset alias.
    Reported metrics: Dataset.[ReferenceDatasetAlias].SchemaMatch.
    Rule: Yes. Analyzer: No. Row-level results: No. Dynamic rules: Yes. Observations: No.

Sum
    Description: Checks if the sum matches a set threshold.
    Arguments: Exactly one column name.
    Reported metrics: Column.[Column].Sum.
    Rule: Yes. Analyzer: Yes. Row-level results: No. Dynamic rules: Yes. Observations: No.

Uniqueness
    Description: Checks if the uniqueness of the dataset matches the threshold.
    Arguments: Exactly one column name.
    Reported metrics: Column.[Column].Uniqueness.
    Rule: Yes. Analyzer: Yes. Row-level results: Yes. Dynamic rules: Yes. Observations: No.

UniqueValueRatio
    Description: Checks if the unique value ratio matches the threshold.
    Arguments: Exactly one column name.
    Reported metrics: Column.[Column].UniqueValueRatio.
    Rule: Yes. Analyzer: Yes. Row-level results: Yes. Dynamic rules: Yes. Observations: No.

AggregateMatch

Checks the ratio of two column aggregations against a given expression. This rule type works on multiple datasets. The two column aggregations are evaluated, and a ratio is produced by dividing the result of the first column aggregation by the result of the second column aggregation. The ratio is checked against the provided expression to produce a Boolean response.
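For example (with illustrative numbers), if sum(amount) evaluates to 950 and sum(total_amount) evaluates to 1000, then AggregateMatch "sum(amount)" "sum(total_amount)" >= 0.9 computes the ratio 950 / 1000 = 0.95, which satisfies the expression, so the rule passes.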

Syntax

Column aggregation

<AGG_OPERATION> (<OPTIONAL_REFERENCE_ALIAS>.<COL_NAME>)
  • AGG_OPERATION – The operation to use for the aggregation. Currently, sum and avg are supported.

    Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short

  • OPTIONAL_REFERENCE_ALIAS – This parameter needs to be provided if the column is from a reference dataset and not the primary dataset. If you are using this rule in the AWS Glue Data Catalog, your reference alias must follow the format "<database_name>.<table_name>.<column_name>".

    Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short

  • COL_NAME – The name of the column to aggregate.

    Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short

Example: Average

"avg(rating)"

Example: Sum

"sum(amount)"

Example: Average of column in reference dataset

"avg(reference.rating)"

Rule

AggregateMatch <AGG_EXP_1> <AGG_EXP_2> <EXPRESSION>
  • AGG_EXP_1 – The first column aggregation.

    Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short

  • AGG_EXP_2 – The second column aggregation.

    Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short

  • EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.

Example: Aggregate Match using sum

The following example rule checks whether the sum of the values in the amount column is exactly equal to the sum of the values in the total_amount column.

AggregateMatch "sum(amount)" "sum(total_amount)" = 1.0

Example: Aggregate Match using average

The following example rule checks whether the average of the values in the ratings column is equal to at least 90% of the average of the values in the ratings column in the reference dataset. The reference dataset is provided as an additional data source in the ETL or Data Catalog experience.

In AWS Glue ETL, you can use:

AggregateMatch "avg(ratings)" "avg(reference.ratings)" >= 0.9

In the AWS Glue Data Catalog, you can use:

AggregateMatch "avg(ratings)" "avg(database_name.tablename.ratings)" >= 0.9

Null behavior

The AggregateMatch rule will ignore rows with NULL values in the calculation of the aggregation methods (sum/mean). For example:

+---+-----------+
|id |units      |
+---+-----------+
|100|0          |
|101|null       |
|102|20         |
|103|null       |
|104|40         |
+---+-----------+

The mean of column units will be (0 + 20 + 40) / 3 = 20. Rows 101 and 103 are not considered in this calculation.

ColumnCorrelation

Checks the correlation between two columns against a given expression. AWS Glue Data Quality uses the Pearson correlation coefficient to measure the linear correlation between two columns. The result is a number between -1 and 1 that measures the strength and direction of the relationship.

Syntax

ColumnCorrelation <COL_1_NAME> <COL_2_NAME> <EXPRESSION>
  • COL_1_NAME – The name of the first column that you want to evaluate the data quality rule against.

    Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short

  • COL_2_NAME – The name of the second column that you want to evaluate the data quality rule against.

    Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short

  • EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.

Example: Column correlation

The following example rule checks whether the correlation coefficient between the columns height and weight has a strong positive correlation (a coefficient value greater than 0.8).

ColumnCorrelation "height" "weight" > 0.8

Sample dynamic rules

  • ColumnCorrelation "colA" "colB" between min(last(10)) and max(last(10))

  • ColumnCorrelation "colA" "colB" < avg(last(5)) + std(last(5))

Null behavior

The ColumnCorrelation rule will ignore rows with NULL values in the calculation of the correlation. For example:

+---+-----------+
|id |units      |
+---+-----------+
|100|0          |
|101|null       |
|102|20         |
|103|null       |
|104|40         |
+---+-----------+

Rows 101 and 103 will be ignored, and the ColumnCorrelation will be 1.0.

ColumnCount

Checks the column count of the primary dataset against a given expression. In the expression, you can specify the number of columns or a range of columns using operators like > and <.

Syntax

ColumnCount <EXPRESSION>
  • EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.

Example: Column count numeric check

The following example rule checks whether the column count is within a given range.

ColumnCount between 10 and 20

Sample dynamic rules

  • ColumnCount >= avg(last(10))

  • ColumnCount between min(last(10))-1 and max(last(10))+1

ColumnDataType

Checks the inherent data type of the values in a given column against the provided expected type. Accepts a with threshold expression to check for a subset of the values in the column.

Syntax

ColumnDataType <COL_NAME> = <EXPECTED_TYPE>

ColumnDataType <COL_NAME> = <EXPECTED_TYPE> with threshold <EXPRESSION>
  • COL_NAME – The name of the column that you want to evaluate the data quality rule against.

    Supported column types: String

  • EXPECTED_TYPE – The expected type of the values in the column.

    Supported values: Boolean, Date, Timestamp, Integer, Double, Float, Long

  • EXPRESSION – An optional expression to specify the percentage of values that should be of the expected type.

Example: Column data type integers as strings

The following example rule checks whether the values in the given column, which is of type string, are actually integers.

ColumnDataType "colA" = "INTEGER"

Example: Column data type integers as strings check for a subset of the values

The following example rule checks whether more than 90% of the values in the given column, which is of type string, are actually integers.

ColumnDataType "colA" = "INTEGER" with threshold > 0.9

ColumnExists

Checks whether a column exists.

Syntax

ColumnExists <COL_NAME>
  • COL_NAME – The name of the column that you want to evaluate the data quality rule against.

    Supported column types: Any column type

Example: Column exists

The following example rule checks whether the column named Middle_Name exists.

ColumnExists "Middle_Name"

ColumnLength

Checks whether the length of each row in a column conforms to a given expression.

Syntax

ColumnLength <COL_NAME> <EXPRESSION>
  • COL_NAME – The name of the column that you want to evaluate the data quality rule against.

    Supported column types: String

  • EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.

Example: Column row length

The following example rule checks whether the value in each row in the column named Postal_Code is 5 characters long.

ColumnLength "Postal_Code" = 5

Null behavior

The ColumnLength rule treats NULLs as zero-length strings. For a NULL row:

ColumnLength "Postal_Code" > 4 # this will fail
ColumnLength "Postal_Code" < 6 # this will succeed

The following example compound rule provides a way to explicitly fail NULL values:

(ColumnLength "Postal_Code" > 4) AND (ColumnValues "Postal_Code" != NULL)

ColumnNamesMatchPattern

Checks whether the names of all columns in the primary dataset match the given regular expression.

Syntax

ColumnNamesMatchPattern <PATTERN>
  • PATTERN – The pattern you want to evaluate the data quality rule against.


Example: Column names match pattern

The following example rule checks whether all columns start with the prefix "aws_".

ColumnNamesMatchPattern "aws_.*"

ColumnValues

Runs an expression against the values in a column.

Syntax

ColumnValues <COL_NAME> <EXPRESSION>
  • COL_NAME – The name of the column that you want to evaluate the data quality rule against.

    Supported column types: Any column type

  • EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.

Example: Allowed values

The following example rule checks whether each value in the specified column is in a set of allowed values (including null, empty, and strings with only whitespaces).

ColumnValues "Country" in [ "US", "CA", "UK", NULL, EMPTY, WHITESPACES_ONLY ]

Example: Regular expression

The following example rule checks the values in a column against a regular expression.

ColumnValues "First_Name" matches "[a-zA-Z]*"

Example: Date values

The following example rule checks the values in a date column against a date expression.

ColumnValues "Load_Date" > (now() - 3 days)

Example: Numeric values

The following example rule checks whether the column values match a certain numeric constraint.

ColumnValues "Customer_ID" between 1 and 2000

Null behavior

For all ColumnValues rules (other than != and NOT IN), NULL rows will fail the rule. If the rule fails due to a null value, the failure reason will display the following:

Value: NULL does not meet the constraint requirement!

The following example compound rule provides a way to explicitly allow for NULL values:

(ColumnValues "Age" > 21) OR (ColumnValues "Age" = NULL)

Negated ColumnValues rules using the != and not in syntax will pass for NULL rows. For example:

ColumnValues "Age" != 21
ColumnValues "Age" not in [21, 22, 23]

The following examples provide a way to explicitly fail NULL values:

(ColumnValues "Age" != 21) AND (ColumnValues "Age" != NULL)
ColumnValues "Age" not in [21, 22, 23, NULL]

Completeness

Checks the percentage of complete (non-null) values in a column against a given expression.

Syntax

Completeness <COL_NAME> <EXPRESSION>
  • COL_NAME – The name of the column that you want to evaluate the data quality rule against.

    Supported column types: Any column type

  • EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.

Example: Null value percentage

The following example rules check if more than 95 percent of the values in a column are complete.

Completeness "First_Name" > 0.95

Sample dynamic rules

  • Completeness "colA" between min(last(5)) - 1 and max(last(5)) + 1

  • Completeness "colA" <= avg(last(10))

Null behavior

Note on CSV Data Formats: Blank rows on CSV columns can display multiple behaviors.

  • If a column is of String type, the blank row will be recognized as an empty string and will not fail the Completeness rule.

  • If a column is of another data type like Int, the blank row will be recognized as NULL and will fail the Completeness rule.

CustomSQL

This rule type has been extended to support two use cases:

  • Run a custom SQL statement against a dataset and check the return value against a given expression.

  • Run a custom SQL statement where you specify a column name in your SELECT statement, against which you compare with some condition to get row-level results.

Syntax

CustomSql <SQL_STATEMENT> <EXPRESSION>
  • SQL_STATEMENT – A SQL statement, surrounded by double quotes, that returns a single numeric value.

  • EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.

Example: Custom SQL to retrieve an overall rule outcome

This example rule uses a SQL statement to retrieve the record count for a data set. The rule then checks that the record count is between 10 and 20.

CustomSql "select count(*) from primary" between 10 and 20

Example: Custom SQL to retrieve row-level results

This example rule uses a SQL statement in which you specify a column name in your SELECT statement, against which you compare with some condition to get row-level results. A threshold condition expression defines how many records must pass for the entire rule to pass. Note that a rule may not contain both a condition and a keyword together.

CustomSql "select Name from primary where Age > 18"

or

CustomSql "select Name from primary where Age > 18" with threshold > 3
Important

The primary alias stands in for the name of the data set that you want to evaluate. When you work with visual ETL jobs on the console, primary always represents the DynamicFrame being passed to the EvaluateDataQuality.apply() transform. When you use the AWS Glue Data Catalog to run data quality tasks against a table, primary represents the table.

If you are in AWS Glue Data Catalog, you can also use the actual table names:

CustomSql "select count(*) from database.table" between 10 and 20

You can also join multiple tables to compare different data elements:

CustomSql "select count(*) from database.table inner join database.table2 on id1 = id2" between 10 and 20

In AWS Glue ETL, CustomSQL can identify records that failed the data quality checks. For this to work, you need to return records that are part of the primary table that you are evaluating for data quality. Records that are returned as part of the query are considered successful, and records that are not returned are considered failed.

The following rule ensures that records with age < 100 are identified as successful, and records with age >= 100 are marked as failed.

CustomSql "select id from primary where age < 100"

This CustomSQL rule will pass when more than 50% of the records have age > 10, and it will also identify the records that failed. The records returned by this CustomSQL statement are considered passed, while the ones not returned are considered failed.

CustomSql "select ID, CustomerID from primary where age > 10" with threshold > 0.5

Note: The CustomSQL rule will fail if you return records that are not available in the dataset.

DataFreshness

Checks the freshness of data in a column by evaluating the difference between the current time and the values of a date column. You can specify a time-based expression for this rule type to make sure that column values are up to date.

Syntax

DataFreshness <COL_NAME> <EXPRESSION>
  • COL_NAME – The name of the column that you want to evaluate the data quality rule against.

    Supported column types: Date

  • EXPRESSION – A numeric expression in hours or days. You must specify the time unit in your expression.

Example: Data freshness

The following example rules check for data freshness.

DataFreshness "Order_Date" <= 24 hours

DataFreshness "Order_Date" between 2 days and 5 days

Null behavior

The DataFreshness rules will fail for rows with NULL values. If the rule fails due to null values, the failure reason will display the following:

80.00 % of rows passed the threshold

where the 20% of rows that failed include the rows with NULL values.

The following example compound rule provides a way to explicitly allow for NULL values:

(DataFreshness "Order_Date" <= 24 hours) OR (ColumnValues "Order_Date" = NULL)

DatasetMatch

Checks if the data in the primary dataset matches the data in a reference dataset. The two datasets are joined using the provided key column mappings. Additional column mappings can be provided should you wish to check for the equality of the data in only those columns. Note that for DatasetMatch to work, your join keys must be unique and must not be NULL (they must form a primary key). If these conditions are not met, you will get the error message, "Provided key map not suitable for given data frames". In cases where you can't have join keys that are unique, consider using other rule types, such as AggregateMatch, to match on summary data.

Syntax

DatasetMatch <REFERENCE_DATASET_ALIAS> <KEY_COLUMN_MAPPINGS> <OPTIONAL_MATCH_COLUMN_MAPPINGS> <EXPRESSION>
  • REFERENCE_DATASET_ALIAS – The alias of the reference dataset with which you compare data from the primary dataset.

  • KEY_COLUMN_MAPPINGS – A comma-separated list of column names that form a key in the datasets. If the column names are not the same in both datasets, you must separate them with a ->.

  • OPTIONAL_MATCH_COLUMN_MAPPINGS – You can supply this parameter if you want to check for matching data only in certain columns. It uses the same syntax as the key column mappings. If this parameter is not provided, we will match the data in all remaining columns. The remaining, non-key columns must have the same names in both datasets.

  • EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.

Example: Match datasets using the ID column

The following example rule checks that at least 90% of the primary dataset matches the reference dataset, using the "ID" column to join the two datasets. It compares all columns in this case.

DatasetMatch "reference" "ID" >= 0.9

Example: Match datasets using multiple key columns

In the following example, the primary dataset and the reference dataset have different names for the key columns. ID_1 and ID_2 together form a composite key in the primary dataset. ID_ref1 and ID_ref2 together form a composite key in the reference dataset. In this scenario, you can use the special syntax to supply the column names.

DatasetMatch "reference" "ID_1->ID_ref1,ID_2->ID_ref2" >= 0.9

Example: Match datasets using multiple key columns and check that a specific column matches

This example builds on the previous example. We want to check that only the column containing the amounts matches. This column is named Amount1 in the primary dataset and Amount2 in the reference dataset. You want an exact match.

DatasetMatch "reference" "ID_1->ID_ref1,ID_2->ID_ref2" "Amount1->Amount2" >= 0.9

DistinctValuesCount

Checks the number of distinct values in a column against a given expression.

Syntax

DistinctValuesCount <COL_NAME> <EXPRESSION>
  • COL_NAME – The name of the column that you want to evaluate the data quality rule against.

    Supported column types: Any column type

  • EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.

Example: Distinct column value count

The following example rule checks that the column named State contains more than 3 distinct values.

DistinctValuesCount "State" > 3

Sample dynamic rules

  • DistinctValuesCount "colA" between avg(last(10))-1 and avg(last(10))+1

  • DistinctValuesCount "colA" <= index(last(10),2) + std(last(5))

Entropy

Checks whether the entropy value of a column matches a given expression. Entropy measures the level of information that's contained in a message. Given the probability distribution over values in a column, entropy describes how many bits are required to identify a value.
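As a worked illustration, a column containing the values [a, a, b, b] has probability 0.5 for each distinct value, so its entropy is -(0.5 * log2(0.5) + 0.5 * log2(0.5)) = 1 bit.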

Syntax

Entropy <COL_NAME> <EXPRESSION>
  • COL_NAME – The name of the column that you want to evaluate the data quality rule against.

    Supported column types: Any column type

  • EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.

Example: Column entropy

The following example rule checks that the column named Star_Rating has an entropy value greater than one.

Entropy "Star_Rating" > 1

Sample dynamic rules

  • Entropy "colA" < max(last(10))

  • Entropy "colA" between min(last(10)) and max(last(10))

IsComplete

Checks whether all of the values in a column are complete (non-null).

Syntax

IsComplete <COL_NAME>
  • COL_NAME – The name of the column that you want to evaluate the data quality rule against.

    Supported column types: Any column type

Example: Null values

The following example checks whether all of the values in a column named email are non-null.

IsComplete "email"

Null behavior

Note on CSV data formats: Blank rows in CSV columns can display multiple behaviors.

  • If a column is of String type, the blank row will be recognized as an empty string and will not fail the IsComplete rule.

  • If a column is of another data type, like Int, the blank row will be recognized as NULL and will fail the IsComplete rule.

IsPrimaryKey

Checks whether a column contains a primary key. A column contains a primary key if all of the values in the column are unique and complete (non-null).

Syntax

IsPrimaryKey <COL_NAME>
  • COL_NAME – The name of the column that you want to evaluate the data quality rule against.

    Supported column types: Any column type

Example: Primary key

The following example rule checks whether the column named Customer_ID contains a primary key.

IsPrimaryKey "Customer_ID"

Example: Primary key with multiple columns

Any of the following examples are valid.

IsPrimaryKey "colA" "colB"

IsPrimaryKey "colA" "colB" "colC"

IsPrimaryKey colA "colB" "colC"

IsUnique

Checks whether all of the values in a column are unique, and returns a Boolean value.

Syntax

IsUnique <COL_NAME>
  • COL_NAME – The name of the column that you want to evaluate the data quality rule against.

    Supported column types: Any column type

Example: Unique column values

The following example rule checks whether all of the values in a column named email are unique.

IsUnique "email"

Mean

Checks whether the mean (average) of all the values in a column matches a given expression.

Syntax

Mean <COL_NAME> <EXPRESSION>
  • COL_NAME – The name of the column that you want to evaluate the data quality rule against.

    Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short

  • EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.

Example: Average value

The following example rule checks whether the average of all of the values in a column exceeds a threshold.

Mean "Star_Rating" > 3

Sample dynamic rules

  • Mean "colA" > avg(last(10)) + std(last(2))

  • Mean "colA" between min(last(5)) - 1 and max(last(5)) + 1

Null behavior

The Mean rule will ignore rows with NULL values in the calculation of the mean. For example:

+---+-----------+
|id |units      |
+---+-----------+
|100|0          |
|101|null       |
|102|20         |
|103|null       |
|104|40         |
+---+-----------+

The mean of column units will be (0 + 20 + 40) / 3 = 20. Rows 101 and 103 are not considered in this calculation.

ReferentialIntegrity

Checks to what extent the values of a set of columns in the primary dataset are a subset of the values of a set of columns in a reference dataset.

Syntax

ReferentialIntegrity <PRIMARY_COLS> <REFERENCE_DATASET_COLS> <EXPRESSION>
  • PRIMARY_COLS – A comma-separated list of column names in the primary dataset.

    Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short

  • REFERENCE_DATASET_COLS – This parameter contains two parts separated by a period. The first part is the alias of the reference dataset. The second part is the comma-separated list of column names in the reference dataset enclosed in braces.

    Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short

  • EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.

Example: Check the referential integrity of a zip code column

The following example rule checks that at least 90% of the values in the zipcode column in the primary dataset are present in the zipcode column in the reference dataset.

ReferentialIntegrity "zipcode" "reference.zipcode" >= 0.9

Example: Check the referential integrity of the city and state columns

In the following example, columns containing city and state information exist in the primary dataset and the reference dataset. The names of the columns are different in both datasets. The rule checks if the set of values of the columns in the primary dataset is exactly equal to the set of values of the columns in the reference dataset.

ReferentialIntegrity "city,state" "reference.{ref_city,ref_state}" = 1.0

Sample dynamic rules

  • ReferentialIntegrity "city,state" "reference.{ref_city,ref_state}" > avg(last(10))

  • ReferentialIntegrity "city,state" "reference.{ref_city,ref_state}" between min(last(10)) - 1 and max(last(10)) + 1

RowCount

Checks the row count of a dataset against a given expression. In the expression, you can specify the number of rows or a range of rows using operators like > and <.

Syntax

RowCount <EXPRESSION>
  • EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.

Example: Row count numeric check

The following example rule checks whether the row count is within a given range.

RowCount between 10 and 100

Sample dynamic rules

  • RowCount > avg(last(10)) * 0.8

RowCountMatch

Checks the ratio of the row count of the primary dataset and the row count of a reference dataset against the given expression.

Syntax

RowCountMatch <REFERENCE_DATASET_ALIAS> <EXPRESSION>
  • REFERENCE_DATASET_ALIAS – The alias of the reference dataset against which to compare row counts.


  • EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.

Example: Row count check against a reference dataset

The following example rule checks whether the row count of the primary dataset is at least 90% of the row count of the reference dataset.

RowCountMatch "reference" >= 0.9

StandardDeviation

Checks the standard deviation of all of the values in a column against a given expression.

Syntax

StandardDeviation <COL_NAME> <EXPRESSION>
  • COL_NAME – The name of the column that you want to evaluate the data quality rule against.

    Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short

  • EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.

Example: Standard deviation

The following example rule checks whether the standard deviation of the values in the Star_Rating column is less than a specified value.

StandardDeviation "Star_Rating" < 1.5

Sample dynamic rules

  • StandardDeviation "colA" > avg(last(10)) + 0.1

  • StandardDeviation "colA" between min(last(10)) - 1 and max(last(10)) + 1

Null behavior

The StandardDeviation rule will ignore rows with NULL values in the calculation of standard deviation. For example:

+---+-----------+-----------+
|id |units1     |units2     |
+---+-----------+-----------+
|100|0          |0          |
|101|null       |0          |
|102|20         |20         |
|103|null       |0          |
|104|40         |40         |
+---+-----------+-----------+

The standard deviation of column units1 will not consider rows 101 and 103, and results in 16.33. The standard deviation of column units2 results in 16.

Sum

Checks the sum of all the values in a column against a given expression.

Syntax

Sum <COL_NAME> <EXPRESSION>
  • COL_NAME – The name of the column that you want to evaluate the data quality rule against.

    Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short

  • EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.

Example: Sum

The following example rule checks whether the sum of all of the values in a column exceeds a given threshold.

Sum "transaction_total" > 500000

Sample dynamic rules

  • Sum "ColA" > avg(last(10))

  • Sum "colA" between min(last(10)) - 1 and max(last(10)) + 1

Null behavior

The Sum rule will ignore rows with NULL values in the calculation of sum. For example:

+---+-----------+
|id |units      |
+---+-----------+
|100|0          |
|101|null       |
|102|20         |
|103|null       |
|104|40         |
+---+-----------+

The sum of column units will not consider rows 101 and 103, resulting in (0 + 20 + 40) = 60.

SchemaMatch

Checks if the schema of the primary dataset matches the schema of a reference dataset. The schema check is done column by column. The schema of two columns match if the names are identical and the types are identical. The order of the columns does not matter.
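As an illustration, assuming the reported metric is the fraction of primary columns with a name-and-type match: if nine of ten columns in the primary dataset have a matching column in the reference dataset, the metric is 0.9, so SchemaMatch "reference" >= 0.9 would pass, while SchemaMatch "reference" = 1.0 would fail.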

Syntax

SchemaMatch <REFERENCE_DATASET_ALIAS> <EXPRESSION>
  • REFERENCE_DATASET_ALIAS – The alias of the reference dataset against which to compare schemas.


  • EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.

Example: SchemaMatch

The following example rule checks whether the schema of the primary dataset exactly matches the schema of a reference dataset.

SchemaMatch "reference" = 1.0

Uniqueness

Checks the percentage of unique values in a column against a given expression. Unique values occur exactly once.
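As a worked illustration, under this definition the set [a, a, b] contains one unique value (b) across three rows, so its uniqueness is 1/3 ≈ 0.33 (compare this with the UniqueValueRatio example later in this section, which divides by the number of distinct values instead).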

Syntax

Uniqueness <COL_NAME> <EXPRESSION>
  • COL_NAME – The name of the column that you want to evaluate the data quality rule against.

    Supported column types: Any column type

  • EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.

Example: Uniqueness percentage

The following example rule checks whether the percentage of unique values in a column matches certain numeric criteria.

Uniqueness "email" = 1.0

Sample dynamic rules

  • Uniqueness "colA" between min(last(10)) and max(last(10))

  • Uniqueness "colA" >= avg(last(10))

UniqueValueRatio

Checks the unique value ratio of a column against a given expression. A unique value ratio is the fraction of unique values divided by the number of all distinct values in a column. Unique values occur exactly one time, while distinct values occur at least once.

For example, the set [a, a, b] contains one unique value (b) and two distinct values (a and b). So the unique value ratio of the set is ½ = 0.5.

Syntax

UniqueValueRatio <COL_NAME> <EXPRESSION>
  • COL_NAME – The name of the column that you want to evaluate the data quality rule against.

    Supported column types: Any column type

  • EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.

Example: Unique value ratio

This example checks the unique value ratio of a column against a range of values.

UniqueValueRatio "test_score" between 0 and 0.5

Sample dynamic rules

  • UniqueValueRatio "colA" > avg(last(10))

  • UniqueValueRatio "colA" <= index(last(10),2) + std(last(5))

DetectAnomalies

Detects anomalies for a given Deequ rule. Every execution of a DetectAnomalies rule saves the evaluated value for the given rule. When enough data has been gathered, the anomaly detection algorithm takes all of the historical data for that rule and runs anomaly detection. The DetectAnomalies rule fails when an anomaly is detected. More information about which anomaly was detected can be obtained from Observations.

Syntax

DetectAnomalies <RULE_NAME> <RULE_PARAMETERS>

RULE_NAME – The name of the rule that you want to evaluate and detect anomalies for. Supported rules:

  • "RowCount"

  • "Completeness"

  • "Uniqueness"

  • "Mean"

  • "Sum"

  • "StandardDeviation"

  • "Entropy"

  • "DistinctValuesCount"

  • "UniqueValueRatio"

  • "ColumnLength"

  • "ColumnValues"

  • "ColumnCorrelation"

  • "CustomSql"

RULE_PARAMETERS – Some rules require additional parameters to run. Refer to the given rule's documentation to see the required parameters.

Example: Anomalies for RowCount

For example, if we want to detect RowCount anomalies, we provide RowCount as a rule name.

DetectAnomalies "RowCount"

Example: Anomalies for ColumnLength

For example, if we want to detect ColumnLength anomalies, we provide ColumnLength as a rule name and the column name.

DetectAnomalies "ColumnLength" "id"