Data Quality Definition Language (DQDL) reference
Data Quality Definition Language (DQDL) is a domain-specific language that you use to define rules for AWS Glue Data Quality.
This guide introduces key DQDL concepts to help you understand the language. It also provides a reference for DQDL rule types with syntax and examples. Before you use this guide, we recommend that you have familiarity with AWS Glue Data Quality. For more information, see AWS Glue Data Quality.
Note
DynamicRules are only supported in AWS Glue ETL.
Contents
- DQDL syntax
- DQDL rule type reference
- AggregateMatch
- ColumnCorrelation
- ColumnCount
- ColumnDataType
- ColumnExists
- ColumnLength
- ColumnNamesMatchPattern
- ColumnValues
- Completeness
- CustomSQL
- DataFreshness
- DatasetMatch
- DistinctValuesCount
- Entropy
- IsComplete
- IsPrimaryKey
- IsUnique
- Mean
- ReferentialIntegrity
- RowCount
- RowCountMatch
- StandardDeviation
- Sum
- SchemaMatch
- Uniqueness
- UniqueValueRatio
- DetectAnomalies
DQDL syntax
A DQDL document is case sensitive and contains a ruleset, which groups individual data quality rules together. To construct a
ruleset, you must create a list named Rules
(capitalized), delimited by a pair of
square brackets. The list should contain one or more comma-separated DQDL rules like the
following example.
Rules = [ IsComplete "order-id", IsUnique "order-id" ]
Rule structure
The structure of a DQDL rule depends on the rule type. However, DQDL rules generally fit the following format.
<RuleType> <Parameter> <Parameter> <Expression>
RuleType is the case-sensitive name of the rule type that you want to configure, for example IsComplete, IsUnique, or CustomSql. Rule parameters differ for each rule type. For a complete reference of DQDL rule types and their parameters, see DQDL rule type reference.
Composite rules
DQDL supports the following logical operators that you can use to combine rules. These rules are called Composite Rules.
- and

  The logical and operator results in true if and only if all of the rules that it connects are true. Otherwise, the combined rule results in false. Each rule that you connect with the and operator must be surrounded by parentheses.

  The following example uses the and operator to combine two DQDL rules.

  (IsComplete "id") and (IsUnique "id")

- or

  The logical or operator results in true if one or more of the rules that it connects are true. Each rule that you connect with the or operator must be surrounded by parentheses.

  The following example uses the or operator to combine two DQDL rules.

  (RowCount "id" > 100) or (IsPrimaryKey "id")
You can use the same operator to connect multiple rules, so the following rule combination is allowed.
(Mean "Star_Rating" > 3) and (Mean "Order_Total" > 500) and (IsComplete "Order_Id")
However, you can't combine the logical operators into a single expression. For example, the following combination is not allowed.
(Mean "Star_Rating" > 3) and (Mean "Order_Total" > 500) or (IsComplete "Order_Id")
How Composite rules work
By default, Composite Rules are evaluated as individual rules across the entire dataset or table, and then the results are combined. In other words, each rule is applied to the entire column first, and then the operator is applied to the outcomes. This default behavior is explained below with an example:
# Dataset
+------+------+
|myCol1|myCol2|
+------+------+
|     2|     1|
|     0|     3|
+------+------+

# Overall outcome
+----------------------------------------------------------+-------+
|Rule                                                      |Outcome|
+----------------------------------------------------------+-------+
|(ColumnValues "myCol1" > 1) OR (ColumnValues "myCol2" > 2)|Failed |
+----------------------------------------------------------+-------+
In the above example, AWS Glue Data Quality first evaluates (ColumnValues "myCol1" > 1) across the entire column, which results in a failure. Then it evaluates (ColumnValues "myCol2" > 2), which also fails. The combination of both results is noted as FAILED.
However, if you prefer SQL-like behavior, where the entire row is evaluated, you have to explicitly set the compositeRuleEvaluation.method option in additionalOptions, as shown in the code snippet below.
object GlueApp {
  val datasource = glueContext.getCatalogSource(
    database="<db>",
    tableName="<table>",
    transformationContext="datasource"
  ).getDynamicFrame()

  val ruleset = """
    Rules = [
      (ColumnValues "age" >= 26) OR (ColumnLength "name" >= 4)
    ]
  """

  val dq_results = EvaluateDataQuality.processRows(
    frame=datasource,
    ruleset=ruleset,
    additionalOptions=JsonOptions("""
      {
        "compositeRuleEvaluation.method":"ROW"
      }
    """)
  )
}
Once set, the composite rules behave as a single rule evaluating the entire row. The following example illustrates this behavior.
# Row Level outcome
+------+------+------------------------------------------------------------+---------------------------+
|myCol1|myCol2|DataQualityRulesPass                                        |DataQualityEvaluationResult|
+------+------+------------------------------------------------------------+---------------------------+
|2     |1     |[(ColumnValues "myCol1" > 1) OR (ColumnValues "myCol2" > 2)]|Passed                     |
|0     |3     |[(ColumnValues "myCol1" > 1) OR (ColumnValues "myCol2" > 2)]|Passed                     |
+------+------+------------------------------------------------------------+---------------------------+
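To make the difference between the two evaluation modes concrete, here is a minimal Python sketch (illustrative only, not AWS Glue code) that applies the example composite rule to the two-row dataset above, first column-by-column and then row-by-row:

```python
# Two rows from the example dataset above.
rows = [{"myCol1": 2, "myCol2": 1}, {"myCol1": 0, "myCol2": 3}]

def rule_a(row):
    return row["myCol1"] > 1   # ColumnValues "myCol1" > 1

def rule_b(row):
    return row["myCol2"] > 2   # ColumnValues "myCol2" > 2

# Default behavior: each rule must hold for every value in its column,
# and only then is the OR applied to the two overall outcomes.
default_outcome = all(rule_a(r) for r in rows) or all(rule_b(r) for r in rows)

# ROW behavior: the OR is applied per row; every row must satisfy the
# combined predicate.
row_outcome = all(rule_a(r) or rule_b(r) for r in rows)

print(default_outcome)  # False: each rule fails somewhere in its column
print(row_outcome)      # True: every individual row passes the OR
```

This reproduces the outcomes shown in the two tables: the default evaluation fails while the row-level evaluation passes.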
Some rules can't be supported by this feature because their overall outcome relies on thresholds or ratios. They are listed below.

Rules relying on ratios:

- Completeness
- DatasetMatch
- ReferentialIntegrity
- Uniqueness

Rules dependent on thresholds: when the following rules include with threshold, they are not supported. However, rules that do not involve with threshold remain supported.

- ColumnDataType
- ColumnValues
- CustomSQL
Expressions
Note
The expressions not and != are not supported in the AWS Glue Data Catalog.
If a rule type doesn't produce a Boolean response, you must provide an expression as a parameter in order to create a Boolean response. For example, the following rule checks the mean (average) of all the values in a column against an expression to return a true or false result.
Mean "colA" between 80 and 100
Some rule types such as IsUnique
and IsComplete
already return
a Boolean response.
The following table lists expressions that you can use in DQDL rules.
Supported DQDL expressions

| Expression | Description |
|---|---|
| = x | Resolves to true if the rule type response is equal to x. |
| != x | Resolves to true if the rule type response is not equal to x. |
| > x | Resolves to true if the rule type response is greater than x. |
| < x | Resolves to true if the rule type response is less than x. |
| >= x | Resolves to true if the rule type response is greater than or equal to x. |
| <= x | Resolves to true if the rule type response is less than or equal to x. |
| between x and y | Resolves to true if the rule type response falls in a specified range (exclusive). Only use this expression type for numeric and date types. |
| not between x and y | Resolves to true if the rule type response does not fall in a specified range (inclusive). Only use this expression type for numeric and date types. |
| in [a, b, c, ...] | Resolves to true if the rule type response is in the specified set. |
| not in [a, b, c, ...] | Resolves to true if the rule type response is not in the specified set. |
| matches /ab+c/i | Resolves to true if the rule type response matches a regular expression. |
| not matches /ab+c/i | Resolves to true if the rule type response does not match a regular expression. |
| now() | Works only with the ColumnValues rule type to create a date expression. |
| matches/in [...]/not matches/not in [...] with threshold | Specifies the percentage of values that match the rule conditions. Works only with the ColumnValues, ColumnDataType, and CustomSQL rule types. |
Keywords for NULL, EMPTY and WHITESPACES_ONLY

Note

Keywords for NULL, EMPTY and WHITESPACES_ONLY are not supported in the AWS Glue Data Catalog.

If you want to validate whether a string column has a null value, an empty string, or a string with only whitespaces, you can use the following keywords:

- NULL / null – This keyword resolves to true for a null value in a string column.

  ColumnValues "colA" != NULL with threshold > 0.5 would return true if more than 50% of your data does not have null values.

  (ColumnValues "colA" = NULL) or (ColumnLength "colA" > 5) would return true for all rows that either have a null value or have a length greater than 5. Note that this requires the "compositeRuleEvaluation.method" = "ROW" option.

- EMPTY / empty – This keyword resolves to true for an empty string ("") value in a string column. Some data formats transform nulls in a string column to empty strings. This keyword helps filter out empty strings in your data.

  (ColumnValues "colA" = EMPTY) or (ColumnValues "colA" in ["a", "b"]) would return true if a row is either empty, "a", or "b". Note that this requires the "compositeRuleEvaluation.method" = "ROW" option.

- WHITESPACES_ONLY / whitespaces_only – This keyword resolves to true for a string containing only whitespaces (" ") in a string column.

  ColumnValues "colA" not in ["a", "b", WHITESPACES_ONLY] would return true if a row is neither "a" nor "b" nor just whitespaces.

Supported rules:
For a numeric or date based expression, if you want to validate whether a column has a null value, you can use the following keywords.

- NULL / null – This keyword resolves to true for a null value in the column.

  ColumnValues "colA" in [NULL, "2023-01-01"] would return true if the dates in your column are either 2023-01-01 or null.

  (ColumnValues "colA" = NULL) or (ColumnValues "colA" between 1 and 9) would return true for all rows that either have a null value or have values between 1 and 9. Note that this requires the "compositeRuleEvaluation.method" = "ROW" option.

Supported rules:
Dynamic rules
You can author dynamic rules to compare current metrics produced by your rules with their historical values. These historical comparisons are enabled by the last() operator in expressions. For example, the rule RowCount > last() will succeed when the number of rows in the current run is greater than the most recent prior row count for the same dataset. last() takes an optional natural number argument describing how many prior metrics to consider; last(k) where k >= 1 will reference the last k metrics.

- If no data points are available, last(k) will return the default value 0.0.
- If fewer than k metrics are available, last(k) will return all prior metrics.

To form valid expressions with last(k) where k > 1, you must use an aggregation function that reduces multiple historical results to a single number. For example, RowCount > avg(last(5)) will check whether the current dataset's row count is strictly greater than the average of the last five row counts for the same dataset. RowCount > last(5) will produce an error because the current dataset row count can't be meaningfully compared to a list.
Supported aggregation functions:

- avg
- median
- max
- min
- sum
- std (standard deviation)
- abs (absolute value)
- index(last(k), i) – selects the i-th most recent value out of the last k. i is zero-indexed, so index(last(3), 0) will return the most recent data point, and index(last(3), 3) will result in an error because there are only three data points and it attempts to index the fourth most recent one.
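The semantics of last(k) and index(last(k), i) described above can be modeled in a few lines of Python. This is an illustrative sketch only, not the DQDL engine; the metric history values are hypothetical.

```python
# Hypothetical prior RowCount metrics, most recent first.
history = [120, 100, 90, 80, 70]

def last(k, hist=history):
    """Model of DQDL last(k): the k most recent prior metrics."""
    if not hist:
        return [0.0]    # no data points available: default value 0.0
    return hist[:k]     # fewer than k available: all prior metrics

def index(values, i):
    """Model of index(last(k), i): i is zero-indexed; i >= len(values) errors."""
    return values[i]

current_row_count = 130

# RowCount > avg(last(5)): reduce the history to one number before comparing.
passed = current_row_count > sum(last(5)) / len(last(5))
print(passed)             # True: 130 > (120+100+90+80+70)/5 = 92
print(index(last(3), 0))  # 120, the most recent data point
```

As in DQDL, comparing the raw list last(5) directly against a number would be meaningless; the aggregation function is what turns the history into a single comparable value.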
Sample expressions
ColumnCorrelation
ColumnCorrelation "colA" "colB" < avg(last(10))
DistinctValuesCount
DistinctValuesCount "colA" between min(last(10))-1 and max(last(10))+1
Most rule types with numeric conditions or thresholds support dynamic rules; see the provided table, Analyzers and Rules, to determine whether dynamic rules are supported for your rule type.
Analyzers
Note
Analyzers are not supported in AWS Glue Data Catalog.
DQDL rules use functions called analyzers to gather information about your
data. This information is employed by a rule’s Boolean expression to determine whether the rule should succeed or fail.
For example, the RowCount rule RowCount > 5
will use a row count analyzer to discover the number of rows in your dataset,
and compare that count with the expression > 5
to check whether more than five rows exist in the current dataset.
Sometimes, instead of authoring rules, you might want to create analyzers that generate statistics, which can then be used to detect anomalies. Analyzers differ from rules in the following ways.
| Characteristic | Analyzers | Rules |
|---|---|---|
| Part of ruleset | Yes | Yes |
| Generates statistics | Yes | Yes |
| Generates observations | Yes | Yes |
| Can evaluate and assert a condition | No | Yes |
| Can configure actions, such as stopping the job on failure or continuing processing | No | Yes |
Analyzers can independently exist without rules, so you can quickly configure them and progressively build data quality rules.
Some rule types can be input in the Analyzers
block of your ruleset to run
the rules required for analyzers and gather information without applying checks for any condition. Some analyzers
aren't associated with rules and can only be input in the Analyzers
block.
The following table indicates whether each item is supported as a rule or
a standalone analyzer, along with additional details for each rule type.
Example Ruleset with Analyzer
The following ruleset uses:

- a dynamic rule to check whether the dataset is growing over its trailing average for the last three job runs
- a DistinctValuesCount analyzer to record the number of distinct values in the dataset's Name column
- a ColumnLength analyzer to track minimum and maximum Name size over time
Analyzer metric results can be viewed in the Data Quality tab for your job run.
Rules = [
    RowCount > avg(last(3))
],
Analyzers = [
    DistinctValuesCount "Name",
    ColumnLength "Name"
]
Comments
You can use the '#' character to add a comment to your DQDL document. Anything after the '#' character and until the end of the line is ignored by DQDL.
Rules = [
    # More items should generally mean a higher price, so correlation should be positive
    ColumnCorrelation "price" "num_items" > 0
]
DQDL rule type reference
This section provides a reference for each rule type that AWS Glue Data Quality supports.
Note
DQDL doesn't currently support nested or list-type column data.
Bracketed values in the table below will be replaced with the information provided in rule arguments. Rules typically require an additional expression argument.
| Ruletype | Description | Arguments | Reported metrics | Supported as rule? | Supported as analyzer? | Returns row-level results? | Dynamic rule support? | Generates observations? |
|---|---|---|---|---|---|---|---|---|
| AggregateMatch | Checks if two datasets match by comparing summary metrics like total sales amount. Useful for financial institutions to compare if all data is ingested from source systems. | One or more aggregations | Depends on whether the first and second aggregation column names match | Yes | No | No | No | No |
| AllStatistics | Standalone analyzer to gather multiple metrics for the provided column, or all columns in a dataset. | A single column name, or "AllColumns" | Metrics for columns of all types, with additional metrics for string-valued and numeric-valued columns | No | Yes | No | No | No |
| ColumnCorrelation | Checks how well two columns are correlated. | Exactly two column names | Multicolumn.[Column1],[Column2].ColumnCorrelation | Yes | Yes | No | Yes | No |
| ColumnCount | Checks if any columns are dropped. | None | Dataset.*.ColumnCount | Yes | No | No | Yes | Yes |
| ColumnDataType | Checks if a column is compliant with a datatype. | Exactly one column name | Column.[Column].ColumnDataType.Compliance | Yes | No | No | Yes, in row-level threshold expression | No |
| ColumnExists | Checks if columns exist in a dataset. This allows customers building self-service data platforms to ensure certain columns are made available. | Exactly one column name | N/A | Yes | No | No | No | No |
| ColumnLength | Checks if the length of data is consistent. | Exactly one column name | Additional metric when row-level threshold provided | Yes | Yes | Yes, when row-level threshold provided | No | Yes. Only generates observations by analyzing minimum and maximum length |
| ColumnNamesMatchPattern | Checks if column names match defined patterns. Useful for governance teams to enforce column name consistency. | A regex for column names | Dataset.*.ColumnNamesPatternMatchRatio | Yes | No | No | No | No |
| ColumnValues | Checks if data is consistent per defined values. This rule supports regular expressions. | Exactly one column name | Additional metric when row-level threshold provided | Yes | No | Yes, when row-level threshold provided | No | Yes. Only generates observations by analyzing minimum and maximum values |
| Completeness | Checks for any blank or NULL values in data. | Exactly one column name | | Yes | Yes | Yes | Yes | Yes |
| CustomSql | Customers can implement almost any type of data quality check in SQL. | A SQL statement; (optional) a row-level threshold | Additional metric when row-level threshold provided | Yes | Yes | Yes, when row-level threshold provided | Yes | No |
| DataFreshness | Checks if data is fresh. | Exactly one column name | Column.[Column].DataFreshness.Compliance | Yes | No | Yes | No | No |
| DatasetMatch | Compares two datasets and identifies if they are in sync. | Name of a reference dataset; a column mapping; (optional) columns to check for matches | Dataset.[ReferenceDatasetAlias].DatasetMatch | Yes | No | Yes | Yes | No |
| DistinctValuesCount | Checks for duplicate values. | Exactly one column name | Column.[Column].DistinctValuesCount | Yes | Yes | Yes | Yes | Yes |
| DetectAnomalies | Checks for anomalies in another rule type's reported metrics. | A rule type | Metric(s) reported by the rule type argument | Yes | No | No | No | No |
| Entropy | Checks the entropy of the data. | Exactly one column name | Column.[Column].Entropy | Yes | Yes | No | Yes | No |
| IsComplete | Checks if 100% of the data is complete. | Exactly one column name | Column.[Column].Completeness | Yes | No | Yes | No | No |
| IsPrimaryKey | Checks if a column is a primary key (not NULL and unique). | Exactly one column name | Differs for a single column and multiple columns | Yes | No | Yes | No | No |
| IsUnique | Checks if 100% of the data is unique. | Exactly one column name | Column.[Column].Uniqueness | Yes | No | Yes | No | No |
| Mean | Checks if the mean matches the set threshold. | Exactly one column name | Column.[Column].Mean | Yes | Yes | Yes | Yes | No |
| ReferentialIntegrity | Checks if two datasets have referential integrity. | One or more column names from the dataset; one or more column names from the reference dataset | Column.[ReferenceDatasetAlias].ReferentialIntegrity | Yes | No | Yes | Yes | No |
| RowCount | Checks if record counts match a threshold. | None | Dataset.*.RowCount | Yes | Yes | No | Yes | Yes |
| RowCountMatch | Checks if record counts between two datasets match. | Reference dataset alias | Dataset.[ReferenceDatasetAlias].RowCountMatch | Yes | No | No | Yes | No |
| StandardDeviation | Checks if the standard deviation matches the threshold. | Exactly one column name | Column.[Column].StandardDeviation | Yes | Yes | Yes | Yes | No |
| SchemaMatch | Checks if the schemas of two datasets match. | Reference dataset alias | Dataset.[ReferenceDatasetAlias].SchemaMatch | Yes | No | No | Yes | No |
| Sum | Checks if the sum matches a set threshold. | Exactly one column name | Column.[Column].Sum | Yes | Yes | No | Yes | No |
| Uniqueness | Checks if the uniqueness of the dataset matches a threshold. | Exactly one column name | Column.[Column].Uniqueness | Yes | Yes | Yes | Yes | No |
| UniqueValueRatio | Checks if the unique value ratio matches a threshold. | Exactly one column name | Column.[Column].UniqueValueRatio | Yes | Yes | Yes | Yes | No |
Topics
- AggregateMatch
- ColumnCorrelation
- ColumnCount
- ColumnDataType
- ColumnExists
- ColumnLength
- ColumnNamesMatchPattern
- ColumnValues
- Completeness
- CustomSQL
- DataFreshness
- DatasetMatch
- DistinctValuesCount
- Entropy
- IsComplete
- IsPrimaryKey
- IsUnique
- Mean
- ReferentialIntegrity
- RowCount
- RowCountMatch
- StandardDeviation
- Sum
- SchemaMatch
- Uniqueness
- UniqueValueRatio
- DetectAnomalies
AggregateMatch
Checks the ratio of two column aggregations against a given expression. This rule type works on multiple datasets. The two column aggregations are evaluated, and a ratio is produced by dividing the result of the first column aggregation by the result of the second. The ratio is checked against the provided expression to produce a Boolean response.
Syntax
Column aggregation

<AGG_OPERATION> (<OPTIONAL_REFERENCE_ALIAS>.<COL_NAME>)

- AGG_OPERATION – The operation to use for the aggregation. Currently, sum and avg are supported.

- OPTIONAL_REFERENCE_ALIAS – This parameter needs to be provided if the column is from a reference dataset and not the primary dataset. If you are using this rule in the AWS Glue Data Catalog, your reference alias must follow the format "<database_name>.<table_name>.<column_name>".

- COL_NAME – The name of the column to aggregate.

  Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short
Example: Average
"avg(rating)"
Example: Sum
"sum(amount)"
Example: Average of column in reference dataset
"avg(reference.rating)"
Rule

AggregateMatch <AGG_EXP_1> <AGG_EXP_2> <EXPRESSION>

- AGG_EXP_1 – The first column aggregation.

  Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short

- AGG_EXP_2 – The second column aggregation.

  Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short

- EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.
Example: Aggregate Match using sum
The following example rule checks whether the sum of the values in the amount
column is exactly equal to the sum of the values in the total_amount
column.
AggregateMatch "sum(amount)" "sum(total_amount)" = 1.0
Example: Aggregate Match using average
The following example rule checks whether the average of the values in the ratings
column is equal to at least 90% of the average of the values in the ratings
column in the reference
dataset. The reference dataset is provided as an additional data source in the ETL or Data Catalog experience.
In AWS Glue ETL, you can use:
AggregateMatch "avg(ratings)" "avg(reference.ratings)" >= 0.9
In the AWS Glue Data Catalog, you can use:
AggregateMatch "avg(ratings)" "avg(database_name.tablename.ratings)" >= 0.9
Null behavior
The AggregateMatch rule will ignore rows with NULL values in the calculation of the aggregation methods (sum/mean). For example:

+---+-----------+
|id |units      |
+---+-----------+
|100|0          |
|101|null       |
|102|20         |
|103|null       |
|104|40         |
+---+-----------+

The mean of column units will be (0 + 20 + 40) / 3 = 20. Rows 101 and 103 are not considered in this calculation.
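The null handling above can be sketched in a few lines of Python (illustrative only, not AWS Glue internals): rows with nulls are dropped before the aggregation is computed.

```python
# The units column from the example above; None stands in for null.
units = [0, None, 20, None, 40]

# Drop null rows first, then aggregate what remains.
non_null = [u for u in units if u is not None]
mean = sum(non_null) / len(non_null)
print(mean)  # 20.0 -> (0 + 20 + 40) / 3
```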
ColumnCorrelation
Checks the correlation between two columns against a given expression. AWS Glue Data Quality uses the Pearson correlation coefficient to measure the linear correlation between two columns. The result is a number between -1 and 1 that measures the strength and direction of the relationship.
Syntax
ColumnCorrelation <COL_1_NAME> <COL_2_NAME> <EXPRESSION>
COL_1_NAME – The name of the first column that you want to evaluate the data quality rule against.
Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short
COL_2_NAME – The name of the second column that you want to evaluate the data quality rule against.
Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short
EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.
Example: Column correlation
The following example rule checks whether the correlation coefficient between the
columns height
and weight
has a strong positive correlation (a
coefficient value greater than 0.8).
ColumnCorrelation "height" "weight" > 0.8
Sample dynamic rules
ColumnCorrelation "colA" "colB" between min(last(10)) and max(last(10))
ColumnCorrelation "colA" "colB" < avg(last(5)) + std(last(5))
Null behavior
The ColumnCorrelation rule will ignore rows with NULL values in the calculation of the correlation. For example:

+---+-----------+
|id |units      |
+---+-----------+
|100|0          |
|101|null       |
|102|20         |
|103|null       |
|104|40         |
+---+-----------+

Rows 101 and 103 will be ignored, and the ColumnCorrelation will be 1.0.
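The following Python sketch reproduces that result with a plain Pearson correlation after dropping null rows. It is illustrative only, not AWS Glue internals.

```python
import math

# (id, units) pairs from the example above; None stands in for null.
pairs = [(100, 0), (101, None), (102, 20), (103, None), (104, 40)]
xs, ys = zip(*[(x, y) for x, y in pairs if y is not None])

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson(xs, ys))  # 1.0: ids 100, 102, 104 vs units 0, 20, 40 are perfectly linear
```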
ColumnCount
Checks the column count of the primary dataset against a given expression. In the expression, you can specify the number of columns or a range of columns using operators like > and <.
Syntax
ColumnCount <EXPRESSION>
EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.
Example: Column count numeric check
The following example rule checks whether the column count is within a given range.
ColumnCount between 10 and 20
Sample dynamic rules
ColumnCount >= avg(last(10))
ColumnCount between min(last(10))-1 and max(last(10))+1
ColumnDataType
Checks the inherent data type of the values in a given column against the provided expected type. Accepts a with threshold expression to check for a subset of the values in the column.
Syntax

ColumnDataType <COL_NAME> = <EXPECTED_TYPE>

ColumnDataType <COL_NAME> = <EXPECTED_TYPE> with threshold <EXPRESSION>

- COL_NAME – The name of the column that you want to evaluate the data quality rule against.

  Supported column types: String

- EXPECTED_TYPE – The expected type of the values in the column.

  Supported values: Boolean, Date, Timestamp, Integer, Double, Float, Long

- EXPRESSION – An optional expression to specify the percentage of values that should be of the expected type.
Example: Column data type integers as strings
The following example rule checks whether the values in the given column, which is of type string, are actually integers.
ColumnDataType "colA" = "INTEGER"
Example: Column data type integers as strings check for a subset of the values
The following example rule checks whether more than 90% of the values in the given column, which is of type string, are actually integers.
ColumnDataType "colA" = "INTEGER" with threshold > 0.9
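What the threshold form asserts can be sketched in Python (illustrative only; the real rule runs inside AWS Glue Data Quality, and the sample values are hypothetical): the fraction of string values that parse as integers must exceed the threshold.

```python
# Hypothetical string column: nine parseable integers and one non-integer.
col_a = ["1", "2", "3", "44", "5", "6", "7", "8", "9", "x"]

def is_integer(s):
    """True if the string parses as an integer."""
    try:
        int(s)
        return True
    except ValueError:
        return False

compliance = sum(is_integer(v) for v in col_a) / len(col_a)
print(compliance)        # 0.9
print(compliance > 0.9)  # False: exactly 0.9 does not satisfy "> 0.9"
```

Note that the comparison is strict, so a column that is exactly 90% compliant fails the rule ColumnDataType "colA" = "INTEGER" with threshold > 0.9.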
ColumnExists
Checks whether a column exists.
Syntax
ColumnExists <COL_NAME>
COL_NAME – The name of the column that you want to evaluate the data quality rule against.
Supported column types: Any column type
Example: Column exists
The following example rule checks whether the column named Middle_Name
exists.
ColumnExists "Middle_Name"
ColumnLength
Checks whether the length of each row in a column conforms to a given expression.
Syntax
ColumnLength <COL_NAME> <EXPRESSION>
COL_NAME – The name of the column that you want to evaluate the data quality rule against.
Supported column types: String
EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.
Example: Column row length
The following example rule checks whether the value in each row in the column named Postal_Code is 5 characters long.
ColumnLength "Postal_Code" = 5
Null behavior
The ColumnLength rule treats NULL values as 0-length strings. For a NULL row:
ColumnLength "Postal_Code" > 4 # this will fail
ColumnLength "Postal_Code" < 6 # this will succeed
The following example compound rule provides a way to explicitly fail NULL
values:
(ColumnLength "Postal_Code" > 4) AND (ColumnValues != NULL)
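The NULL-as-zero-length behavior can be modeled in Python (illustrative only, not AWS Glue internals):

```python
# One valid postal code and one NULL row (None stands in for NULL).
postal_codes = ["12345", None]

def length(value):
    """ColumnLength's view of a value: NULL counts as a 0-length string."""
    return 0 if value is None else len(value)

print(all(length(v) > 4 for v in postal_codes))  # False: the NULL row has length 0
print(all(length(v) < 6 for v in postal_codes))  # True: 0 < 6 and 5 < 6
```

This is why ColumnLength "Postal_Code" > 4 fails on NULL rows while ColumnLength "Postal_Code" < 6 succeeds, as shown above.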
ColumnNamesMatchPattern
Checks whether the names of all columns in the primary dataset match the given regular expression.
Syntax
ColumnNamesMatchPattern <PATTERN>
PATTERN – The regular expression pattern that you want to evaluate the column names against.
Example: Column names match pattern
The following example rule checks whether all column names start with the prefix "aws_".
ColumnNamesMatchPattern "aws_.*"
ColumnValues
Runs an expression against the values in a column.
Syntax
ColumnValues <COL_NAME> <EXPRESSION>
COL_NAME – The name of the column that you want to evaluate the data quality rule against.
Supported column types: Any column type
EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.
Example: Allowed values
The following example rule checks whether each value in the specified column is in a set of allowed values (including null, empty, and strings with only whitespaces).
ColumnValues "Country" in [ "US", "CA", "UK", NULL, EMPTY, WHITESPACES_ONLY ]
Example: Regular expression
The following example rule checks the values in a column against a regular expression.
ColumnValues "First_Name" matches "[a-zA-Z]*"
Example: Date values
The following example rule checks the values in a date column against a date expression.
ColumnValues "Load_Date" > (now() - 3 days)
Example: Numeric values
The following example rule checks whether the column values match a certain numeric constraint.
ColumnValues "Customer_ID" between 1 and 2000
Null behavior
For all ColumnValues rules (other than != and NOT IN), NULL rows will fail the rule. If the rule fails due to a null value, the failure reason will display the following:
Value: NULL does not meet the constraint requirement!
The following example compound rule provides a way to explicitly allow for NULL values:
(ColumnValues "Age" > 21) OR (ColumnValues "Age" = NULL)
Negated ColumnValues rules that use the != and NOT IN syntax will pass for NULL rows. For example:
ColumnValues "Age" != 21
ColumnValues "Age" not in [21, 22, 23]
The following examples provide a way to explicitly fail NULL values:
(ColumnValues "Age" != 21) AND (ColumnValues "Age" != NULL)
ColumnValues "Age" not in [21, 22, 23, NULL]
Completeness
Checks the percentage of complete (non-null) values in a column against a given expression.
Syntax
Completeness
<COL_NAME>
<EXPRESSION>
COL_NAME – The name of the column that you want to evaluate the data quality rule against.
Supported column types: Any column type
EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.
Example: Null value percentage
The following example rule checks whether more than 95 percent of the values in a column are complete.
Completeness "First_Name" > 0.95
Sample dynamic rules
Completeness "colA" between min(last(5)) - 1 and max(last(5)) + 1
Completeness "colA" <= avg(last(10))
Null behavior
Note on CSV data formats: Blank rows in CSV columns can display multiple behaviors.
- If a column is of String type, the blank row will be recognized as an empty string and will not fail the Completeness rule.
- If a column is of another data type, like Int, the blank row will be recognized as NULL and will fail the Completeness rule.
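The ratio that Completeness evaluates can be modeled with a minimal Python sketch (not the AWS Glue implementation); None stands in for NULL:

```python
def completeness(values):
    """Fraction of non-null values in a column; None models NULL."""
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)

print(completeness(["a", None, "b", "c"]))  # 0.75
```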
CustomSQL
This rule type has been extended to support two use cases:
- Run a custom SQL statement against a dataset and check the return value against a given expression.
- Run a custom SQL statement where you specify a column name in your SELECT statement, against which you compare with some condition to get row-level results.
Syntax
CustomSql
<SQL_STATEMENT>
<EXPRESSION>
SQL_STATEMENT – A SQL statement that returns a single numeric value, surrounded by double quotes.
EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.
Example: Custom SQL to retrieve an overall rule outcome
This example rule uses a SQL statement to retrieve the record count for a data set. The rule then checks that the record count is between 10 and 20.
CustomSql "select count(*) from primary" between 10 and 20
Example: Custom SQL to retrieve row-level results
This example rule uses a SQL statement in which you specify a column name in your SELECT statement, against which you compare with some condition to get row-level results. A threshold condition expression defines how many records must fail for the entire rule to fail. Note that a rule may not contain both a condition and a threshold keyword together.
CustomSql "select Name from primary where Age > 18"
or
CustomSql "select Name from primary where Age > 18" with threshold > 3
Important
The primary alias stands in for the name of the dataset that you want to evaluate. When you work with visual ETL jobs on the console, primary always represents the DynamicFrame being passed to the EvaluateDataQuality.apply() transform. When you use the AWS Glue Data Catalog to run data quality tasks against a table, primary represents the table.
If you are using the AWS Glue Data Catalog, you can also use the actual table names:
CustomSql "select count(*) from database.table" between 10 and 20
You can also join multiple tables to compare different data elements:
CustomSql "select count(*) from database.table inner join database.table2 on id1 = id2" between 10 and 20
In AWS Glue ETL, CustomSql can identify records that failed the data quality checks. For this to work, the records you return must be part of the primary table that you are evaluating. Records that are returned by the query are considered successful, and records that are not returned are considered failed.
The following rule ensures that records with age < 100 are identified as successful, and records with age >= 100 are marked as failed.
CustomSql "select id from primary where age < 100"
This CustomSql rule will pass when more than 50% of the records have age > 10, and it will also identify the records that failed. The records returned by this CustomSql query are considered passed, while the ones not returned are considered failed.
CustomSql "select ID, CustomerID from primary where age > 10" with threshold > 0.5
Note: The CustomSql rule will fail if you return records that are not available in the dataset.
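The overall-outcome form of CustomSql can be emulated with an in-memory SQLite table, as a rough sketch of the semantics (the table contents and the "between 1 and 3" threshold here are hypothetical, not from the AWS documentation):

```python
import sqlite3

# Emulates: CustomSql "select count(*) from primary where age > 18" between 1 and 3
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "primary" (name TEXT, age INTEGER)')
conn.executemany('INSERT INTO "primary" VALUES (?, ?)',
                 [("Ana", 25), ("Ben", 17), ("Cal", 40), ("Dee", 12)])

(count,) = conn.execute('SELECT count(*) FROM "primary" WHERE age > 18').fetchone()
rule_passes = 1 <= count <= 3  # the "between 1 and 3" expression
print(count, rule_passes)      # 2 True
```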
DataFreshness
Checks the freshness of data in a column by evaluating the difference between the current time and the values of a date column. You can specify a time-based expression for this rule type to make sure that column values are up to date.
Syntax
DataFreshness
<COL_NAME>
<EXPRESSION>
COL_NAME – The name of the column that you want to evaluate the data quality rule against.
Supported column types: Date
EXPRESSION – A numeric expression in hours or days. You must specify the time unit in your expression.
Example: Data freshness
The following example rules check for data freshness.
DataFreshness "Order_Date" <= 24 hours
DataFreshness "Order_Date" between 2 days and 5 days
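The freshness check can be sketched with the standard datetime module (illustrative only, not the AWS Glue implementation; None models a NULL date, which fails the rule as described below):

```python
from datetime import datetime, timedelta

def data_freshness_ok(dates, max_age):
    """True only if every value is within max_age of now; None (NULL) fails."""
    now = datetime.now()
    return all(d is not None and now - d <= max_age for d in dates)

recent = datetime.now() - timedelta(hours=2)
stale = datetime.now() - timedelta(days=3)
print(data_freshness_ok([recent], timedelta(hours=24)))         # True
print(data_freshness_ok([recent, stale], timedelta(hours=24)))  # False
```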
Null behavior
The DataFreshness rules will fail for rows with NULL values. If the rule fails due to a null value, the failure reason will display the following:
80.00 % of rows passed the threshold
where the 20% of rows that failed include the rows with NULL.
The following example compound rule provides a way to explicitly allow for NULL values:
(DataFreshness "Order_Date" <= 24 hours) OR (ColumnValues "Order_Date" = NULL)
DatasetMatch
Checks if the data in the primary dataset matches the data in a reference dataset. The two datasets are joined using the provided key column mappings. Additional column mappings can be provided if you want to check the equality of the data in only those columns. Note that for DatasetMatch to work, your join keys must be unique and must not be NULL (they must form a primary key). If you don't satisfy these conditions, you will get the error message, "Provided key map not suitable for given data frames". In cases where you can't have join keys that are unique, consider using other rule types, such as AggregateMatch, to match on summary data.
Syntax
DatasetMatch
<REFERENCE_DATASET_ALIAS>
<JOIN CONDITION WITH MAPPING>
<OPTIONAL_MATCH_COLUMN_MAPPINGS>
<EXPRESSION>
REFERENCE_DATASET_ALIAS – The alias of the reference dataset with which you compare data from the primary dataset.
KEY_COLUMN_MAPPINGS – A comma-separated list of column names that form a key in the datasets. If the column names are not the same in both datasets, you must separate them with a ->
OPTIONAL_MATCH_COLUMN_MAPPINGS – You can supply this parameter if you want to check for matching data only in certain columns. It uses the same syntax as the key column mappings. If this parameter is not provided, the data in all remaining columns is matched. The remaining, non-key columns must have the same names in both datasets.
EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.
Example: Match datasets using the ID column
The following example rule checks that at least 90% of the primary dataset matches the reference dataset, using the "ID" column to join the two datasets. It compares all columns in this case.
DatasetMatch "reference" "ID" >= 0.9
Example: Match datasets using multiple key columns
In the following example, the primary dataset and the reference dataset have different names for the key columns. ID_1 and ID_2 together form a composite key in the primary dataset. ID_ref1 and ID_ref2 together form a composite key in the reference dataset. In this scenario, you can use the special syntax to supply the column names.
DatasetMatch "reference" "ID_1->ID_ref1,ID_2->ID_ref2" >= 0.9
Example: Match datasets using multiple key columns and check that a specific column matches
This example builds on the previous one. We want to check that only the column containing the amounts matches. This column is named Amount1 in the primary dataset and Amount2 in the reference dataset, and at least 90% of the values must match.
DatasetMatch "reference" "ID_1->ID_ref1,ID_2->ID_ref2" "Amount1->Amount2" >= 0.9
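The join-and-compare semantics can be sketched in plain Python (a simplified model, not the AWS Glue implementation); rows are dicts, and the mapping dicts mirror the -> syntax:

```python
def dataset_match(primary, reference, key_map, match_map):
    """Fraction of primary rows whose mapped columns equal the joined reference row.
    key_map and match_map map primary column -> reference column; join keys are
    assumed unique and non-null, as DQDL requires."""
    ref_index = {tuple(row[c] for c in key_map.values()): row for row in reference}
    matched = 0
    for row in primary:
        ref_row = ref_index.get(tuple(row[c] for c in key_map))
        if ref_row is not None and all(row[p] == ref_row[r] for p, r in match_map.items()):
            matched += 1
    return matched / len(primary)

primary = [{"ID_1": 1, "ID_2": "a", "Amount1": 100},
           {"ID_1": 2, "ID_2": "b", "Amount1": 250}]
reference = [{"ID_ref1": 1, "ID_ref2": "a", "Amount2": 100},
             {"ID_ref1": 2, "ID_ref2": "b", "Amount2": 999}]
print(dataset_match(primary, reference,
                    {"ID_1": "ID_ref1", "ID_2": "ID_ref2"},
                    {"Amount1": "Amount2"}))  # 0.5
```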
DistinctValuesCount
Checks the number of distinct values in a column against a given expression.
Syntax
DistinctValuesCount
<COL_NAME>
<EXPRESSION>
COL_NAME – The name of the column that you want to evaluate the data quality rule against.
Supported column types: Any column type
EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.
Example: Distinct column value count
The following example rule checks that the column named State contains more than 3 distinct values.
DistinctValuesCount "State" > 3
Sample dynamic rules
DistinctValuesCount "colA" between avg(last(10))-1 and avg(last(10))+1
DistinctValuesCount "colA" <= index(last(10),2) + std(last(5))
Entropy
Checks whether the entropy value of a column matches a given expression. Entropy measures the level of information that's contained in a message. Given the probability distribution over values in a column, entropy describes how many bits are required to identify a value.
Syntax
Entropy
<COL_NAME>
<EXPRESSION>
COL_NAME – The name of the column that you want to evaluate the data quality rule against.
Supported column types: Any column type
EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.
Example: Column entropy
The following example rule checks that the column named Star_Rating has an entropy value greater than one.
Entropy "Star_Rating" > 1
Sample dynamic rules
Entropy "colA" < max(last(10))
Entropy "colA" between min(last(10)) and max(last(10))
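Shannon entropy over a column's empirical value distribution can be computed with the standard library; this sketch uses the usual bits-based formula, which may differ in detail from AWS Glue's exact implementation:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a column's empirical value distribution."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Four values, each appearing twice: probability 0.25 each, so entropy is 2 bits.
print(entropy([1, 1, 2, 2, 3, 3, 4, 4]))  # 2.0
```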
IsComplete
Checks whether all of the values in a column are complete (non-null).
Syntax
IsComplete
<COL_NAME>
COL_NAME – The name of the column that you want to evaluate the data quality rule against.
Supported column types: Any column type
Example: Null values
The following example checks whether all of the values in a column named email are non-null.
IsComplete "email"
Null behavior
Note on CSV data formats: Blank rows in CSV columns can display multiple behaviors.
- If a column is of String type, the blank row will be recognized as an empty string and will not fail the Completeness rule.
- If a column is of another data type, like Int, the blank row will be recognized as NULL and will fail the Completeness rule.
IsPrimaryKey
Checks whether a column contains a primary key. A column contains a primary key if all of the values in the column are unique and complete (non-null).
Syntax
IsPrimaryKey
<COL_NAME>
COL_NAME – The name of the column that you want to evaluate the data quality rule against.
Supported column types: Any column type
Example: Primary key
The following example rule checks whether the column named Customer_ID contains a primary key.
IsPrimaryKey "Customer_ID"
Example: Primary key with multiple columns. Any of the following examples are valid.
IsPrimaryKey "colA" "colB"
IsPrimaryKey "colA" "colB" "colC"
IsPrimaryKey colA "colB" "colC"
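The unique-and-non-null check can be sketched in Python (illustrative only; each argument is one column, zipped into composite keys, with None modeling NULL):

```python
def is_primary_key(*columns):
    """True if the combined key values are all non-null and unique.
    Each column is a list; columns are zipped into composite keys."""
    keys = list(zip(*columns))
    if any(None in key for key in keys):
        return False            # NULL values disqualify a primary key
    return len(set(keys)) == len(keys)  # all composite keys must be unique

print(is_primary_key([1, 2, 3]))           # True
print(is_primary_key([1, 1], ["a", "a"]))  # False (duplicate composite key)
```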
IsUnique
Checks whether all of the values in a column are unique, and returns a Boolean value.
Syntax
IsUnique
<COL_NAME>
COL_NAME – The name of the column that you want to evaluate the data quality rule against.
Supported column types: Any column type
Example: Unique column values
The following example rule checks whether all of the values in a column named email are unique.
IsUnique "email"
Mean
Checks whether the mean (average) of all the values in a column matches a given expression.
Syntax
Mean
<COL_NAME>
<EXPRESSION>
COL_NAME – The name of the column that you want to evaluate the data quality rule against.
Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short
EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.
Example: Average value
The following example rule checks whether the average of all of the values in a column exceeds a threshold.
Mean "Star_Rating" > 3
Sample dynamic rules
Mean "colA" > avg(last(10)) + std(last(2))
Mean "colA" between min(last(5)) - 1 and max(last(5)) + 1
Null behavior
The Mean rule will ignore rows with NULL values in the calculation of the mean. For example:
+---+-----+
|id |units|
+---+-----+
|100|0    |
|101|null |
|102|20   |
|103|null |
|104|40   |
+---+-----+
The mean of the units column will be (0 + 20 + 40) / 3 = 20. Rows 101 and 103 are not considered in this calculation.
ReferentialIntegrity
Checks to what extent the values of a set of columns in the primary dataset are a subset of the values of a set of columns in a reference dataset.
Syntax
ReferentialIntegrity
<PRIMARY_COLS>
<REFERENCE_DATASET_COLS>
<EXPRESSION>
PRIMARY_COLS – A comma-separated list of column names in the primary dataset.
Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short
REFERENCE_DATASET_COLS – This parameter contains two parts separated by a period. The first part is the alias of the reference dataset. The second part is the comma-separated list of column names in the reference dataset enclosed in braces.
Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short
EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.
Example: Check the referential integrity of a zip code column
The following example rule checks that at least 90% of the values in the zipcode column in the primary dataset are present in the zipcode column in the reference dataset.
ReferentialIntegrity "zipcode" "reference.zipcode" >= 0.9
Example: Check the referential integrity of the city and state columns
In the following example, columns containing city and state information exist in the primary dataset and the reference dataset. The names of the columns are different in both datasets. The rule checks if the set of values of the columns in the primary dataset is exactly equal to the set of values of the columns in the reference dataset.
ReferentialIntegrity "city,state" "reference.{ref_city,ref_state}" = 1.0
Sample dynamic rules
ReferentialIntegrity "city,state" "reference.{ref_city,ref_state}" > avg(last(10))
ReferentialIntegrity "city,state" "reference.{ref_city,ref_state}" between min(last(10)) - 1 and max(last(10)) + 1
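The subset ratio that ReferentialIntegrity evaluates can be sketched as follows (an illustrative model, not the AWS Glue implementation; each row is a tuple of the selected column values):

```python
def referential_integrity(primary_keys, reference_keys):
    """Fraction of primary key tuples that appear among the reference key tuples."""
    ref = set(reference_keys)
    return sum(k in ref for k in primary_keys) / len(primary_keys)

primary = [("Seattle", "WA"), ("Austin", "TX"), ("Gotham", "NJ")]
reference = [("Seattle", "WA"), ("Austin", "TX"), ("Portland", "OR")]
print(round(referential_integrity(primary, reference), 2))  # 0.67
```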
RowCount
Checks the row count of a dataset against a given expression. In the expression, you can specify the number of rows or a range of rows using operators like > and <.
Syntax
RowCount
<EXPRESSION>
EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.
Example: Row count numeric check
The following example rule checks whether the row count is within a given range.
RowCount between 10 and 100
Sample dynamic rules
RowCount > avg(last(10)) * 0.8
RowCountMatch
Checks the ratio of the row count of the primary dataset and the row count of a reference dataset against the given expression.
Syntax
RowCountMatch
<REFERENCE_DATASET_ALIAS>
<EXPRESSION>
REFERENCE_DATASET_ALIAS – The alias of the reference dataset against which to compare row counts.
Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short
EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.
Example: Row count check against a reference dataset
The following example rule checks whether the row count of the primary dataset is at least 90% of the row count of the reference dataset.
RowCountMatch "reference" >= 0.9
StandardDeviation
Checks the standard deviation of all of the values in a column against a given expression.
Syntax
StandardDeviation
<COL_NAME>
<EXPRESSION>
COL_NAME – The name of the column that you want to evaluate the data quality rule against.
Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short
EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.
Example: Standard deviation
The following example rule checks whether the standard deviation of the values in a column named Star_Rating is less than a specified value.
StandardDeviation "Star_Rating" < 1.5
Sample dynamic rules
StandardDeviation "colA" > avg(last(10)) + 0.1
StandardDeviation "colA" between min(last(10)) - 1 and max(last(10)) + 1
Null behavior
The StandardDeviation rule will ignore rows with NULL values in the calculation of the standard deviation. For example:
+---+------+------+
|id |units1|units2|
+---+------+------+
|100|0     |0     |
|101|null  |0     |
|102|20    |20    |
|103|null  |0     |
|104|40    |40    |
+---+------+------+
The standard deviation of the units1 column will not consider rows 101 and 103, resulting in 16.33. The standard deviation of the units2 column will result in 16.
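The figures above correspond to a population standard deviation that skips nulls, which can be sketched in Python (illustrative only; None models NULL):

```python
import math

def std_ignoring_nulls(values):
    """Population standard deviation, skipping None (NULL) rows."""
    present = [v for v in values if v is not None]
    m = sum(present) / len(present)
    return math.sqrt(sum((v - m) ** 2 for v in present) / len(present))

print(round(std_ignoring_nulls([0, None, 20, None, 40]), 2))  # 16.33 (units1)
print(std_ignoring_nulls([0, 0, 20, 0, 40]))                  # 16.0  (units2)
```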
Sum
Checks the sum of all the values in a column against a given expression.
Syntax
Sum
<COL_NAME>
<EXPRESSION>
COL_NAME – The name of the column that you want to evaluate the data quality rule against.
Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short
EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.
Example: Sum
The following example rule checks whether the sum of all of the values in a column exceeds a given threshold.
Sum "transaction_total" > 500000
Sample dynamic rules
Sum "ColA" > avg(last(10))
Sum "colA" between min(last(10)) - 1 and max(last(10)) + 1
Null behavior
The Sum rule will ignore rows with NULL values in the calculation of the sum. For example:
+---+-----+
|id |units|
+---+-----+
|100|0    |
|101|null |
|102|20   |
|103|null |
|104|40   |
+---+-----+
The sum of the units column will not consider rows 101 and 103, resulting in (0 + 20 + 40) = 60.
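The same null-skipping sum can be sketched in one line of Python (illustrative only; None models NULL):

```python
def sum_ignoring_nulls(values):
    """Sum of a column, skipping None (NULL) rows, as the Sum rule does."""
    return sum(v for v in values if v is not None)

# The units column from the example: rows 101 and 103 are skipped.
print(sum_ignoring_nulls([0, None, 20, None, 40]))  # 60
```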
SchemaMatch
Checks if the schema of the primary dataset matches the schema of a reference dataset. The schema check is done column by column. The schema of two columns match if the names are identical and the types are identical. The order of the columns does not matter.
Syntax
SchemaMatch
<REFERENCE_DATASET_ALIAS>
<EXPRESSION>
REFERENCE_DATASET_ALIAS – The alias of the reference dataset against which to compare schemas.
Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short
EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.
Example: SchemaMatch
The following example rule checks whether the schema of the primary dataset exactly matches the schema of a reference dataset.
SchemaMatch "reference" = 1.0
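A simplified reading of the column-by-column comparison can be sketched in Python (illustrative only; schemas are modeled as name-to-type dicts, which makes column order irrelevant, as the rule requires):

```python
def schema_match(primary_schema, reference_schema):
    """Fraction of column names whose declared type matches in both schemas."""
    all_columns = set(primary_schema) | set(reference_schema)
    matched = sum(
        col in primary_schema
        and col in reference_schema
        and primary_schema[col] == reference_schema[col]
        for col in all_columns
    )
    return matched / len(all_columns)

primary = {"id": "Long", "name": "String"}
reference = {"name": "String", "id": "Long"}  # same columns, different order
print(schema_match(primary, reference))  # 1.0
```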
Uniqueness
Checks the percentage of unique values in a column against a given expression. Unique values occur exactly once.
Syntax
Uniqueness
<COL_NAME>
<EXPRESSION>
COL_NAME – The name of the column that you want to evaluate the data quality rule against.
Supported column types: Any column type
EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.
Example: Uniqueness percentage
The following example rule checks whether the percentage of unique values in a column matches certain numeric criteria.
Uniqueness "email" = 1.0
Sample dynamic rules
Uniqueness "colA" between min(last(10)) and max(last(10))
Uniqueness "colA" >= avg(last(10))
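One plausible reading of the percentage this rule evaluates (the fraction of rows whose value occurs exactly once) can be sketched in Python; this is an illustrative model, not the AWS Glue implementation:

```python
from collections import Counter

def uniqueness(values):
    """Fraction of rows whose value occurs exactly once in the column."""
    counts = Counter(values)
    return sum(counts[v] == 1 for v in values) / len(values)

print(uniqueness(["a@x.com", "b@x.com", "c@x.com"]))         # 1.0
print(round(uniqueness(["a@x.com", "a@x.com", "b@x.com"]), 2))  # 0.33
```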
UniqueValueRatio
Checks the unique value ratio of a column against a given expression. A unique value ratio is the fraction of unique values divided by the number of all distinct values in a column. Unique values occur exactly one time, while distinct values occur at least once.
For example, the set [a, a, b] contains one unique value (b) and two distinct values (a and b). So the unique value ratio of the set is ½ = 0.5.
Syntax
UniqueValueRatio
<COL_NAME>
<EXPRESSION>
COL_NAME – The name of the column that you want to evaluate the data quality rule against.
Supported column types: Any column type
EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.
Example: Unique value ratio
This example checks the unique value ratio of a column against a range of values.
UniqueValueRatio "test_score" between 0 and 0.5
Sample dynamic rules
UniqueValueRatio "colA" > avg(last(10))
UniqueValueRatio "colA" <= index(last(10),2) + std(last(5))
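The [a, a, b] arithmetic described above can be sketched in Python (illustrative only, not the AWS Glue implementation):

```python
from collections import Counter

def unique_value_ratio(values):
    """Unique values (occur exactly once) divided by distinct values (occur at least once)."""
    counts = Counter(values)
    unique = sum(c == 1 for c in counts.values())
    distinct = len(counts)
    return unique / distinct

print(unique_value_ratio(["a", "a", "b"]))  # 0.5
```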
DetectAnomalies
Detects anomalies for a given Deequ rule. Every execution of the DetectAnomalies rule results in saving the evaluated value for the given rule. When enough data has been gathered, the anomaly detection algorithm takes all of the historical data for that rule and runs anomaly detection. The DetectAnomalies rule fails when an anomaly is detected. More information about which anomaly was detected can be obtained from Observations.
Syntax
DetectAnomalies
<RULE_NAME>
<RULE_PARAMETERS>
RULE_NAME – The name of the rule that you want to evaluate and detect anomalies for.
Supported rules:
"RowCount"
"Completeness"
"Uniqueness"
"Mean"
"Sum"
"StandardDeviation"
"Entropy"
"DistinctValuesCount"
"UniqueValueRatio"
"ColumnLength"
"ColumnValues"
"ColumnCorrelation"
"CustomSql"
RULE_PARAMETERS – Some rules require additional parameters to run. Refer to the given rule's documentation to see the required parameters.
Example: Anomalies for RowCount
For example, if you want to detect RowCount anomalies, provide RowCount as the rule name.
DetectAnomalies "RowCount"
Example: Anomalies for ColumnLength
For example, if you want to detect ColumnLength anomalies, provide ColumnLength as the rule name along with the column name.
DetectAnomalies "ColumnLength" "id"