Available checks - AWS Glue DataBrew

Available checks

The following table lists references for all available conditions that can be used in your rules. Note that aggregated conditions cannot be combined with non-aggregated conditions in the same rule.

Note

For SDK users, to apply the same rule to multiple columns use the ColumnSelectors attribute of a Rule and specify validated columns using either their names or a regular expression. In this case, you should use implicit CheckExpression. For example, “> :val” to compare values in each of the selected columns with the provided value. DataBrew uses implicit syntax for defining FilterExpression in dynamic datasets. If you want to specify column(s) for each check individually, don't set the ColumnSelectors attribute. Instead, provide an explicit expression. For example, “:col > :val” as a CheckExpression in a Rule.

Condition type Data quality check Additional parameters Comparison type SDK syntax example
Aggregate dataset conditions Number of rows Numeric comparison against custom value

"CheckExpression": "AGG(ROWS_COUNT) > :val", "SubstitutionMap": {":val", "10000"}

Number of columns Numeric comparison against custom value

"CheckExpression": "AGG(COLUMNS_COUNT) == :val", "SubstitutionMap": {":val", "20"}

Duplicate rows

Numeric comparison against custom value

"CheckExpression": "AGG(DUPLICATE_ROWS_COUNT) < :val", "SubstitutionMap": {":val", "100"}

or

"CheckExpression": "AGG(DUPLICATE_ROWS_PERCENTAGE) < :val", "SubstitutionMap": {":val", "5"}

Aggregate column statistics conditions Missing values Numeric comparison against custom value

"CheckExpression": "AGG(MISSING_VALUES_COUNT) < :val", "SubstitutionMap": {":val", "100"}

or

"CheckExpression": "AGG(MISSING_VALUES_PERCENTAGE) < :val", "SubstitutionMap": {":val", "5"}

Duplicate values Numeric comparison against custom value

"CheckExpression": "AGG(DUPLICATE_VALUES_COUNT) < :val", "SubstitutionMap": {":val", "100"}

or

"CheckExpression": "AGG(DUPLICATE_VALUES_PERCENTAGE) < :val", "SubstitutionMap": {":val", "5"}

Valid values

Numeric comparison against custom value

"CheckExpression": "AGG(VALID_VALUES_COUNT) > :val", "SubstitutionMap": {":val", "10000"}

or

"CheckExpression": "AGG(VALID_VALUES_PERCENTAGE) > :val", "SubstitutionMap": {":val", "95"}

Distinct values Numeric comparison against custom value

"CheckExpression": "AGG(DISTINCT_VALUES_COUNT) > :val", "SubstitutionMap": {":val", "1000"}

or

"CheckExpression": "AGG(DISTINCT_VALUES_PERCENTAGE) >= :val", "SubstitutionMap": {":val", "50"}

Unique values

Numeric comparison against custom value

"CheckExpression": "AGG(UNIQUE_VALUES_COUNT) > :val", "SubstitutionMap": {":val", "100"}

or

"CheckExpression": "AGG(UNIQUE_VALUES_PERCENTAGE) > :val", "SubstitutionMap": {":val", "20"}

Outliers

Z-score threshold Numeric comparison against custom value

"CheckExpression": "AGG(Z_SCORE_OUTLIERS_COUNT, :zscore_dev) < :val", "SubstitutionMap": {":zscore_dev": "4", ":val", "100"}

or

"CheckExpression": "AGG(Z_SCORE_OUTLIERS_PERCENTAGE) < :val", "SubstitutionMap": {":val", "5"}

Value distribution statistics

Statistics name (see next table) Numeric comparison against custom value

"CheckExpression": "AGG(<STAT_NAME>) < :val", "SubstitutionMap": {":val", "100"}

or

"CheckExpression": "AGG(<STAT_NAME>, :param) < :val", "SubstitutionMap": {":param": "0.25", :val", "5"}

Note

See next table for possible STAT_NAME values

Numerical statistics

Statistics name (see next table) Numeric comparison against custom value

"CheckExpression": "AGG(<STAT_NAME>) < :val", "SubstitutionMap": {":val", "100"}

or

"CheckExpression": "AGG(<STAT_NAME>, :param) < :val", "SubstitutionMap": {":param": "0.25", :val", "5"}

Note

See next table for possible STAT_NAME values

Non aggregate (accepts threshold) Value is exactly Exact comparison against a list of values

"CheckExpression": ":col IN :list", "SubstitutionMap": {":col": "`size`", ":list": "[\"S\",\"M\",\"L\",\"XL\"]"}

Value is not exactly Value shouldn't exactly match any value from a list

"CheckExpression": ":col NOT IN :list", "SubstitutionMap": {":col": "`domain`", ":list": "[\"GOV\",\"ORG\"]"}

String values String comparison against custom value or other string column

"CheckExpression": ":col STARTS_WITH :val", "SubstitutionMap": {":col": "`url`", ":val": "http"}

or

"CheckExpression": ":col1 contains :col2", "SubstitutionMap": {":col1": "`url`", ":col2": "`company_name`"}

Numeric values Numeric comparison against custom value or other numeric column

"CheckExpression": ":col IS_BETWEEN :val1 and :val2", "SubstitutionMap": {":col": "`APY`", ":val1": "0", ":val2": "10"}

or

"CheckExpression": ":col1 <= :col2", "SubstitutionMap": {":col1": "`bank_rate`", ":col2": "`fed_rate`"}

Value string length Numeric comparison against custom value or other numeric column

"CheckExpression": "length(:col) IS_BETWEEN :val1 and :val2", "SubstitutionMap": {":col": "`identifier`", ":val1": "8", ":val2": "12"}

or

"CheckExpression": "length(:col1) <= :col2", "SubstitutionMap": {":col1": "`name`", ":col2": "`max_name_len`"}

Numeric comparisons

DataBrew supports the following operations for numeric comparison: Is equals (==), Is not equals (!=), Less than (<), Less than equals (<=), Greater than (>), Greater than equals (>=) and Is between (is_between :val1 and :val2).

String comparisons

The following string comparisons are supported: Starts with, Doesn’t start with, Ends with, Doesn’t end with, Contains, Doesn’t contain, Is equals, Is not equals, Matches, Doesn’t match.

The following table displays available statistics that you can use for Value distribution statistics and Numerical statistics:

Data quality check Statistics name Additional parameters SDK syntax
Value distribution statistics Min "CheckExpression": "AGG(MAX) < :val", "SubstitutionMap": {":val", "100"}
Max "CheckExpression": "AGG(MIN) > :val", "SubstitutionMap": {":val", "0"}
Median "CheckExpression": "AGG(MEDIAN) >= :val", "SubstitutionMap": {":val", "50"}
Mean "CheckExpression": "AGG(MEAN) <= :val", "SubstitutionMap": {":val", "10"}
Mode "CheckExpression": "AGG(MODE) > :val", "SubstitutionMap": {":val", "0"}
Standard deviation "CheckExpression": "AGG(STANDARD_DEVIATION) > :val", "SubstitutionMap": {":val", "0"}
Entropy "CheckExpression": "AGG(ENTROPY) > :val", "SubstitutionMap": {":val", "0"}
Numerical statistics Sum "CheckExpression": "AGG(SUM) > :val", "SubstitutionMap": {":val", "0"}
Kurtosis "CheckExpression": "AGG(KURTOSIS) > :val", "SubstitutionMap": {":val", "0"}
Skewness "CheckExpression": "AGG(SKEWNESS) > :val", "SubstitutionMap": {":val", "0"}
Variance "CheckExpression": "AGG(VARIANCE) > :val", "SubstitutionMap": {":val", "0"}
Absolute deviation "CheckExpression": "AGG(MEDIAN_ABSOLUTE_DEVIATION) > :val", "SubstitutionMap": {":val", "0"}
Quantile Quantile: one of '0.25', '0.5', '0.75' "CheckExpression": "AGG(QUANTILE, :pct) > :val", "SubstitutionMap": {":pct": "0.25", ":val", "0"}