Data Quality Definition Language (DQDL) reference

Data Quality Definition Language (DQDL) is a domain-specific language that you use to define rules for AWS Glue Data Quality.

This guide introduces key DQDL concepts to help you understand the language. It also provides a reference for DQDL rule types with syntax and examples. Before you use this guide, we recommend that you have familiarity with AWS Glue Data Quality. For more information, see AWS Glue Data Quality.

Note

Dynamic rules are supported only in AWS Glue ETL.

DQDL syntax

A DQDL document is case sensitive and contains a ruleset, which groups individual data quality rules together. To construct a ruleset, you must create a list named Rules (capitalized), delimited by a pair of square brackets. The list should contain one or more comma-separated DQDL rules like the following example.

Rules = [ IsComplete "order-id", IsUnique "order-id" ]

Rule structure

The structure of a DQDL rule depends on the rule type. However, DQDL rules generally fit the following format.

<RuleType> <Parameter> <Parameter> <Expression>

RuleType is the case-sensitive name of the rule type that you want to configure. For example, IsComplete, IsUnique, or CustomSql. Rule parameters differ for each rule type. For a complete reference of DQDL rule types and their parameters, see DQDL rule type reference.
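For example, the following rule (taken from the rule type reference later in this guide) fits this format: Mean is the rule type, "Star_Rating" is a column parameter, and > 3 is the expression.

Mean "Star_Rating" > 3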

Composite rules

DQDL supports the following logical operators that you can use to combine rules. These rules are called Composite Rules.

and

The logical and operator results in true if and only if the rules that it connects are true. Otherwise, the combined rule results in false. Each rule that you connect with the and operator must be surrounded by parentheses.

The following example uses the and operator to combine two DQDL rules.

(IsComplete "id") and (IsUnique "id")
or

The logical or operator results in true if and only if one or more of the rules that it connects are true. Each rule that you connect with the or operator must be surrounded by parentheses.

The following example uses the or operator to combine two DQDL rules.

(RowCount "id" > 100) or (IsPrimaryKey "id")

You can use the same operator to connect multiple rules, so the following rule combination is allowed.

(Mean "Star_Rating" > 3) and (Mean "Order_Total" > 500) and (IsComplete "Order_Id")

However, you can't combine the logical operators into a single expression. For example, the following combination is not allowed.

(Mean "Star_Rating" > 3) and (Mean "Order_Total" > 500) or (IsComplete "Order_Id")

How Composite rules work

By default, Composite Rules are evaluated as individual rules across the entire dataset or table, and then the results are combined. In other words, each rule is first evaluated over the entire column, and then the operator is applied. The following example explains this default behavior:

# Dataset
+------+------+
|myCol1|myCol2|
+------+------+
|     2|     1|
|     0|     3|
+------+------+

# Overall outcome
+----------------------------------------------------------+-------+
|Rule                                                      |Outcome|
+----------------------------------------------------------+-------+
|(ColumnValues "myCol1" > 1) OR (ColumnValues "myCol2" > 2)|Failed |
+----------------------------------------------------------+-------+

In the above example, AWS Glue Data Quality first evaluates (ColumnValues "myCol1" > 1), which results in a failure. It then evaluates (ColumnValues "myCol2" > 2), which also fails. The combination of both results is noted as FAILED.

However, if you prefer SQL-like behavior, where the entire row must be evaluated, you have to explicitly set the compositeRuleEvaluation.method option, as shown in additionalOptions in the code snippet below.

object GlueApp {
    val datasource = glueContext.getCatalogSource(
        database = "<db>",
        tableName = "<table>",
        transformationContext = "datasource"
    ).getDynamicFrame()

    val ruleset = """
        Rules = [
            (ColumnValues "age" >= 26) OR (ColumnLength "name" >= 4)
        ]
    """

    val dq_results = EvaluateDataQuality.processRows(
        frame = datasource,
        ruleset = ruleset,
        additionalOptions = JsonOptions("""
            {
                "compositeRuleEvaluation.method": "ROW"
            }
        """)
    )
}

Once set, the composite rules will behave as a single rule evaluating the entire row. The following example illustrates this behavior.

# Row-level outcome
+------+------+------------------------------------------------------------+---------------------------+
|myCol1|myCol2|DataQualityRulesPass                                        |DataQualityEvaluationResult|
+------+------+------------------------------------------------------------+---------------------------+
|2     |1     |[(ColumnValues "myCol1" > 1) OR (ColumnValues "myCol2" > 2)]|Passed                     |
|0     |3     |[(ColumnValues "myCol1" > 1) OR (ColumnValues "myCol2" > 2)]|Passed                     |
+------+------+------------------------------------------------------------+---------------------------+

Some rules aren't supported by this feature because their overall outcome relies on thresholds or ratios. They are listed below.

Rules relying on ratios:

  • Completeness

  • DatasetMatch

  • ReferentialIntegrity

  • Uniqueness

Rules dependent on thresholds:

When the following rules include a with threshold expression, they are not supported. Rules that do not involve with threshold remain supported.

  • ColumnDataType

  • ColumnValues

  • CustomSQL

Expressions

Note

The not and != expressions are not supported in the AWS Glue Data Catalog.

If a rule type doesn't produce a Boolean response, you must provide an expression as a parameter in order to create a Boolean response. For example, the following rule checks the mean (average) of all the values in a column against an expression to return a true or false result.

Mean "colA" between 80 and 100

Some rule types such as IsUnique and IsComplete already return a Boolean response.
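For example, IsComplete takes only a column parameter, while the related Completeness rule type returns a completeness ratio and therefore needs an expression (the column name below is illustrative):

IsComplete "email"
Completeness "email" > 0.95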

The following list describes the expressions that you can use in DQDL rules.

Supported DQDL expressions

= x – Resolves to true if the rule type response is equal to x.
Examples: Completeness "colA" = "1.0", ColumnValues "colA" = "2022-06-30"

!= x – Resolves to true if the rule type response is not equal to x.
Examples: ColumnValues "colA" != "a", ColumnValues "colA" != "2022-06-30"

> x – Resolves to true if the rule type response is greater than x.
Example: ColumnValues "colA" > 10

< x – Resolves to true if the rule type response is less than x.
Examples: ColumnValues "colA" < 1000, ColumnValues "colA" < "2022-06-30"

>= x – Resolves to true if the rule type response is greater than or equal to x.
Example: ColumnValues "colA" >= 10

<= x – Resolves to true if the rule type response is less than or equal to x.
Example: ColumnValues "colA" <= 1000

between x and y – Resolves to true if the rule type response falls in the specified range (exclusive). Only use this expression type for numeric and date types.
Examples: Mean "colA" between 8 and 100, ColumnValues "colA" between "2022-05-31" and "2022-06-30"

not between x and y – Resolves to true if the rule type response does not fall in the specified range (inclusive). Only use this expression type for numeric and date types.
Example: ColumnValues "colA" not between "2022-05-31" and "2022-06-30"

in [a, b, c, ...] – Resolves to true if the rule type response is in the specified set.
Examples: ColumnValues "colA" in [ 1, 2, 3 ], ColumnValues "colA" in [ "a", "b", "c" ]

not in [a, b, c, ...] – Resolves to true if the rule type response is not in the specified set.
Examples: ColumnValues "colA" not in [ 1, 2, 3 ], ColumnValues "colA" not in [ "a", "b", "c" ]

matches /ab+c/i – Resolves to true if the rule type response matches a regular expression.
Example: ColumnValues "colA" matches "[a-zA-Z]*"

not matches /ab+c/i – Resolves to true if the rule type response does not match a regular expression.
Example: ColumnValues "colA" not matches "[a-zA-Z]*"

now() – Works only with the ColumnValues rule type to create a date expression.
Example: ColumnValues "load_date" > (now() - 3 days)

matches/in [...]/not matches/not in [...] with threshold – Specifies the percentage of values that must match the rule conditions. Works only with the ColumnValues, ColumnDataType, and CustomSQL rule types.
Examples: ColumnValues "colA" in ["A", "B"] with threshold > 0.8, ColumnValues "colA" matches "[a-zA-Z]*" with threshold between 0.2 and 0.9, ColumnDataType "colA" = "Timestamp" with threshold > 0.9

Keywords for NULL, EMPTY and WHITESPACES_ONLY

Note

Keywords for NULL, EMPTY and WHITESPACES_ONLY are not supported in AWS Glue Data Catalog.

If you want to validate whether a string column contains a null, an empty string, or a whitespace-only string, you can use the following keywords:

  • NULL / null – This keyword resolves to true for a null value in a string column.

    ColumnValues "colA" != NULL with threshold > 0.5 would return true if more than 50% of your data does not have null values.

    (ColumnValues "colA" = NULL) or (ColumnLength "colA" > 5) would return true for all rows that either have a null value or have a length greater than 5. Note that this requires the "compositeRuleEvaluation.method" = "ROW" option.

  • EMPTY / empty – This keyword resolves to true for an empty string ("") value in a string column. Some data formats transform nulls in a string column to empty strings. This keyword helps filter out empty strings in your data.

    (ColumnValues "colA" = EMPTY) or (ColumnValues "colA" in ["a", "b"]) would return true if a row is either empty, "a", or "b". Note that this requires the "compositeRuleEvaluation.method" = "ROW" option.

  • WHITESPACES_ONLY / whitespaces_only – This keyword resolves to true for a whitespace-only string (" ") value in a string column.

    ColumnValues "colA" not in ["a", "b", WHITESPACES_ONLY] would return true if a row is neither "a" nor "b" nor a whitespace-only string.


For a numeric or date-based expression, if you want to validate whether a column contains nulls, you can use the following keyword:

  • NULL / null – This keyword resolves to true for a null value in the column.

    ColumnValues "colA" in [NULL, "2023-01-01"] would return true if the dates in your column are either 2023-01-01 or null.

    (ColumnValues "colA" = NULL) or (ColumnValues "colA" between 1 and 9) would return true for all rows that either have a null value or have values between 1 and 9. Note that this requires the "compositeRuleEvaluation.method" = "ROW" option.


Dynamic rules

You can now author dynamic rules to compare current metrics produced by your rules with their historical values. These historical comparisons are enabled by using the last() operator in expressions. For example, the rule RowCount > last() will succeed when the number of rows in the current run is greater than the most recent prior row count for the same dataset. last() takes an optional natural number argument describing how many prior metrics to consider; last(k) where k >= 1 will reference the last k metrics.

  • If no data points are available, last(k) will return the default value 0.0 (see the example following this list).

  • If fewer than k metrics are available, last(k) will return all prior metrics.
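For example, on the very first run for a dataset, a rule such as RowCount > last() compares the current row count with the default value 0.0, so it passes whenever the dataset contains at least one row.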

To form a valid expression with last(k) where k > 1, you need an aggregation function to reduce the multiple historical results to a single number. For example, RowCount > avg(last(5)) checks whether the current dataset's row count is strictly greater than the average of the last five row counts for the same dataset. RowCount > last(5) produces an error, because the current dataset row count can't be meaningfully compared to a list.

Supported aggregation functions:

  • avg

  • median

  • max

  • min

  • sum

  • std (standard deviation)

  • abs (absolute value)

  • index(last(k), i) allows for selecting the ith most recent value out of the last k. i is zero-indexed, so index(last(3), 0) returns the most recent data point, while index(last(3), 3) results in an error, because there are only three data points and the expression attempts to index the fourth most recent one.

Sample expressions

ColumnCorrelation

  • ColumnCorrelation "colA" "colB" < avg(last(10))

DistinctValuesCount

  • DistinctValuesCount "colA" between min(last(10))-1 and max(last(10))+1

Most rule types with numeric conditions or thresholds support dynamic rules; see the provided table, Analyzers and Rules, to determine whether dynamic rules are supported for your rule type.

Analyzers

Note

Analyzers are not supported in AWS Glue Data Catalog.

DQDL rules use functions called analyzers to gather information about your data. This information is employed by a rule’s Boolean expression to determine whether the rule should succeed or fail. For example, the RowCount rule RowCount > 5 will use a row count analyzer to discover the number of rows in your dataset, and compare that count with the expression > 5 to check whether more than five rows exist in the current dataset.

Sometimes, instead of authoring rules, we recommend creating analyzers and having them generate statistics that can be used to detect anomalies. Analyzers differ from rules in the following ways.

Characteristic                                                     Analyzers  Rules
Part of ruleset                                                    Yes        Yes
Generates statistics                                               Yes        Yes
Generates observations                                             Yes        Yes
Can evaluate and assert a condition                                No         Yes
Can configure actions (stop the job on failure, continue the job)  No         Yes

Analyzers can exist independently of rules, so you can configure them quickly and progressively build data quality rules.
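For example, the following minimal ruleset (the column name colA is a placeholder) is a sketch that contains only analyzers and asserts no conditions; it gathers completeness and row count statistics for later use:

Analyzers = [
    Completeness "colA",
    RowCount
]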

Some rule types can be input in the Analyzers block of your ruleset to run the rules required for analyzers and gather information without applying checks for any condition. Some analyzers aren't associated with rules and can only be input in the Analyzers block. The following table indicates whether each item is supported as a rule or a standalone analyzer, along with additional details for each rule type.

Example Ruleset with Analyzer

The following ruleset uses:

  • a dynamic rule to check whether a dataset is growing over its trailing average for the last three job runs

  • a DistinctValuesCount analyzer to record the number of distinct values in the dataset's Name column

  • a ColumnLength analyzer to track minimum and maximum Name size over time

Analyzer metric results can be viewed in the Data Quality tab for your job run.

Rules = [
    RowCount > avg(last(3))
],
Analyzers = [
    DistinctValuesCount "Name",
    ColumnLength "Name"
]

Comments

You can use the '#' character to add a comment to your DQDL document. Anything after the '#' character and until the end of the line is ignored by DQDL.

Rules = [
    # More items should generally mean a higher price, so correlation should be positive
    ColumnCorrelation "price" "num_items" > 0
]

DQDL rule type reference

This section provides a reference for each rule type that AWS Glue Data Quality supports.

Note
  • DQDL doesn't currently support nested or list-type column data.

  • Bracketed values in the reference below will be replaced with the information provided in the rule arguments.

  • Rules typically require an additional expression argument.

The following reference describes each rule type: its description, arguments, and reported metrics, and whether it is supported as a rule, is supported as an analyzer, returns row-level results, supports dynamic rules, and generates observations.

AggregateMatch
    Description: Checks if two datasets match by comparing summary metrics like total sales amount. Useful for financial institutions to compare whether all data is ingested from source systems.
    Arguments: One or more aggregations.
    Reported metrics: Column.[Column].AggregateMatch when the first and second aggregation column names match; Column.[Column1,Column2].AggregateMatch when the first and second aggregation column names differ.
    Rule: Yes. Analyzer: No. Row-level results: No. Dynamic rules: No. Observations: No.

AllStatistics
    Description: Standalone analyzer to gather multiple metrics for the provided column, or for all columns in a dataset.
    Arguments: A single column name, or "AllColumns".
    Reported metrics: For columns of all types: Dataset.*.RowCount, Column.[Column].Completeness, Column.[Column].Uniqueness. Additional metrics for string-valued columns: ColumnLength metrics. Additional metrics for numeric-valued columns: ColumnValues metrics.
    Rule: No. Analyzer: Yes. Row-level results: No. Dynamic rules: No. Observations: No.

ColumnCorrelation
    Description: Checks how well two columns are correlated.
    Arguments: Exactly two column names.
    Reported metrics: Multicolumn.[Column1],[Column2].ColumnCorrelation.
    Rule: Yes. Analyzer: Yes. Row-level results: No. Dynamic rules: Yes. Observations: No.

ColumnCount
    Description: Checks if any columns are dropped.
    Arguments: None.
    Reported metrics: Dataset.*.ColumnCount.
    Rule: Yes. Analyzer: No. Row-level results: No. Dynamic rules: Yes. Observations: Yes.

ColumnDataType
    Description: Checks if a column is compliant with a datatype.
    Arguments: Exactly one column name.
    Reported metrics: Column.[Column].ColumnDataType.Compliance.
    Rule: Yes. Analyzer: No. Row-level results: No. Dynamic rules: Yes, in the row-level threshold expression. Observations: No.

ColumnExists
    Description: Checks if columns exist in a dataset. This allows customers building self-service data platforms to ensure certain columns are made available.
    Arguments: Exactly one column name.
    Reported metrics: N/A.
    Rule: Yes. Analyzer: No. Row-level results: No. Dynamic rules: No. Observations: No.

ColumnLength
    Description: Checks if the length of data is consistent.
    Arguments: Exactly one column name.
    Reported metrics: Column.[Column].MaximumLength, Column.[Column].MinimumLength. Additional metric when a row-level threshold is provided: Column.[Column].ColumnValues.Compliance.
    Rule: Yes. Analyzer: Yes. Row-level results: Yes, when a row-level threshold is provided. Dynamic rules: No. Observations: Yes, by analyzing minimum and maximum length only.

ColumnNamesMatchPattern
    Description: Checks if column names match defined patterns. Useful for governance teams to enforce column name consistency.
    Arguments: A regex for column names.
    Reported metrics: Dataset.*.ColumnNamesPatternMatchRatio.
    Rule: Yes. Analyzer: No. Row-level results: No. Dynamic rules: No. Observations: No.

ColumnValues
    Description: Checks if data is consistent per defined values. This rule supports regular expressions.
    Arguments: Exactly one column name.
    Reported metrics: Column.[Column].Maximum, Column.[Column].Minimum. Additional metric when a row-level threshold is provided: Column.[Column].ColumnValues.Compliance.
    Rule: Yes. Analyzer: No. Row-level results: Yes, when a row-level threshold is provided. Dynamic rules: No. Observations: Yes, by analyzing minimum and maximum values only.

Completeness
    Description: Checks for any blank or NULL values in data.
    Arguments: Exactly one column name.
    Reported metrics: Column.[Column].Completeness.
    Rule: Yes. Analyzer: Yes. Row-level results: Yes. Dynamic rules: Yes. Observations: Yes.

CustomSql
    Description: Customers can implement almost any type of data quality check in SQL.
    Arguments: A SQL statement; (optional) a row-level threshold.
    Reported metrics: Dataset.*.CustomSQL. Additional metric when a row-level threshold is provided: Dataset.*.CustomSQL.Compliance.
    Rule: Yes. Analyzer: Yes. Row-level results: Yes, when a row-level threshold is provided. Dynamic rules: Yes. Observations: No.

DataFreshness
    Description: Checks if data is fresh.
    Arguments: Exactly one column name.
    Reported metrics: Column.[Column].DataFreshness.Compliance.
    Rule: Yes. Analyzer: No. Row-level results: Yes. Dynamic rules: No. Observations: No.

DatasetMatch
    Description: Compares two datasets and identifies if they are in sync.
    Arguments: Name of a reference dataset; a column mapping; (optional) columns to check for matches.
    Reported metrics: Dataset.[ReferenceDatasetAlias].DatasetMatch.
    Rule: Yes. Analyzer: No. Row-level results: Yes. Dynamic rules: Yes. Observations: No.

DistinctValuesCount
    Description: Checks for duplicate values.
    Arguments: Exactly one column name.
    Reported metrics: Column.[Column].DistinctValuesCount.
    Rule: Yes. Analyzer: Yes. Row-level results: Yes. Dynamic rules: Yes. Observations: Yes.

DetectAnomalies
    Description: Checks for anomalies in another rule type's reported metrics.
    Arguments: A rule type.
    Reported metrics: Metric(s) reported by the rule type argument.
    Rule: Yes. Analyzer: No. Row-level results: No. Dynamic rules: No. Observations: No.

Entropy
    Description: Checks the entropy of the data.
    Arguments: Exactly one column name.
    Reported metrics: Column.[Column].Entropy.
    Rule: Yes. Analyzer: Yes. Row-level results: No. Dynamic rules: Yes. Observations: No.

IsComplete
    Description: Checks if 100% of the data is complete.
    Arguments: Exactly one column name.
    Reported metrics: Column.[Column].Completeness.
    Rule: Yes. Analyzer: No. Row-level results: Yes. Dynamic rules: No. Observations: No.

IsPrimaryKey
    Description: Checks if a column is a primary key (not NULL and unique).
    Arguments: One or more column names.
    Reported metrics: For a single column: Column.[Column].Uniqueness. For multiple columns: Multicolumn.[CommaDelimitedColumns].Uniqueness.
    Rule: Yes. Analyzer: No. Row-level results: Yes. Dynamic rules: No. Observations: No.

IsUnique
    Description: Checks if 100% of the data is unique.
    Arguments: Exactly one column name.
    Reported metrics: Column.[Column].Uniqueness.
    Rule: Yes. Analyzer: No. Row-level results: Yes. Dynamic rules: No. Observations: No.

Mean
    Description: Checks if the mean matches the set threshold.
    Arguments: Exactly one column name.
    Reported metrics: Column.[Column].Mean.
    Rule: Yes. Analyzer: Yes. Row-level results: Yes. Dynamic rules: Yes. Observations: No.

ReferentialIntegrity
    Description: Checks if two datasets have referential integrity.
    Arguments: One or more column names from the primary dataset; one or more column names from the reference dataset.
    Reported metrics: Column.[ReferenceDatasetAlias].ReferentialIntegrity.
    Rule: Yes. Analyzer: No. Row-level results: Yes. Dynamic rules: Yes. Observations: No.

RowCount
    Description: Checks if record counts match a threshold.
    Arguments: None.
    Reported metrics: Dataset.*.RowCount.
    Rule: Yes. Analyzer: Yes. Row-level results: No. Dynamic rules: Yes. Observations: Yes.

RowCountMatch
    Description: Checks if record counts between two datasets match.
    Arguments: Reference dataset alias.
    Reported metrics: Dataset.[ReferenceDatasetAlias].RowCountMatch.
    Rule: Yes. Analyzer: No. Row-level results: No. Dynamic rules: Yes. Observations: No.

StandardDeviation
    Description: Checks if the standard deviation matches the threshold.
    Arguments: Exactly one column name.
    Reported metrics: Column.[Column].StandardDeviation.
    Rule: Yes. Analyzer: Yes. Row-level results: Yes. Dynamic rules: Yes. Observations: No.

SchemaMatch
    Description: Checks if the schemas of two datasets match.
    Arguments: Reference dataset alias.
    Reported metrics: Dataset.[ReferenceDatasetAlias].SchemaMatch.
    Rule: Yes. Analyzer: No. Row-level results: No. Dynamic rules: Yes. Observations: No.

Sum
    Description: Checks if the sum matches a set threshold.
    Arguments: Exactly one column name.
    Reported metrics: Column.[Column].Sum.
    Rule: Yes. Analyzer: Yes. Row-level results: No. Dynamic rules: Yes. Observations: No.

Uniqueness
    Description: Checks if the uniqueness of the dataset matches the threshold.
    Arguments: Exactly one column name.
    Reported metrics: Column.[Column].Uniqueness.
    Rule: Yes. Analyzer: Yes. Row-level results: Yes. Dynamic rules: Yes. Observations: No.

UniqueValueRatio
    Description: Checks if the unique value ratio matches the threshold.
    Arguments: Exactly one column name.
    Reported metrics: Column.[Column].UniqueValueRatio.
    Rule: Yes. Analyzer: Yes. Row-level results: Yes. Dynamic rules: Yes. Observations: No.

AggregateMatch

Checks the ratio of two column aggregations against a given expression. This rule type works on multiple datasets. The two column aggregations are evaluated, and a ratio is produced by dividing the result of the first column aggregation by the result of the second column aggregation. The ratio is checked against the provided expression to produce a Boolean response.
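For example (with illustrative numbers), if sum(amount) evaluates to 950 and sum(total_amount) evaluates to 1000, then AggregateMatch "sum(amount)" "sum(total_amount)" >= 0.9 computes the ratio 950 / 1000 = 0.95, which satisfies the expression, so the rule passes.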

Syntax

Column aggregation

<AGG_OPERATION> (<OPTIONAL_REFERENCE_ALIAS>.<COL_NAME>)
  • AGG_OPERATION – The operation to use for the aggregation. Currently, sum and avg are supported.

    Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short

  • OPTIONAL_REFERENCE_ALIAS – This parameter needs to be provided if the column is from a reference dataset and not the primary dataset. If you are using this rule in the AWS Glue Data Catalog, your reference alias must follow the format "<database_name>.<table_name>.<column_name>".

    Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short

  • COL_NAME – The name of the column to aggregate.

    Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short

Example: Average

"avg(rating)"

Example: Sum

"sum(amount)"

Example: Average of column in reference dataset

"avg(reference.rating)"

Rule

AggregateMatch <AGG_EXP_1> <AGG_EXP_2> <EXPRESSION>
  • AGG_EXP_1 – The first column aggregation.

    Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short

  • AGG_EXP_2 – The second column aggregation.

    Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short

  • EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.

Example: Aggregate Match using sum

The following example rule checks whether the sum of the values in the amount column is exactly equal to the sum of the values in the total_amount column.

AggregateMatch "sum(amount)" "sum(total_amount)" = 1.0

Example: Aggregate Match using average

The following example rule checks whether the average of the values in the ratings column is equal to at least 90% of the average of the values in the ratings column in the reference dataset. The reference dataset is provided as an additional data source in the ETL or Data Catalog experience.

In AWS Glue ETL, you can use:

AggregateMatch "avg(ratings)" "avg(reference.ratings)" >= 0.9

In the AWS Glue Data Catalog, you can use:

AggregateMatch "avg(ratings)" "avg(database_name.tablename.ratings)" >= 0.9

Null behavior

The AggregateMatch rule will ignore rows with NULL values in the calculation of the aggregation methods (sum/mean). For example:

+---+-----------+
|id |units      |
+---+-----------+
|100|0          |
|101|null       |
|102|20         |
|103|null       |
|104|40         |
+---+-----------+

The mean of column units will be (0 + 20 + 40) / 3 = 20. Rows 101 and 103 are not considered in this calculation.

ColumnCorrelation

Checks the correlation between two columns against a given expression. AWS Glue Data Quality uses the Pearson correlation coefficient to measure the linear correlation between two columns. The result is a number between -1 and 1 that measures the strength and direction of the relationship.

Syntax

ColumnCorrelation <COL_1_NAME> <COL_2_NAME> <EXPRESSION>
  • COL_1_NAME – The name of the first column that you want to evaluate the data quality rule against.

    Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short

  • COL_2_NAME – The name of the second column that you want to evaluate the data quality rule against.

    Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short

  • EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.

Example: Column correlation

The following example rule checks whether the correlation coefficient between the columns height and weight has a strong positive correlation (a coefficient value greater than 0.8).

ColumnCorrelation "height" "weight" > 0.8

Sample dynamic rules

  • ColumnCorrelation "colA" "colB" between min(last(10)) and max(last(10))

  • ColumnCorrelation "colA" "colB" < avg(last(5)) + std(last(5))

Null behavior

The ColumnCorrelation rule will ignore rows with NULL values in the calculation of the correlation. For example:

+---+-----------+
|id |units      |
+---+-----------+
|100|0          |
|101|null       |
|102|20         |
|103|null       |
|104|40         |
+---+-----------+

Rows 101 and 103 will be ignored, and the ColumnCorrelation will be 1.0.

ColumnCount

Checks the column count of the primary dataset against a given expression. In the expression, you can specify the number of columns or a range of columns using operators like > and <.

Syntax

ColumnCount <EXPRESSION>
  • EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.

Example: Column count numeric check

The following example rule checks whether the column count is within a given range.

ColumnCount between 10 and 20

Sample dynamic rules

  • ColumnCount >= avg(last(10))

  • ColumnCount between min(last(10))-1 and max(last(10))+1

ColumnDataType

Checks the inherent data type of the values in a given column against the provided expected type. Accepts a with threshold expression to check for a subset of the values in the column.

Syntax

ColumnDataType <COL_NAME> = <EXPECTED_TYPE>

ColumnDataType <COL_NAME> = <EXPECTED_TYPE> with threshold <EXPRESSION>
  • COL_NAME – The name of the column that you want to evaluate the data quality rule against.

    Supported column types: String

  • EXPECTED_TYPE – The expected type of the values in the column.

    Supported values: Boolean, Date, Timestamp, Integer, Double, Float, Long

  • EXPRESSION – An optional expression to specify the percentage of values that should be of the expected type.

Example: Column data type integers as strings

The following example rule checks whether the values in the given column, which is of type string, are actually integers.

ColumnDataType "colA" = "INTEGER"

Example: Column data type integers as strings check for a subset of the values

The following example rule checks whether more than 90% of the values in the given column, which is of type string, are actually integers.

ColumnDataType "colA" = "INTEGER" with threshold > 0.9

ColumnExists

Checks whether a column exists.

Syntax

ColumnExists <COL_NAME>
  • COL_NAME – The name of the column that you want to evaluate the data quality rule against.

    Supported column types: Any column type

Example: Column exists

The following example rule checks whether the column named Middle_Name exists.

ColumnExists "Middle_Name"

ColumnLength

Checks whether the length of each row in a column conforms to a given expression.

Syntax

ColumnLength <COL_NAME> <EXPRESSION>
  • COL_NAME – The name of the column that you want to evaluate the data quality rule against.

    Supported column types: String

  • EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.

Example: Column row length

The following example rule checks whether the value in each row in the column named Postal_Code is 5 characters long.

ColumnLength "Postal_Code" = 5

Null behavior

The ColumnLength rule treats NULLs as zero-length strings. For a NULL row:

ColumnLength "Postal_Code" > 4 # this will fail
ColumnLength "Postal_Code" < 6 # this will succeed

The following example compound rule provides a way to explicitly fail NULL values:

(ColumnLength "Postal_Code" > 4) AND (ColumnValues "Postal_Code" != NULL)

ColumnNamesMatchPattern

Checks whether the names of all columns in the primary dataset match the given regular expression.

Syntax

ColumnNamesMatchPattern <PATTERN>
  • PATTERN – The pattern you want to evaluate the data quality rule against.


Example: Column names match pattern

The following example rule checks whether all columns start with the prefix "aws_".

ColumnNamesMatchPattern "aws_.*"

ColumnValues

Runs an expression against the values in a column.

Syntax

ColumnValues <COL_NAME> <EXPRESSION>
  • COL_NAME – The name of the column that you want to evaluate the data quality rule against.

    Supported column types: Any column type

  • EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.

Example: Allowed values

The following example rule checks whether each value in the specified column is in a set of allowed values (including null, empty, and strings with only whitespaces).

ColumnValues "Country" in [ "US", "CA", "UK", NULL, EMPTY, WHITESPACES_ONLY ]

Example: Regular expression

The following example rule checks the values in a column against a regular expression.

ColumnValues "First_Name" matches "[a-zA-Z]*"

Example: Date values

The following example rule checks the values in a date column against a date expression.

ColumnValues "Load_Date" > (now() - 3 days)

Example: Numeric values

The following example rule checks whether the column values match a certain numeric constraint.

ColumnValues "Customer_ID" between 1 and 2000

Null behavior

For all ColumnValues rules (other than != and NOT IN), NULL rows will fail the rule. If the rule fails due to a null value, the failure reason will display the following:

Value: NULL does not meet the constraint requirement!

The following example compound rule provides a way to explicitly allow for NULL values:

(ColumnValues "Age" > 21) OR (ColumnValues "Age" = NULL)

Negated ColumnValues rules using the != and not in syntax will pass for NULL rows. For example:

ColumnValues "Age" != 21
ColumnValues "Age" not in [21, 22, 23]

The following examples provide a way to explicitly fail NULL values:

(ColumnValues "Age" != 21) AND (ColumnValues "Age" != NULL)
ColumnValues "Age" not in [21, 22, 23, NULL]

Completeness

Checks the percentage of complete (non-null) values in a column against a given expression.

Syntax

Completeness <COL_NAME> <EXPRESSION>
  • COL_NAME – The name of the column that you want to evaluate the data quality rule against.

    Supported column types: Any column type

  • EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.

Example: Null value percentage

The following example rules check if more than 95 percent of the values in a column are complete.

Completeness "First_Name" > 0.95

Sample dynamic rules

  • Completeness "colA" between min(last(5)) - 1 and max(last(5)) + 1

  • Completeness "colA" <= avg(last(10))

Null behavior

Note on CSV Data Formats: Blank rows on CSV columns can display multiple behaviors.

  • If a column is of String type, the blank row will be recognized as an empty string and will not fail the Completeness rule.

  • If a column is of another data type like Int, the blank row will be recognized as NULL and will fail the Completeness rule.

CustomSQL

This rule type has been extended to support two use cases:

  • Run a custom SQL statement against a dataset and check the return value against a given expression.

  • Run a custom SQL statement where you specify a column name in your SELECT statement, against which you compare with some condition to get row-level results.

Syntax

CustomSql <SQL_STATEMENT> <EXPRESSION>
  • SQL_STATEMENT – A SQL statement, surrounded by double quotes, that returns a single numeric value.

  • EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.

Example: Custom SQL to retrieve an overall rule outcome

This example rule uses a SQL statement to retrieve the record count for a data set. The rule then checks that the record count is between 10 and 20.

CustomSql "select count(*) from primary" between 10 and 20

Example: Custom SQL to retrieve row-level results

This example rule uses a SQL statement in which you specify a column name in your SELECT statement, against which you compare with some condition to get row-level results. A threshold condition expression defines how many records must pass for the entire rule to pass. Note that a rule may not contain both a condition and a keyword together.

CustomSql "select Name from primary where Age > 18"

or

CustomSql "select Name from primary where Age > 18" with threshold > 3
Important

The primary alias stands in for the name of the data set that you want to evaluate. When you work with visual ETL jobs on the console, primary always represents the DynamicFrame being passed to the EvaluateDataQuality.apply() transform. When you use the AWS Glue Data Catalog to run data quality tasks against a table, primary represents the table.

If you are in AWS Glue Data Catalog, you can also use the actual table names:

CustomSql "select count(*) from database.table" between 10 and 20

You can also join multiple tables to compare different data elements:

CustomSql "select count(*) from database.table inner join database.table2 on id1 = id2" between 10 and 20

In AWS Glue ETL, CustomSQL can identify records that failed the data quality checks. For this to work, you need to return records that are part of the primary table that you are evaluating for data quality. Records that are returned as part of the query are considered successful, and records that are not returned are considered failed.

The following rule ensures that records with age < 100 are identified as successful, and records with age >= 100 are marked as failed.

CustomSql "select id from primary where age < 100"

This CustomSQL rule will pass when more than 50% of the records have age > 10, and it will also identify the records that failed. The records returned by this CustomSQL statement are considered passed, while the ones not returned are considered failed.

CustomSql "select ID, CustomerID from primary where age > 10" with threshold > 0.5

Note: The CustomSQL rule will fail if you return records that are not available in the dataset.

DataFreshness

Checks the freshness of data in a column by evaluating the difference between the current time and the values of a date column. You can specify a time-based expression for this rule type to make sure that column values are up to date.

Syntax

DataFreshness <COL_NAME> <EXPRESSION>
  • COL_NAME – The name of the column that you want to evaluate the data quality rule against.

    Supported column types: Date

  • EXPRESSION – A numeric expression in hours or days. You must specify the time unit in your expression.

Example: Data freshness

The following example rules check for data freshness.

DataFreshness "Order_Date" <= 24 hours

DataFreshness "Order_Date" between 2 days and 5 days

Null behavior

The DataFreshness rules will fail for rows with NULL values. If the rule fails due to null values, the failure reason will display the following:

80.00 % of rows passed the threshold

where the 20% of rows that failed include the rows with NULL values.

The following example compound rule provides a way to explicitly allow for NULL values:

(DataFreshness "Order_Date" <= 24 hours) OR (ColumnValues "Order_Date" = NULL)

DatasetMatch

Checks if the data in the primary dataset matches the data in a reference dataset. The two datasets are joined using the provided key column mappings. Additional column mappings can be provided should you wish to check for the equality of the data in only those columns. Note that for DatasetMatch to work, your join keys must be unique and must not be NULL (they must form a primary key). If these conditions are not met, you will get the error message, "Provided key map not suitable for given data frames". In cases where you can't have join keys that are unique, consider using other rule types, such as AggregateMatch, to match on summary data.

Syntax

DatasetMatch <REFERENCE_DATASET_ALIAS> <KEY_COLUMN_MAPPINGS> <OPTIONAL_MATCH_COLUMN_MAPPINGS> <EXPRESSION>
  • REFERENCE_DATASET_ALIAS – The alias of the reference dataset with which you compare data from the primary dataset.

  • KEY_COLUMN_MAPPINGS – A comma-separated list of column names that form a key in the datasets. If the column names are not the same in both datasets, you must separate them with a ->.

  • OPTIONAL_MATCH_COLUMN_MAPPINGS – You can supply this parameter if you want to check for matching data only in certain columns. It uses the same syntax as the key column mappings. If this parameter is not provided, we will match the data in all remaining columns. The remaining, non-key columns must have the same names in both datasets.

  • EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.

Example: Match datasets using the ID column

The following example rule checks that at least 90% of the primary dataset matches the reference dataset, using the "ID" column to join the two datasets. It compares all columns in this case.

DatasetMatch "reference" "ID" >= 0.9

Example: Match datasets using multiple key columns

In the following example, the primary dataset and the reference dataset have different names for the key columns. ID_1 and ID_2 together form a composite key in the primary dataset. ID_ref1 and ID_ref2 together form a composite key in the reference dataset. In this scenario, you can use the special syntax to supply the column names.

DatasetMatch "reference" "ID_1->ID_ref1,ID_2->ID_ref2" >= 0.9

Example: Match datasets using multiple key columns and check that a specific column matches

This example builds on the previous example. We want to check that only the column containing the amounts matches. This column is named Amount1 in the primary dataset and Amount2 in the reference dataset. You want an exact match.

DatasetMatch "reference" "ID_1->ID_ref1,ID_2->ID_ref2" "Amount1->Amount2" >= 0.9

DistinctValuesCount

Checks the number of distinct values in a column against a given expression.

Syntax

DistinctValuesCount <COL_NAME> <EXPRESSION>
  • COL_NAME – The name of the column that you want to evaluate the data quality rule against.

    Supported column types: Any column type

  • EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.

Example: Distinct column value count

The following example rule checks that the column named State contains more than 3 distinct values.

DistinctValuesCount "State" > 3

Sample dynamic rules

  • DistinctValuesCount "colA" between avg(last(10))-1 and avg(last(10))+1

  • DistinctValuesCount "colA" <= index(last(10),2) + std(last(5))

Entropy

Checks whether the entropy value of a column matches a given expression. Entropy measures the level of information that's contained in a message. Given the probability distribution over values in a column, entropy describes how many bits are required to identify a value.
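As a worked illustration, a column containing the values [a, a, b, b] has probability 0.5 for each distinct value, so its entropy is -(0.5 * log2(0.5) + 0.5 * log2(0.5)) = 1 bit.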

Syntax

Entropy <COL_NAME> <EXPRESSION>
  • COL_NAME – The name of the column that you want to evaluate the data quality rule against.

    Supported column types: Any column type

  • EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.

Example: Column entropy

The following example rule checks that the column named Star_Rating has an entropy value greater than one.

Entropy "Star_Rating" > 1

Sample dynamic rules

  • Entropy "colA" < max(last(10))

  • Entropy "colA" between min(last(10)) and max(last(10))

IsComplete

Checks whether all of the values in a column are complete (non-null).

Syntax

IsComplete <COL_NAME>
  • COL_NAME – The name of the column that you want to evaluate the data quality rule against.

    Supported column types: Any column type

Example: Null values

The following example checks whether all of the values in a column named email are non-null.

IsComplete "email"

Null behavior

Note on CSV data formats: Blank rows in CSV columns can display multiple behaviors.

  • If a column is of String type, the blank row will be recognized as an empty string and will not fail the IsComplete rule.

  • If a column is of another data type, like Int, the blank row will be recognized as NULL and will fail the IsComplete rule.

IsPrimaryKey

Checks whether a column contains a primary key. A column contains a primary key if all of the values in the column are unique and complete (non-null).

Syntax

IsPrimaryKey <COL_NAME>
  • COL_NAME – The name of the column that you want to evaluate the data quality rule against.

    Supported column types: Any column type

Example: Primary key

The following example rule checks whether the column named Customer_ID contains a primary key.

IsPrimaryKey "Customer_ID"

Example: Primary key with multiple columns

Any of the following examples are valid.

IsPrimaryKey "colA" "colB"

IsPrimaryKey "colA" "colB" "colC"

IsPrimaryKey colA "colB" "colC"

IsUnique

Checks whether all of the values in a column are unique, and returns a Boolean value.

Syntax

IsUnique <COL_NAME>
  • COL_NAME – The name of the column that you want to evaluate the data quality rule against.

    Supported column types: Any column type

Example: Unique column values

The following example rule checks whether all of the values in a column named email are unique.

IsUnique "email"

Mean

Checks whether the mean (average) of all the values in a column matches a given expression.

Syntax

Mean <COL_NAME> <EXPRESSION>
  • COL_NAME – The name of the column that you want to evaluate the data quality rule against.

    Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short

  • EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.

Example: Average value

The following example rule checks whether the average of all of the values in a column exceeds a threshold.

Mean "Star_Rating" > 3

Sample dynamic rules

  • Mean "colA" > avg(last(10)) + std(last(2))

  • Mean "colA" between min(last(5)) - 1 and max(last(5)) + 1

Null behavior

The Mean rule will ignore rows with NULL values in the calculation of the mean. For example:

+---+-----------+
|id |units      |
+---+-----------+
|100|0          |
|101|null       |
|102|20         |
|103|null       |
|104|40         |
+---+-----------+

The mean of column units will be (0 + 20 + 40) / 3 = 20. Rows 101 and 103 are not considered in this calculation.

ReferentialIntegrity

Checks to what extent the values of a set of columns in the primary dataset are a subset of the values of a set of columns in a reference dataset.

Syntax

ReferentialIntegrity <PRIMARY_COLS> <REFERENCE_DATASET_COLS> <EXPRESSION>
  • PRIMARY_COLS – A comma-separated list of column names in the primary dataset.

    Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short

  • REFERENCE_DATASET_COLS – This parameter contains two parts separated by a period. The first part is the alias of the reference dataset. The second part is the comma-separated list of column names in the reference dataset enclosed in braces.

    Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short

  • EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.

Example: Check the referential integrity of a zip code column

The following example rule checks that at least 90% of the values in the zipcode column in the primary dataset are present in the zipcode column in the reference dataset.

ReferentialIntegrity "zipcode" "reference.zipcode" >= 0.9

Example: Check the referential integrity of the city and state columns

In the following example, columns containing city and state information exist in the primary dataset and the reference dataset. The names of the columns are different in both datasets. The rule checks if the set of values of the columns in the primary dataset is exactly equal to the set of values of the columns in the reference dataset.

ReferentialIntegrity "city,state" "reference.{ref_city,ref_state}" = 1.0

Sample dynamic rules

  • ReferentialIntegrity "city,state" "reference.{ref_city,ref_state}" > avg(last(10))

  • ReferentialIntegrity "city,state" "reference.{ref_city,ref_state}" between min(last(10)) - 1 and max(last(10)) + 1

RowCount

Checks the row count of a dataset against a given expression. In the expression, you can specify the number of rows or a range of rows using operators like > and <.

Syntax

RowCount <EXPRESSION>
  • EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.

Example: Row count numeric check

The following example rule checks whether the row count is within a given range.

RowCount between 10 and 100

Sample dynamic rules

  • RowCount > avg(last(10)) * 0.8

RowCountMatch

Checks the ratio of the row count of the primary dataset and the row count of a reference dataset against the given expression.

Syntax

RowCountMatch <REFERENCE_DATASET_ALIAS> <EXPRESSION>
  • REFERENCE_DATASET_ALIAS – The alias of the reference dataset against which to compare row counts.


  • EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.

Example: Row count check against a reference dataset

The following example rule checks whether the row count of the primary dataset is at least 90% of the row count of the reference dataset.

RowCountMatch "reference" >= 0.9

StandardDeviation

Checks the standard deviation of all of the values in a column against a given expression.

Syntax

StandardDeviation <COL_NAME> <EXPRESSION>
  • COL_NAME – The name of the column that you want to evaluate the data quality rule against.

    Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short

  • EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.

Example: Standard deviation

The following example rule checks whether the standard deviation of the values in the Star_Rating column is less than a specified value.

StandardDeviation "Star_Rating" < 1.5

Sample dynamic rules

  • StandardDeviation "colA" > avg(last(10)) + 0.1

  • StandardDeviation "colA" between min(last(10)) - 1 and max(last(10)) + 1

Null behavior

The StandardDeviation rule will ignore rows with NULL values in the calculation of standard deviation. For example:

+---+-----------+-----------+
|id |units1     |units2     |
+---+-----------+-----------+
|100|0          |0          |
|101|null       |0          |
|102|20         |20         |
|103|null       |0          |
|104|40         |40         |
+---+-----------+-----------+

The standard deviation of column units1 will not consider rows 101 and 103, and results in 16.33. The standard deviation of column units2 results in 16.

Sum

Checks the sum of all the values in a column against a given expression.

Syntax

Sum <COL_NAME> <EXPRESSION>
  • COL_NAME – The name of the column that you want to evaluate the data quality rule against.

    Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short

  • EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.

Example: Sum

The following example rule checks whether the sum of all of the values in a column exceeds a given threshold.

Sum "transaction_total" > 500000

Sample dynamic rules

  • Sum "ColA" > avg(last(10))

  • Sum "colA" between min(last(10)) - 1 and max(last(10)) + 1

Null behavior

The Sum rule will ignore rows with NULL values in the calculation of sum. For example:

+---+-----------+
|id |units      |
+---+-----------+
|100|0          |
|101|null       |
|102|20         |
|103|null       |
|104|40         |
+---+-----------+

The sum of column units will not consider rows 101 and 103, resulting in (0 + 20 + 40) = 60.

SchemaMatch

Checks if the schema of the primary dataset matches the schema of a reference dataset. The schema check is done column by column. The schema of two columns match if the names are identical and the types are identical. The order of the columns does not matter.
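As an illustration, assuming the reported metric is the fraction of primary columns with a name-and-type match: if nine of ten columns in the primary dataset have a matching column in the reference dataset, the metric is 0.9, so SchemaMatch "reference" >= 0.9 would pass, while SchemaMatch "reference" = 1.0 would fail.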

Syntax

SchemaMatch <REFERENCE_DATASET_ALIAS> <EXPRESSION>
  • REFERENCE_DATASET_ALIAS – The alias of the reference dataset against which to compare schemas.


  • EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.

Example: SchemaMatch

The following example rule checks whether the schema of the primary dataset exactly matches the schema of a reference dataset.

SchemaMatch "reference" = 1.0

Uniqueness

Checks the percentage of unique values in a column against a given expression. Unique values occur exactly once.
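As a worked illustration, under this definition the set [a, a, b] contains one unique value (b) across three rows, so its uniqueness is 1/3 ≈ 0.33 (compare this with the UniqueValueRatio example later in this section, which divides by the number of distinct values instead).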

Syntax

Uniqueness <COL_NAME> <EXPRESSION>
  • COL_NAME – The name of the column that you want to evaluate the data quality rule against.

    Supported column types: Any column type

  • EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.

Example: Uniqueness percentage

The following example rule checks whether the percentage of unique values in a column matches certain numeric criteria.

Uniqueness "email" = 1.0

Sample dynamic rules

  • Uniqueness "colA" between min(last(10)) and max(last(10))

  • Uniqueness "colA" >= avg(last(10))

UniqueValueRatio

Checks the unique value ratio of a column against a given expression. A unique value ratio is the fraction of unique values divided by the number of all distinct values in a column. Unique values occur exactly one time, while distinct values occur at least once.

For example, the set [a, a, b] contains one unique value (b) and two distinct values (a and b). So the unique value ratio of the set is ½ = 0.5.

Syntax

UniqueValueRatio <COL_NAME> <EXPRESSION>
  • COL_NAME – The name of the column that you want to evaluate the data quality rule against.

    Supported column types: Any column type

  • EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.

Example: Unique value ratio

This example checks the unique value ratio of a column against a range of values.

UniqueValueRatio "test_score" between 0 and 0.5

Sample dynamic rules

  • UniqueValueRatio "colA" > avg(last(10))

  • UniqueValueRatio "colA" <= index(last(10),2) + std(last(5))

DetectAnomalies

Detects anomalies for a given Deequ rule. Every execution of a DetectAnomalies rule saves the evaluated value for the given rule. When enough data has been gathered, the anomaly detection algorithm takes all of the historical data for that rule and runs anomaly detection. The DetectAnomalies rule fails when an anomaly is detected. More information about which anomaly was detected can be obtained from Observations.

Syntax

DetectAnomalies <RULE_NAME> <RULE_PARAMETERS>

RULE_NAME – The name of the rule that you want to evaluate and detect anomalies for. Supported rules:

  • "RowCount"

  • "Completeness"

  • "Uniqueness"

  • "Mean"

  • "Sum"

  • "StandardDeviation"

  • "Entropy"

  • "DistinctValuesCount"

  • "UniqueValueRatio"

  • "ColumnLength"

  • "ColumnValues"

  • "ColumnCorrelation"

  • "CustomSql"

RULE_PARAMETERS – Some rules require additional parameters to run. Refer to the given rule's documentation to see the required parameters.

Example: Anomalies for RowCount

For example, if we want to detect RowCount anomalies, we provide RowCount as a rule name.

DetectAnomalies "RowCount"

Example: Anomalies for ColumnLength

For example, if we want to detect ColumnLength anomalies, we provide ColumnLength as a rule name and the column name.

DetectAnomalies "ColumnLength" "id"