DQDL rule type reference

This section provides a reference for each rule type that AWS Glue Data Quality supports.

Note
  • DQDL doesn't currently support nested or list-type column data.

  • Bracketed values in the table below are replaced with the values provided in the rule arguments.

  • Rules typically require an additional argument that specifies an expression.
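
The note about bracketed values can be illustrated with a short Python sketch. It only mimics the naming pattern shown in the Reported Metrics column; the helper name `expand_metric` and the sample values (`order_id`, `reference`) are hypothetical, and the actual metric names are produced by the service, not by this code.

```python
import re


def expand_metric(template, values=None):
    """Replace each [Name] placeholder in a metric template with a supplied value."""
    values = values or {}

    def substitute(match):
        key = match.group(1)
        if key not in values:
            raise KeyError(f"no value supplied for placeholder [{key}]")
        return values[key]

    # A placeholder is any run of non-']' characters enclosed in square brackets.
    return re.sub(r"\[([^\]]+)\]", substitute, template)


# Hypothetical column name and dataset alias, chosen only for illustration.
print(expand_metric("Column.[Column].Completeness", {"Column": "order_id"}))
print(expand_metric("Dataset.[ReferenceDatasetAlias].RowCountMatch",
                    {"ReferenceDatasetAlias": "reference"}))
print(expand_metric("Dataset.*.RowCount"))  # no placeholder, returned unchanged
```

Templates without brackets, such as the dataset-level metrics, pass through untouched, and a missing value raises an error instead of leaving a bracket in the name.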

| Rule type | Description | Arguments | Reported Metrics | Supported as Rule? | Supported as Analyzer? | Returns Row-Level Results? | Dynamic Rule Support? | Generates Observations? | Supports Where Clause Syntax? |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AggregateMatch | Checks if two datasets match by comparing summary metrics such as total sales amount. Useful for financial institutions to verify that all data is ingested from source systems. | One or more aggregations | When the first and second aggregation column names match: `Column.[Column].AggregateMatch`; when they differ: `Column.[Column1,Column2].AggregateMatch` | Yes | No | No | No | No | No |
| AllStatistics | Standalone analyzer that gathers multiple metrics for the provided column in a dataset. | A single column name | For columns of all types: `Dataset.*.RowCount`, `Column.[Column].Completeness`, `Column.[Column].Uniqueness`; additional metrics for string-valued columns: ColumnLength metrics; additional metrics for numeric-valued columns: ColumnValues metrics | No | Yes | No | No | No | No |
| ColumnCorrelation | Checks how well two columns are correlated. | Exactly two column names | `Multicolumn.[Column1,Column2].ColumnCorrelation` | Yes | Yes | No | Yes | No | Yes |
| ColumnCount | Checks if any columns are dropped. | None | `Dataset.*.ColumnCount` | Yes | Yes | No | Yes | Yes | No |
| ColumnDataType | Checks if a column is compliant with a data type. | Exactly one column name | `Column.[Column].ColumnDataType.Compliance` | Yes | No | No | Yes, in row-level threshold expression | No | Yes |
| ColumnExists | Checks if columns exist in a dataset. This allows customers building self-service data platforms to ensure that certain columns are made available. | Exactly one column name | N/A | Yes | No | No | No | No | No |
| ColumnLength | Checks if the length of data is consistent. | Exactly one column name | `Column.[Column].MaximumLength`, `Column.[Column].MinimumLength`; additional metric when a row-level threshold is provided: `Column.[Column].ColumnValues.Compliance` | Yes | Yes | Yes, when row-level threshold provided | No | Yes. Only generates observations by analyzing minimum and maximum length | Yes |
| ColumnNamesMatchPattern | Checks if column names match defined patterns. Useful for governance teams to enforce column name consistency. | A regex for column names | `Dataset.*.ColumnNamesPatternMatchRatio` | Yes | No | No | No | No | No |
| ColumnValues | Checks if data is consistent with defined values. This rule supports regular expressions. | Exactly one column name | `Column.[Column].Maximum`, `Column.[Column].Minimum`; additional metric when a row-level threshold is provided: `Column.[Column].ColumnValues.Compliance` | Yes | Yes | Yes, when row-level threshold provided | No | Yes. Only generates observations by analyzing minimum and maximum values | Yes |
| Completeness | Checks for blank or NULL values in data. | Exactly one column name | `Column.[Column].Completeness` | Yes | Yes | Yes | Yes | Yes | Yes |
| CustomSql | Lets you implement almost any type of data quality check in SQL. | A SQL statement; (Optional) a row-level threshold | `Dataset.*.CustomSQL`; additional metric when a row-level threshold is provided: `Dataset.*.CustomSQL.Compliance` | Yes | No | Yes, when row-level threshold provided | Yes | No | No |
| DataFreshness | Checks if data is fresh. | Exactly one column name | `Column.[Column].DataFreshness.Compliance` | Yes | No | Yes | No | No | Yes |
| DatasetMatch | Compares two datasets and identifies if they are in sync. | Name of a reference dataset; a column mapping; (Optional) columns to check for matches | `Dataset.[ReferenceDatasetAlias].DatasetMatch` | Yes | No | Yes | Yes | No | No |
| DistinctValuesCount | Checks for duplicate values. | Exactly one column name | `Column.[Column].DistinctValuesCount` | Yes | Yes | Yes | Yes | Yes | Yes |
| DetectAnomalies | Checks for anomalies in another rule type's reported metrics. | A rule type | Metric(s) reported by the rule type argument | Yes | No | No | No | No | No |
| Entropy | Checks the entropy of the data. | Exactly one column name | `Column.[Column].Entropy` | Yes | Yes | No | Yes | No | Yes |
| IsComplete | Checks if 100% of the data is complete. | Exactly one column name | `Column.[Column].Completeness` | Yes | No | Yes | No | No | Yes |
| IsPrimaryKey | Checks if a column is a primary key (not NULL and unique). | Exactly one column name | For a single column: `Column.[Column].Uniqueness`; for multiple columns: `Multicolumn.[CommaDelimitedColumns].Uniqueness` | Yes | No | Yes | No | No | Yes |
| IsUnique | Checks if 100% of the data is unique. | Exactly one column name | `Column.[Column].Uniqueness` | Yes | No | Yes | No | No | Yes |
| Mean | Checks if the mean matches the set threshold. | Exactly one column name | `Column.[Column].Mean` | Yes | Yes | Yes | Yes | No | Yes |
| ReferentialIntegrity | Checks if two datasets have referential integrity. | One or more column names from the dataset; one or more column names from the reference dataset | `Column.[ReferenceDatasetAlias].ReferentialIntegrity` | Yes | No | Yes | Yes | No | No |
| RowCount | Checks if record counts match a threshold. | None | `Dataset.*.RowCount` | Yes | Yes | No | Yes | Yes | Yes |
| RowCountMatch | Checks if record counts between two datasets match. | Reference dataset alias | `Dataset.[ReferenceDatasetAlias].RowCountMatch` | Yes | No | No | Yes | No | No |
| StandardDeviation | Checks if the standard deviation matches the threshold. | Exactly one column name | `Column.[Column].StandardDeviation` | Yes | Yes | Yes | Yes | No | Yes |
| SchemaMatch | Checks if the schemas of two datasets match. | Reference dataset alias | `Dataset.[ReferenceDatasetAlias].SchemaMatch` | Yes | No | No | Yes | No | No |
| Sum | Checks if the sum matches a set threshold. | Exactly one column name | `Column.[Column].Sum` | Yes | Yes | No | Yes | No | Yes |
| Uniqueness | Checks if the uniqueness of the dataset matches the threshold. | Exactly one column name | `Column.[Column].Uniqueness` | Yes | Yes | Yes | Yes | No | Yes |
| UniqueValueRatio | Checks if the unique value ratio matches the threshold. | Exactly one column name | `Column.[Column].UniqueValueRatio` | Yes | Yes | Yes | Yes | No | Yes |
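
As a rough illustration of how the "Supported as Rule?" and "Supported as Analyzer?" columns might be applied, the following Python sketch checks a few statements against a small subset of the table above before assembling ruleset text. The helper names, the column names (`order_id`, `email`, `price`), and the thresholds are hypothetical, and the `Rules = [ ... ]` / `Analyzers = [ ... ]` layout is assumed from general DQDL syntax, not defined on this page.

```python
# Subset of the table above: rule type -> (supported as rule, supported as analyzer).
SUPPORT = {
    "AllStatistics": (False, True),
    "Completeness": (True, True),
    "IsComplete": (True, False),
    "RowCount": (True, True),
}


def check(statements, index, role):
    """Raise ValueError if a statement's rule type is not supported in this role."""
    for statement in statements:
        name = statement.split()[0]  # the rule type is the first token
        if not SUPPORT[name][index]:
            raise ValueError(f"{name} is not supported as {role}")


def section(title, statements):
    """Format one section of the ruleset, one statement per line."""
    body = ",\n".join(f"    {s}" for s in statements)
    return f"{title} = [\n{body}\n]"


def build_ruleset(rules, analyzers):
    """Validate both lists against SUPPORT and return the assembled text."""
    check(rules, 0, "a rule")
    check(analyzers, 1, "an analyzer")
    return "\n".join([section("Rules", rules), section("Analyzers", analyzers)])


# Hypothetical columns and thresholds, for illustration only.
print(build_ruleset(
    rules=['RowCount > 100', 'IsComplete "order_id"', 'Completeness "email" > 0.95'],
    analyzers=['AllStatistics "price"', 'Completeness "email"'],
))
```

Placing `AllStatistics` in the rules list, or `IsComplete` in the analyzers list, raises a `ValueError`, which matches the "No" entries for those rule types in the table. Rule types outside the four listed in `SUPPORT` are not handled by this sketch.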