FileFreshness - AWS Glue

FileFreshness

Note

For all file-based rules, you must run the job in the same AWS Region as your bucket. If you are parsing an Amazon S3 folder, that folder must exist in Amazon S3.

FileFreshness ensures your data files are fresh based on the condition you provide. It uses each file's last modified time to check that a data file, or every file in a folder, is up to date.

This rule gathers two metrics:

  • FileFreshness compliance based on the rule you set up

  • Number of files that were modified for the day

{"Dataset.*.FileFreshness.Compliance":1,"Dataset.*.FileCount":1}

Anomaly detection does not consider these metrics.

Checking file freshness

The following rule ensures that tickets.parquet was created or modified in the past 24 hours.

FileFreshness "s3://bucket/artifacts/file/tickets/tickets.parquet" > (now() - 24 hours)

Checking folder freshness

Each of the following rules passes if all files in the given folder were created or modified in the past 24 hours.

FileFreshness "s3://bucket/" >= (now() - 1 days)
FileFreshness "s3://bucket/artifacts/file/tickets/" >= (now() - 24 hours)

Checking folder or file freshness with threshold

The following rule passes if more than 10% of the files in the folder "tickets" were created or modified in the past 10 days.

FileFreshness "s3://bucket/artifacts/file/tickets/" < (now() - 10 days) with threshold > 0.1
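
To make the threshold semantics concrete, the following is a minimal sketch in plain Python. It is not the AWS Glue implementation. It assumes that the rule compares each file's last modified time against the cutoff and passes when the fraction of matching files satisfies the threshold. The function name, the fixed "now", and the sample timestamps are all hypothetical. The comparison operator is passed in explicitly, because `>` selects files newer than the cutoff and `<` selects files older than it.

```python
import operator
from datetime import datetime, timedelta, timezone


def fraction_matching(last_modified, compare, cutoff):
    """Return the fraction of timestamps t for which compare(t, cutoff) is true."""
    if not last_modified:
        raise ValueError("no files to evaluate")
    hits = sum(1 for t in last_modified if compare(t, cutoff))
    return hits / len(last_modified)


# Hypothetical data: a fixed "now" and five made-up last modified times.
now = datetime(2024, 1, 31, 12, 0, tzinfo=timezone.utc)
last_modified = [
    now - timedelta(hours=2),
    now - timedelta(days=3),
    now - timedelta(days=9),
    now - timedelta(days=15),
    now - timedelta(days=40),
]

cutoff = now - timedelta(days=10)

# Fraction of files modified after the cutoff (3 of 5 files).
newer = fraction_matching(last_modified, operator.gt, cutoff)
# Fraction of files modified before the cutoff (2 of 5 files).
older = fraction_matching(last_modified, operator.lt, cutoff)

# "with threshold > 0.1": more than 10% of the files must match.
passes_threshold = newer > 0.1
# Without a threshold, the sketch assumes every file must match.
passes_all = newer == 1.0
```

In this sketch, a folder in which three of five files are newer than the cutoff satisfies a `> 0.1` threshold but would fail a rule that requires every file to be fresh.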

Checking files or folders with specific dates

You can also check file freshness against specific dates, either a single date or a date range.

FileFreshness "s3://bucket/artifacts/file/tickets/" > "2020-01-01"
FileFreshness "s3://bucket/artifacts/file/tickets/" between "2023-01-01" and "2024-01-01"

Inferring file names directly from data frames

You don't always have to provide a file path. For instance, when you are authoring the rule in the AWS Glue Data Catalog, it may be hard to find which folders the catalog tables are using. AWS Glue Data Quality can find the specific folders or files used to populate your data frame and detect whether they are fresh.

FileFreshness > (now() - 24 hours)

This rule will find the folder path or files that are used to populate the dynamic frame or data frame. This works for Amazon S3 paths or Amazon S3 based AWS Glue Data Catalog tables. There are a few considerations:

  1. In AWS Glue ETL, you must have the EvaluateDataQuality Transform immediately after an Amazon S3 or AWS Glue Data Catalog transform.

    The screenshot shows an Evaluate Data Quality node connected to an Amazon S3 node.
  2. This rule will not work in AWS Glue Interactive Sessions.

If either of these conditions is not met, or if AWS Glue can’t find the files, the rule throws the following error: “Unable to parse file path from DataFrame”
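
The rule variants on this page, including the path-less form, are plain DQDL text, so they can be assembled into a single ruleset string before being handed to a data quality evaluation. The following is a minimal sketch in plain Python. The helpers `file_freshness_rule` and `ruleset` are hypothetical and not part of any AWS library. They only format text and do not validate the rules.

```python
from typing import Optional, Sequence


def file_freshness_rule(condition: str,
                        path: Optional[str] = None,
                        threshold: Optional[str] = None) -> str:
    """Format one FileFreshness rule as DQDL text (hypothetical helper)."""
    parts = ["FileFreshness"]
    if path is not None:
        # An explicit Amazon S3 path is quoted; omit it to infer the path.
        parts.append(f'"{path}"')
    parts.append(condition)
    if threshold is not None:
        parts.append(f"with threshold {threshold}")
    return " ".join(parts)


def ruleset(rules: Sequence[str]) -> str:
    """Wrap rules in a DQDL `Rules = [ ... ]` block, one rule per line."""
    body = ",\n".join(f"    {rule}" for rule in rules)
    return f"Rules = [\n{body}\n]"


rules_text = ruleset([
    file_freshness_rule(
        "> (now() - 24 hours)",
        path="s3://bucket/artifacts/file/tickets/tickets.parquet",
    ),
    file_freshness_rule(
        'between "2023-01-01" and "2024-01-01"',
        path="s3://bucket/artifacts/file/tickets/",
    ),
    # Path-less form: the files are inferred from the data frame.
    file_freshness_rule("> (now() - 24 hours)"),
])
print(rules_text)
```

Keeping the path optional mirrors the two ways of writing the rule shown above: with an explicit Amazon S3 path, or without one when the files are inferred from the data frame.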