
Building an operationally excellent data pipeline

The Operational Excellence pillar includes the ability to support development and run data pipelines effectively, gain insight into their operations, and to continuously improve supporting processes and procedures to deliver business value. Here are some considerations to review when designing data pipelines using the guidelines of the Operational Excellence pillar of the AWS Well-Architected Framework.

Using AWS Glue blueprints

Blueprints in AWS Glue provide organizations with a mechanism to develop, share, and reuse complex ETL workflows. They help ETL developers generate and publish templates of commonly used ETL workflows that analysts or other non-developers can configure and run without having to write code.

A blueprint in AWS Glue is a zip archive that consists of the following files:

  • A blueprint configuration file — This file describes the name and data type of each blueprint parameter. Parameters can include IAM roles, input/output (I/O) locations, workflow names, and so on. The configuration file also contains a reference to the workflow layout generator function definition.

  • A layout script — This is the implementation of the layout generator. The script performs prerequisite tasks such as creating the catalog tables, temporary paths, and so on. It also creates the AWS Glue workflow, which may be a collection of jobs, triggers, and crawlers (a minimal sketch of such a script follows this list).

  • Scripts (optional) – These are supporting scripts or AWS Glue ETL jobs that constitute the workflow.

  • Supporting documents (optional) — Readme files or any documents supporting the blueprints.
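
To make this structure concrete, the following is a minimal, hypothetical layout script written in Python. It assumes a blueprint whose configuration file declares three parameters (WorkflowName, ScriptLocation, and PassRole) and points at this module's generate_layout function; the awsglue.blueprint classes are those used for blueprint development, but the parameter and script names here are placeholders rather than part of any published blueprint.

# Hypothetical layout script (layout.py) for a single-job blueprint.
# Assumes the blueprint configuration file declares the parameters
# WorkflowName, ScriptLocation, and PassRole, and references
# layout.generate_layout as the layout generator.
from awsglue.blueprint.workflow import Workflow, Entities
from awsglue.blueprint.job import Job

def generate_layout(user_params, system_params):
    # Define one ETL job from the user-supplied blueprint parameters.
    etl_job = Job(
        Name="{}_etl_job".format(user_params["WorkflowName"]),
        Command={
            "Name": "glueetl",
            "ScriptLocation": user_params["ScriptLocation"],
            "PythonVersion": "3",
        },
        Role=user_params["PassRole"],
    )
    # Return the workflow layout; AWS Glue creates these entities
    # (jobs, crawlers, triggers) when the blueprint is run.
    return Workflow(
        Name=user_params["WorkflowName"],
        Entities=Entities(Jobs=[etl_job]),
    )

The corresponding blueprint configuration file would simply map each of these parameter names to a type and point to layout.generate_layout as the layout generator.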

Blueprints can be parametrized, which means that a single blueprint can be used to solve multiple use cases. A few use cases are:

  • Compacting small files on S3

  • Partitioning datasets

  • Creating a database or table snapshot

  • Converting data from one file format to another

Blueprint lifecycle

The following diagram summarizes the lifecycle of an AWS Glue blueprint.


Lifecycle of an AWS Glue blueprint

  1. An AWS Glue developer develops and tests the workflows, and bundles them into a blueprint structure. They publish it to the organization’s central repository. This repository could be a file system, S3, or a code repository such as GitHub.

  2. The AWS Glue admin user exports these blueprints from the centralized store. The central store may be a private repository within the organization, or it can be a public repository that hosts blueprints authored by a community of developers.

  3. The administrator then registers the blueprint with the AWS Glue service and grants the analysts or other users the necessary access.

  4. Analysts or other users can now take the blueprint, configure it to their business needs, create and run workflows, and consume the results without having to write any code (a sketch of these last two steps using the AWS SDK for Python follows this list).
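
To illustrate steps 3 and 4, the following sketch uses the AWS SDK for Python (Boto3) to register a blueprint archive stored in S3 and then run it with user-supplied parameters. The blueprint name, bucket, parameters, and IAM role are placeholders and depend on how the blueprint was authored.

import json
import boto3

glue = boto3.client("glue")

# Step 3: the administrator registers the blueprint archive stored in S3.
glue.create_blueprint(
    Name="compaction_blueprint",  # placeholder blueprint name
    Description="Compacts small files on S3",
    BlueprintLocation="s3://example-bucket/blueprints/compaction.zip",
)

# Step 4: an analyst runs the blueprint, supplying only parameter values.
glue.start_blueprint_run(
    BlueprintName="compaction_blueprint",
    Parameters=json.dumps({
        "WorkflowName": "compact_sales_data",
        "PassRole": "arn:aws:iam::123456789012:role/GlueBlueprintRole",
    }),
    RoleArn="arn:aws:iam::123456789012:role/GlueBlueprintRole",
)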

Orchestrating AWS Glue jobs

You have several mechanisms to orchestrate and automate your AWS Glue jobs using AWS-native and managed orchestration services, such as:

AWS Glue workflows

AWS Glue workflows are a built-in feature of AWS Glue ETL that lets you create workflows of your jobs and crawlers and add triggers to begin the orchestration process. AWS Glue provides a graphical user interface (GUI) for building workflows that is simple to set up and provides core orchestration capability, which makes it ideal for simple workflows that do not involve additional AWS services such as SNS.
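
As a minimal sketch, the following Boto3 calls create a workflow, attach an on-demand trigger that starts an existing job, and then run the workflow; the workflow and job names are placeholders, and the job is assumed to already exist.

import boto3

glue = boto3.client("glue")

# Create an empty workflow container.
glue.create_workflow(
    Name="sales_etl_workflow",
    Description="Runs the daily sales ETL job",
)

# Attach an on-demand trigger that starts an existing job
# whenever the workflow is run.
glue.create_trigger(
    Name="start_sales_etl",
    WorkflowName="sales_etl_workflow",
    Type="ON_DEMAND",
    Actions=[{"JobName": "sales_etl_job"}],
)

# Start a run of the workflow.
glue.start_workflow_run(Name="sales_etl_workflow")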

AWS Step Functions

AWS Step Functions is a low-code, serverless, visual workflow service that you can use to orchestrate AWS services such as AWS Glue, automating ETL jobs and crawlers and integrating with additional AWS services such as Amazon SNS for notifications or AWS Lambda to trigger a workflow when a file is uploaded to S3. Using AWS Step Functions, you can manage failures, retries, parallelization, service integrations, and observability. Being completely serverless makes Step Functions an ideal choice when you don’t want to manage infrastructure for orchestration purposes. A state machine can be created using either the GUI or the Amazon States Language, a JSON-based language used to describe state machines declaratively. Following are some best practices for using AWS Step Functions:

  • Use timeouts to avoid stuck job runs: The Amazon States Language doesn't set timeouts in state machine definitions by default. If there is no explicit timeout, Step Functions often relies solely on a response from an activity worker to know that a task is complete. If something goes wrong and TimeoutSeconds isn't specified, a job run is stuck waiting for a response from the ETL job, which may come much later. To avoid this, specify a reasonable timeout when you create a task in your state machine. Here is an example of a timeout in a task:

    "ActivityState": { "Type": "Task", "Resource": "arn:aws:states:us-east-1:123456789012:activity:HelloWorld", "TimeoutSeconds": 300, "HeartbeatSeconds": 60, "Next": "NextState" }
  • Use Amazon S3 ARNs instead of passing large payloads: Job runs that pass large payloads of data between states can be terminated. If the data you are passing between states might grow to over 262,144 bytes, use S3 to store the data, and pass the Amazon Resource Name (ARN) of the bucket in the Payload parameter to get the bucket name and key value. Alternatively, adjust your implementation so that you pass smaller payloads in your job runs (see the sketch after this list).

  • Amazon CloudWatch Logs resource policy size restrictions: CloudWatch Logs resource policies are limited to 5120 characters. When CloudWatch Logs detects that a policy approaches this size limit, it automatically enables log groups that start with /aws/vendedlogs/. When you create a state machine with logging enabled, Step Functions must update your CloudWatch Logs resource policy with the log group you specify. To avoid reaching the CloudWatch Logs resource policy size limit, prefix your CloudWatch Logs log group names with /aws/vendedlogs/. When you create a log group in the Step Functions console, the log group names are prefixed with /aws/vendedlogs/states. For more information, refer to Enabling Logging from Certain AWS Services.
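
The following sketch illustrates the first two practices: it uses Boto3 to create a state machine whose single task runs an AWS Glue job through the glue:startJobRun.sync service integration, sets an explicit TimeoutSeconds, and passes an S3 path instead of a large payload. The job name, bucket, and IAM role are placeholders.

import json
import boto3

# State machine definition (Amazon States Language) built as a Python dict.
definition = {
    "StartAt": "RunGlueJob",
    "States": {
        "RunGlueJob": {
            "Type": "Task",
            # Service integration that waits for the Glue job run to finish.
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {
                "JobName": "sales_etl_job",  # placeholder job name
                "Arguments": {
                    # Pass a reference to the data in S3, not the data itself.
                    "--INPUT_PATH": "s3://example-bucket/raw/sales/"
                },
            },
            "TimeoutSeconds": 3600,  # avoid stuck job runs
            "End": True,
        }
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="sales-etl-orchestration",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsGlueRole",  # placeholder
)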

Amazon Managed Workflows for Apache Airflow (MWAA)

Amazon Managed Workflows for Apache Airflow (MWAA) is a managed orchestration service for Apache Airflow that makes it easier to set up and operate end-to-end data pipelines in the cloud at scale. Apache Airflow is an open-source tool used to programmatically author, schedule, and monitor sequences of processes and tasks referred to as “workflows.” Amazon MWAA manages the underlying infrastructure for scalability, availability, and security, so that you can spend more time developing workflows and less time operating the orchestration servers. If your team already uses Apache Airflow and you are looking for integration with AWS services such as EMR, Amazon Redshift, S3, and so on, then Amazon MWAA may be a good choice for your workflow orchestration.
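
For example, a DAG that triggers an existing AWS Glue job might look like the following sketch. It assumes the Amazon provider package (apache-airflow-providers-amazon) is available in the Amazon MWAA environment; the DAG ID, job name, and role name are placeholders.

from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

# A minimal DAG that runs an existing AWS Glue job once a day.
with DAG(
    dag_id="sales_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_sales_etl = GlueJobOperator(
        task_id="run_sales_etl",
        job_name="sales_etl_job",      # existing AWS Glue job (placeholder)
        iam_role_name="GlueJobRole",   # placeholder IAM role name
        wait_for_completion=True,
    )

Uploading this file to the DAGs folder of the S3 bucket associated with the environment makes the DAG available in Airflow.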

As MWAA is a managed orchestration service of Apache Airflow, many of the best practices of Apache Airflow for developing directed acyclic graphs (DAGs) are also applicable here. Following are some of the considerations for data pipeline orchestration with MWAA:

  • Make DAGs and tasks immutable — In every workflow run, a DAG should produce the same data. It’s a best practice to make reads from and writes to a partition idempotent.

  • Grant least privilege required — In accordance with standard AWS security best practices, grant permissions to only the resources or actions that users need to perform tasks.

  • Monitor user activity — Use AWS CloudTrail to monitor user activity in your account.

  • Ensure that the S3 bucket policy and object access control lists (ACLs) grant permissions to the users from the associated MWAA environment to put objects in the bucket. This ensures that users with permissions to add workflows to the bucket also have permissions to run the workflows in Airflow.

  • Use the S3 buckets associated with Amazon MWAA environments for Amazon MWAA only. It’s a best practice not to store other objects in the bucket, or use the bucket with another service.

  • For more information on performance tuning best practices, refer to Performance tuning for Apache Airflow on Amazon MWAA.

  • For best practices on managing Python dependencies refer to Managing Python dependencies in requirements.txt.

Table 1 — When to use AWS Glue Workflow, AWS Step Functions, or Amazon MWAA

| Factor | AWS Glue workflows | AWS Step Functions | Amazon Managed Workflows for Apache Airflow (MWAA) |
|---|---|---|---|
| Use case | Suitable when your pipeline consists mostly of AWS Glue jobs and crawlers. | Suitable when there is a need to integrate with different services, including AWS Lambda, SSM, and so on. | Compatible with open-source Airflow and suitable when you want to reuse existing Airflow assets. |
| Infrastructure | Serverless | Serverless | Managed service |
| User interface and supported language | Simple UI in the AWS Glue console, API/SDK, CloudFormation, Python (via custom blueprint) | Interactive, rich UI based on Amazon States Language (ASL), JSON, and YAML | Any supported integrated development environment (IDE) for Python |
| Building a pipeline | Build a data pipeline using AWS Glue jobs written in Python or Scala, plus crawlers. Possible to integrate with other services using the AWS SDK for Python (Boto3). | Build a data pipeline using the Step Functions console. Possible to integrate with non-supported services using Lambda. | Workflows are created as DAGs, defined within a Python file that describes the DAG's structure as code. You add or update Apache Airflow DAGs on your Amazon MWAA environment using the DAGs folder in your S3 bucket. |
| Passing information between tasks/states | Parameters can be shared in the workflow and jobs can refer to them. | Parameters can be passed between states. | Global variables are supported. |
| Resuming failed tasks | Possible to resume a failed task. | Possible to resume the failed part by defining a new state machine. | Possible to resume a failed task. |
| Change management of workflow | Not supported. Possible to delete and re-create the workflow using a blueprint, or modify the jobs and crawlers. | Update the definition and reflect it via the API or console. | Update the DAG and redeploy it. |
| Cost | No additional cost. | Inexpensive with no operational overhead; $0.025 per 1,000 state transitions. | Inexpensive with limited operational overhead. Detailed pricing is available on the Amazon MWAA pricing page. |

Using parameters

Most applications need to be configured in different ways to function in different environments. For example, the same business logic might be run by the same job at different times of the day, or input parameters might change across application environments (dev/test/production, and so on). Like any application, AWS Glue jobs should be designed with this reusability in mind.

AWS Glue jobs support the concept of parameters. These parameters can be used as part of an independent job or as part of an AWS Glue workflow. Being able to pass parameters to a job makes it reusable and reduces the number of code changes needed to meet future requirements. Details on how to work with parameters follow:

A utility function, getResolvedOptions, within the AWS Glue API lets you access the arguments that are passed to a job. The function is under awsglue.utils when working with Python. While using Scala, it is under com.amazonaws.services.glue.util.GlueArgParser.

The function prototype follows:

getResolvedOptions(args, options)
  • args — This corresponds to the list of arguments, implicitly available in the Scala args array or sys.argv in Python.

  • options — This corresponds to the array of argument names or job parameters that you want to retrieve.

Usage

Suppose you pass two parameters, LOCATION and DATE, to your job. The following section shows you the code snippets to retrieve them both.

Scala

import com.amazonaws.services.glue.util.GlueArgParser
...
def main(args: Array[String]) {
  ...
  val jobParams = GlueArgParser.getResolvedOptions(args, Seq("JOB_NAME", "LOCATION", "DATE").toArray)
  val location = jobParams("LOCATION")
  val date = jobParams("DATE")
  ...
}

Python

import sys
from awsglue.utils import getResolvedOptions
...
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'LOCATION', 'DATE'])
location = args['LOCATION']
date = args['DATE']
...

Now that you know how to access the parameters, the following sections show how to pass them to a job.

Via the AWS Glue console

By default, the parameters of an AWS Glue job are empty but you can configure and save them via the AWS Glue console or via Glue Studio for future job runs. This way, you can re-run the job with these preset parameters, and you won’t have to type the values again.

To configure the parameters, expand the Security configuration, script libraries, and job parameters (optional) section of the AWS Glue job:

Add the parameters under the Job parameters heading. Notice the double dash ( -- ) before the parameter name.


Adding job parameters using the AWS Glue console

Via API (boto3)

You can pass the parameters via API as well. However, this doesn’t save the parameter values and they need to be explicitly passed each time the job is run:

import boto3

# Create an AWS Glue client (assumes credentials and Region are configured).
client = boto3.client('glue')

response = client.start_job_run(
    JobName='Job_With_Params',
    Arguments={
        '--LOCATION': 'NY',
        '--DATE': '01-01-1990'
    }
)

Via CLI

You can pass the parameters via CLI as well. However, this doesn’t save the parameter values and they need to be explicitly passed each time the job is run:

aws glue start-job-run --job-name Job_With_Params --arguments <value>

The arguments can be passed using the following shorthand syntax:

--arguments='--LOCATION=NY,--DATE=01-01-1990'

Or using JSON:

--arguments {"LOCATION":"NY","DATE":"01-01-1990"}

Using AWS Glue Studio

Similar to the AWS Glue console, AWS Glue Studio also allows the job parameters to be added and saved for future executions. On the AWS Glue Studio console, navigate to Job Details > Advanced Properties > Job parameters > Add new parameter:


Adding job parameters using AWS Glue Studio