Building a cost-effective data pipeline

A cost-optimized data pipeline fully uses all resources, achieves an outcome at the lowest possible price point, and meets your functional requirements. This section provides best practices for optimizing the price performance of your data pipeline.

The right AWS Glue worker type

This section describes the worker types available in AWS Glue, explains how they differ, and provides guidance on selecting the appropriate worker type for your workload.

The following table summarizes the available AWS Glue worker types:

Table 2 — AWS Glue worker types

Worker name   vCPU   Memory (GB)   Attached storage (GB)
Standard      4      16            50
G.1X          4      16            64
G.2X          8      32            128

When creating an AWS Glue job with any of these worker types, the following rules apply:

Standard 

  • You specify the maximum number of Data Processing Units (DPUs) required for the job

  • Each standard worker launches two executors

  • Each executor launches with four Spark cores

G.1X 

  • You specify the maximum number of workers

  • Each worker corresponds to one DPU

  • Each worker launches one executor

  • Each executor launches with eight Spark cores

  • In AWS Glue 3.0, each executor launches with four Spark cores

G.2X

  • You specify the maximum number of workers

  • Each worker corresponds to two DPUs

  • Each worker launches one executor

  • Each executor launches with 16 Spark cores

  • In AWS Glue 3.0, each executor launches with eight Spark cores

We recommend using G.1X or G.2X workers for jobs authored in AWS Glue 2.0 and above. For jobs that require more data parallelism (for example, jobs that benefit from horizontal scaling), adding more G.1X workers is recommended. For jobs that have intense memory requirements, or ones that benefit from vertical scaling, adding more G.2X workers is recommended. Additionally, G.2X jobs benefit from having additional disk space.
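
For illustration, the following is a minimal sketch that creates such a job with the AWS SDK for Python (Boto3). The job name, IAM role, script location, and worker count are placeholder assumptions to adjust for your own workload.

    import boto3

    glue = boto3.client("glue")

    # Placeholder job definition; the name, role, script location, and sizing
    # are illustrative assumptions, not values from this guide.
    glue.create_job(
        Name="example-memory-intensive-job",   # hypothetical job name
        Role="GlueJobRole",                    # hypothetical IAM role
        Command={
            "Name": "glueetl",
            "ScriptLocation": "s3://example-bucket/scripts/job.py",
            "PythonVersion": "3",
        },
        GlueVersion="3.0",
        WorkerType="G.2X",      # choose G.1X instead for horizontally scaling jobs
        NumberOfWorkers=10,     # adjust to your workload
    )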

Estimate AWS Glue DPU

AWS Glue has an autoscaling feature that helps avoid the complexities involved in calculating the right number of DPUs for a job. AWS Glue 3.0 jobs can be configured to auto-scale, meaning that both batch and streaming jobs can dynamically scale resources up and down based on the workload. With autoscaling, there is no longer a need to worry about over-provisioning resources for jobs, spend time optimizing the number of workers, or pay for idle workers.

Common scenarios where automatic scaling helps with cost and utilization for your Spark applications include:

  • A Spark driver listing a large number of files in Amazon S3 or performing a load while executors are inactive

  • Spark stages running with only a few executors due to over-provisioning

  • Data skew or uneven computation demand across Spark stages

To enable autoscaling, set the --enable-auto-scaling job parameter to true, or enable it from AWS Glue Studio while authoring the job. Additionally, choose the worker type and the maximum number of workers, and AWS Glue chooses the right-sized resources for the workload.

Automatic scaling is available for AWS Glue jobs with both the G.1X and G.2X worker types. The Standard worker type is not supported.
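
As a minimal sketch, the following Boto3 call sets the --enable-auto-scaling job parameter on the hypothetical job from the previous example. UpdateJob overwrites the previous job definition, so the unchanged placeholder fields are restated, and NumberOfWorkers now acts as the maximum worker count.

    import boto3

    glue = boto3.client("glue")

    # UpdateJob replaces the whole job definition, so unchanged fields are restated.
    glue.update_job(
        JobName="example-memory-intensive-job",    # hypothetical job from the previous sketch
        JobUpdate={
            "Role": "GlueJobRole",                 # hypothetical IAM role
            "Command": {
                "Name": "glueetl",
                "ScriptLocation": "s3://example-bucket/scripts/job.py",
                "PythonVersion": "3",
            },
            "GlueVersion": "3.0",
            "WorkerType": "G.2X",
            "NumberOfWorkers": 20,                 # acts as the maximum with autoscaling
            "DefaultArguments": {"--enable-auto-scaling": "true"},
        },
    )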

When not using autoscaling, you can use a rough calculation to estimate the AWS Glue job's DPU requirement. The following section provides more details on the approach.

To estimate the DPU requirements at the job level, let's break jobs down into different complexity grades: Low, Medium, and High. The sizing of the jobs is based purely on the number of transformations.

A job that only moves data from source A to target B, with no transformation or with minor data filtering, can be considered Low on the complexity scale. Similarly, a job that involves multiple joins, UDFs, window functions, and so on can be considered a High complexity job.

The maximum number of workers you can define is 299 for G.1X, and 149 for G.2X. These are not hard limits and can be increased.

Let’s attach the following weights to each complexity scale:

Table 3 — Complexity weight by complexity level

Complexity level   Weight
Low                2
Medium             6
High               10

Next, we apply the following formula to calculate the DPU requirements for a job based on the G.1X worker type.

DPU Estimate = MIN(CEIL((data_volume_in_GB * weight) / 16, 1) + 1, 299)

Let’s consider the following scenario:

Table 4 — Sample low complexity job

Job name      Job 1
Profile       Low
Data volume   160 GB

Based on the previous scenario, the following calculation applies:

DPU Estimate = MIN(CEIL((160 * 2) / 16, 1) + 1, 299) = MIN(21, 299) = 21

For the same data input, the following table lists the DPU estimates for each complexity level:

Table 5 — DPU estimate by complexity level

Complexity level   DPU estimate
Low                21
Medium             61
High               101
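
For reference, the following is a minimal Python translation of the formula and tables above; the function name is illustrative, and the default cap of 299 reflects the G.1X maximum worker count.

    import math

    def estimate_dpu(data_volume_in_gb, weight, max_workers=299):
        # DPU Estimate = MIN(CEIL((data_volume_in_GB * weight) / 16, 1) + 1, 299)
        return min(math.ceil(data_volume_in_gb * weight / 16) + 1, max_workers)

    # 160 GB of input data at each complexity level (weights from Table 3)
    print(estimate_dpu(160, 2))   # Low    -> 21
    print(estimate_dpu(160, 6))   # Medium -> 61
    print(estimate_dpu(160, 10))  # High   -> 101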

Be advised that the data needs to be partitioned, and should have at least as many partitions as the number of Spark cores, in order to be processed efficiently. The preceding calculations are designed to help you get started with a worker configuration. Once you set up and run your AWS Glue jobs, you can monitor the actual usage, which may match the demand or be slightly over or under. Based on the outcome, you can adjust and further optimize your worker counts to meet your processing requirements.
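
As a minimal sketch of that partitioning check inside a job script (the input path is a placeholder), the following snippet repartitions a DataFrame so it has at least as many partitions as the Spark cores available to the job:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Placeholder input; in an AWS Glue job this could also come from a DynamicFrame source.
    df = spark.read.parquet("s3://example-bucket/input/")

    # defaultParallelism roughly equals the total executor cores available to the job.
    total_cores = spark.sparkContext.defaultParallelism

    if df.rdd.getNumPartitions() < total_cores:
        df = df.repartition(total_cores)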