Optimize the ETL ingestion of input file size on AWS

Created by Apoorva Patrikar (AWS)

Environment: PoC or pilot

Technologies: Analytics; Data lakes

Workload: Open-source

AWS services: AWS Glue; Amazon S3

Summary

This pattern shows you how to optimize the ingestion step of the extract, transform, and load (ETL) process for big data and Apache Spark workloads on AWS Glue by optimizing file size before processing your data. Use this pattern to prevent or resolve the small files problem, which occurs when a large number of small files slows down data processing because of the overhead of handling each individual file. For example, hundreds of files that are only a few hundred kilobytes each can significantly slow down your AWS Glue jobs, because AWS Glue must perform internal list functions on Amazon Simple Storage Service (Amazon S3) and YARN (Yet Another Resource Negotiator) must store a large amount of metadata. To improve data processing speeds, you can use grouping to enable your ETL tasks to read a group of input files into a single in-memory partition. The partition automatically groups smaller files together. Alternatively, you can use custom code to add batch logic to your existing files.

Prerequisites and limitations

Prerequisites

Architecture

The following diagram shows how data in different formats is processed by an AWS Glue job and then stored in an S3 bucket.

The diagram shows the following workflow:

  1. An AWS Glue job converts small files in CSV, JSON, and Parquet formats into dynamic frames. Note: The size of the input files has the most significant impact on the performance of the AWS Glue job.

  2. The AWS Glue job performs internal list functions in an S3 bucket.

Tools

  • AWS Glue is a fully managed ETL service. It helps you reliably categorize, clean, enrich, and move data between data stores and data streams.

  • Amazon Simple Storage Service (Amazon S3) is a cloud-based object storage service that helps you store, protect, and retrieve any amount of data.

Epics

Use grouping in AWS Glue

Specify the group size.

If you have more than 50,000 files, grouping is enabled automatically. However, you can also use grouping for fewer than 50,000 files by specifying the group size in the connectionOptions parameter. The connectionOptions parameter is in the create_dynamic_frame.from_options method.

Skills required: Data engineer

Write the grouping code.

Use the create_dynamic_frame.from_options method to create a dynamic frame. For example:

S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
    format_options={"multiline": False},
    connection_type="s3",
    format="json",
    connection_options={
        "paths": ["s3://bucket/prefix/file.json"],
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": 1048576,
    },
    transformation_ctx="S3bucket_node1",
)

Note: Use groupFiles to group files in an Amazon S3 partition group. Use groupSize to set the target size of the group to be read in memory. Specify groupSize in bytes (1048576 = 1 MB).

Skills required: Data engineer

Add the code to the workflow.

Add the grouping code to your job workflow in AWS Glue, as shown in the example script after this task.

Skills required: Data engineer
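For reference, the following is a minimal sketch of a complete AWS Glue job script that applies the grouping settings from the previous tasks. The bucket names, prefixes, and output format are placeholder assumptions that you would replace with your own values.

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read many small JSON files as grouped, ~1 MB in-memory partitions.
source = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    format="json",
    connection_options={
        "paths": ["s3://example-bucket/raw/"],  # placeholder input prefix
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": 1048576,
    },
    transformation_ctx="source",
)

# Apply your transformations here, then write larger output files to Amazon S3.
glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    format="parquet",
    connection_options={"path": "s3://example-bucket/processed/"},  # placeholder output prefix
    transformation_ctx="sink",
)

job.commit()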

Use custom code to batch input files

Choose the language and processing platform.

Choose the scripting language and processing platform that are best suited to your use case.

Skills required: Cloud architect

Write the code.

Write the custom logic to batch your files together, as shown in the sketch after this task.

Skills required: Cloud architect
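One way to implement the batch logic, assuming the small files live under a single S3 prefix, is to list the objects with boto3, group their keys into batches that approach a target total size, and then pass each batch of paths to your reader. The bucket name, prefix, and target size below are placeholder assumptions.

import boto3

TARGET_BATCH_BYTES = 128 * 1024 * 1024  # placeholder target: ~128 MB per batch

def batch_s3_keys(bucket, prefix, target_bytes=TARGET_BATCH_BYTES):
    """Group object keys under an S3 prefix into batches of roughly target_bytes."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    batch, batch_size, batches = [], 0, []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if batch and batch_size + obj["Size"] > target_bytes:
                batches.append(batch)
                batch, batch_size = [], 0
            batch.append(f"s3://{bucket}/{obj['Key']}")
            batch_size += obj["Size"]
    if batch:
        batches.append(batch)
    return batches

# Example usage (placeholder values): read each batch of small files as one dynamic frame.
# for paths in batch_s3_keys("example-bucket", "raw/"):
#     frame = glueContext.create_dynamic_frame.from_options(
#         connection_type="s3", format="json",
#         connection_options={"paths": paths},
#     )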

Add the code to the workflow.

Add the code to your job workflow in AWS Glue. This enables your custom logic to be applied every time the job is run.

Data engineer

Optimize output files for consumption

Analyze consumption patterns.

Find out how downstream applications will use the data that you write. For example, if downstream applications query the data each day but you partition the data only by Region, or if your output files are very small (such as 2.5 KB per file), then the data isn't optimized for consumption.

Skills required: DBA

Repartition data before writing.

Repartition the data during processing, based on your processing logic (such as joins or queries), and again before writing, based on how the data will be consumed. For example, repartition to a specific number of partitions, such as .repartition(100), or repartition by column, such as .repartition("column_name").

Skills required: Data engineer
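As an illustration, the following sketch repartitions the data by a column before writing, so that downstream daily queries read fewer, larger files. The frame, column name, and output path are placeholder assumptions.

# Convert the dynamic frame to a Spark DataFrame to control partitioning.
df = source.toDF()  # "source" is the dynamic frame from the ingestion step

# Repartition by the column that consumers filter on (placeholder column name),
# then write larger Parquet files partitioned by that column.
(
    df.repartition("region")
      .write.mode("overwrite")
      .partitionBy("region")
      .parquet("s3://example-bucket/curated/")  # placeholder output path
)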

Related resources

Additional information

Determining file size

There is no straightforward way to determine if a file size is too big or too small. The impact of file size on processing performance depends on the configuration of your cluster. In core Hadoop, we recommend that you use files that are 128 MB or 256 MB to make the most of the block size.

For most text file workloads on AWS Glue, we recommend a file size between 100 MB and 1 GB for a 5-10 DPU cluster. To determine the best size for your input files, monitor the preprocessing section of your AWS Glue job, and then check the CPU utilization and memory utilization of the job.
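If it helps to quantify the problem first, a quick check such as the following sketch reports the object count and average object size under an S3 prefix, which you can compare against the guidance above. The bucket and prefix are placeholders.

import boto3

def summarize_prefix(bucket, prefix):
    """Report the object count and average object size (in MB) under an S3 prefix."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    count, total_bytes = 0, 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            count += 1
            total_bytes += obj["Size"]
    avg_mb = (total_bytes / count) / (1024 * 1024) if count else 0
    print(f"{count} objects, average size {avg_mb:.2f} MB")

# Example usage (placeholder values):
# summarize_prefix("example-bucket", "raw/")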

Additional considerations

If performance in the early ETL stages is a bottleneck, consider grouping or merging the data files before processing. If you have complete control over the file generation process, it can be even more efficient to aggregate data points on the source system itself before the raw data is sent to AWS.