Optimize the ETL ingestion of input file size on AWS
Created by Apoorva Patrikar (AWS)
Environment: PoC or pilot | Technologies: Analytics; Data lakes | Workload: Open-source |
AWS services: AWS Glue; Amazon S3 |
Summary
This pattern shows you how to optimize the ingestion step of the extract, transform, and load (ETL) process for big data and Apache Spark workloads on AWS Glue by optimizing file size before you process your data. Use this pattern to prevent or resolve the small files problem, which occurs when a large number of small files slows down data processing because of per-file overhead rather than the aggregate size of the data. For example, hundreds of files that are only a few hundred kilobytes each can significantly slow down processing for your AWS Glue jobs, because AWS Glue must perform internal list operations on Amazon Simple Storage Service (Amazon S3) and YARN (Yet Another Resource Negotiator) must store a large amount of per-file metadata. To improve data processing speeds, you can use grouping, which enables your ETL tasks to read a group of input files into a single in-memory partition; the partition automatically groups smaller files together. Alternatively, you can use custom code to add batch logic that consolidates your existing files.
Prerequisites and limitations
Prerequisites
An active AWS account
One or more AWS Glue jobs
One or more big data or Apache Spark workloads
An S3 bucket
Architecture
The following diagram shows how an AWS Glue job processes data in different formats and then stores it in an S3 bucket, giving you visibility into performance.
The diagram shows the following workflow:
An AWS Glue job converts small files in CSV, JSON, and Parquet format to dynamic frames. Note: The size of the input file has the most significant impact on the performance of the AWS Glue job.
The AWS Glue job performs internal list functions in an S3 bucket.
Tools
AWS Glue is a fully managed ETL service. It helps you reliably categorize, clean, enrich, and move data between data stores and data streams.
Amazon Simple Storage Service (Amazon S3) is a cloud-based object storage service that helps you store, protect, and retrieve any amount of data.
Epics
Task | Description | Skills required |
---|---|---|
Specify the group size. | If you have more than 50,000 files, grouping is enabled automatically. However, you can also use grouping for fewer than 50,000 files by specifying the group size in the groupSize connection option. | Data engineer |
Write the grouping code. | In your AWS Glue ETL script, set the groupFiles and groupSize connection options when you create the dynamic frame, as shown in the sketch after this table. Note: Set groupFiles to "inPartition" to group files within an Amazon S3 partition, and set groupSize to the target group size in bytes. | Data engineer |
Add the code to the workflow. | Add the grouping code to your job workflow in AWS Glue. | Data engineer |
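The following is a minimal sketch of the grouping code, assuming a PySpark AWS Glue job that reads CSV files. The S3 path is hypothetical, and the groupSize value of roughly 1 MB is only an example that you should tune for your workload.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard AWS Glue job setup.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read many small CSV files into one dynamic frame.
# groupFiles="inPartition" groups files within each S3 partition;
# groupSize is the target group size in bytes (about 1 MB here, an example value).
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://my-bucket/input/"],  # hypothetical path
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": "1048576",
    },
    format="csv",
    format_options={"withHeader": True},
)

job.commit()
```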
Task | Description | Skills required |
---|---|---|
Choose the language and processing platform. | Choose the scripting language and processing platform tailored to your use case. | Cloud architect |
Write the code. | Write the custom logic to batch your files together, as shown in the sketch after this table. | Cloud architect |
Add the code to the workflow. | Add the code to your job workflow in AWS Glue. This enables your custom logic to be applied every time the job is run. | Data engineer |
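If you choose Python and Spark, one possible shape for the custom batch logic is a small compaction job that rewrites many small files as fewer, larger ones before the main ETL job runs. This is a sketch, not a reference implementation; the bucket, prefixes, and 256 MB target are hypothetical values to replace with your own.

```python
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-files").getOrCreate()
s3 = boto3.client("s3")

# Hypothetical bucket and prefixes; replace with your own.
bucket = "my-bucket"
source_prefix = "raw/events/"
source_path = f"s3://{bucket}/{source_prefix}"
target_path = f"s3://{bucket}/compacted/events/"

# Sum the sizes of the input objects to pick a sensible output partition count.
paginator = s3.get_paginator("list_objects_v2")
total_bytes = sum(
    obj["Size"]
    for page in paginator.paginate(Bucket=bucket, Prefix=source_prefix)
    for obj in page.get("Contents", [])
)

# Aim for roughly 256 MB per output file (an example target; tune for your cluster).
target_bytes_per_file = 256 * 1024 * 1024
num_partitions = max(1, total_bytes // target_bytes_per_file)

# Read the many small JSON files and rewrite them as fewer, larger Parquet files.
df = spark.read.json(source_path)
df.repartition(int(num_partitions)).write.mode("overwrite").parquet(target_path)
```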
Task | Description | Skills required |
---|---|---|
Analyze consumption patterns. | Find out how downstream applications will use the data that you write. For example, if they query data by day but you partition the data only by Region, or if you write very small output files (such as 2.5 KB per file), the layout is not optimal for consumption. | DBA |
Repartition data before writing. | Repartition based on the joins or queries used during processing (processing logic) and again after processing (consumption patterns). For example, repartition based on byte size so that output files fall in the 100 MB to 1 GB range recommended in the Additional information section, as shown in the sketch after this table. | Data engineer |
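A minimal sketch of repartitioning before the write, assuming the dynamic frame dyf from the grouping step and a hypothetical event_date column that downstream queries filter on; the output path is also hypothetical.

```python
# Convert the dynamic frame to a Spark DataFrame for finer control over the write.
df = dyf.toDF()

# Repartition by the column that consumers query on, so each partition
# directory is written as a small number of larger files instead of many tiny ones.
df = df.repartition("event_date")

(
    df.write
    .mode("overwrite")
    .partitionBy("event_date")                   # hypothetical consumption-side partition key
    .parquet("s3://my-bucket/curated/events/")   # hypothetical output path
)
```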
Additional information
Determining file size
There is no straightforward way to determine if a file size is too big or too small. The impact of file size on processing performance depends on the configuration of your cluster. In core Hadoop, we recommend that you use files that are 128 MB or 256 MB to make the most of the block size.
For most text file workloads on AWS Glue, we recommend a file size between 100 MB and 1 GB for a 5-10 DPU cluster. To determine the best input file size, monitor the preprocessing section of your AWS Glue job, and then check the job's CPU utilization and memory utilization.
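One way to check CPU and memory utilization programmatically is to query the job metrics that AWS Glue publishes to Amazon CloudWatch when job metrics are enabled. The job name below is hypothetical, and you should verify the metric names against the job metrics that appear in your CloudWatch console.

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

job_name = "my-etl-job"  # hypothetical AWS Glue job name
end = datetime.now(timezone.utc)
start = end - timedelta(hours=3)


def recent_averages(metric_name: str) -> list:
    """Fetch the recent 5-minute averages of one AWS Glue job metric."""
    response = cloudwatch.get_metric_statistics(
        Namespace="Glue",
        MetricName=metric_name,
        Dimensions=[
            {"Name": "JobName", "Value": job_name},
            {"Name": "JobRunId", "Value": "ALL"},
            {"Name": "Type", "Value": "gauge"},
        ],
        StartTime=start,
        EndTime=end,
        Period=300,
        Statistics=["Average"],
    )
    return sorted(response["Datapoints"], key=lambda d: d["Timestamp"])


# Driver CPU load and executor JVM heap usage are good first indicators of
# whether the job is struggling with many small input files.
for name in ("glue.driver.system.cpuSystemLoad", "glue.ALL.jvm.heap.usage"):
    for point in recent_averages(name):
        print(name, point["Timestamp"], round(point["Average"], 3))
```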
Additional considerations
If performance in the early ETL stages is a bottleneck, consider grouping or merging the data files before processing. If you have complete control over the file generation process, it can be even more efficient to aggregate data points on the source system before the raw data is sent to AWS.
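If you control the producer, a minimal sketch of source-side aggregation is to buffer data points in memory and upload them as one larger object instead of writing each record as its own file. The bucket, prefix, and 64 MB flush threshold are hypothetical values.

```python
import io
import json
import time

import boto3

s3 = boto3.client("s3")

# Hypothetical destination; replace with your own bucket and prefix.
BUCKET = "my-bucket"
PREFIX = "raw/events/"
FLUSH_BYTES = 64 * 1024 * 1024  # flush once ~64 MB has accumulated (example value)

buffer = io.BytesIO()


def add_record(record: dict) -> None:
    """Append one data point to the in-memory buffer instead of writing it as its own object."""
    buffer.write((json.dumps(record) + "\n").encode("utf-8"))
    if buffer.tell() >= FLUSH_BYTES:
        flush()


def flush() -> None:
    """Upload the accumulated records as a single, larger S3 object."""
    if buffer.tell() == 0:
        return
    key = f"{PREFIX}batch-{int(time.time())}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=buffer.getvalue())
    buffer.seek(0)
    buffer.truncate()
```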