Best practices

When developing with AWS Glue, consider the following best practices.

Develop locally first

To save time and cost while building your ETL jobs, test your code and business logic locally first. For instructions on setting up a Docker container that you can use to test AWS Glue ETL jobs both in a shell and in an integrated development environment (IDE), see the blog post Develop and test AWS Glue jobs locally using a Docker container.
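One approach is to keep the business logic in a plain function so the same script can be tested in the local container and then deployed unchanged. The following minimal sketch assumes an AWS Glue PySpark environment; the bucket paths and the "status" column are placeholders, and when you run it locally with spark-submit you pass --JOB_NAME yourself.

  import sys

  from awsglue.context import GlueContext
  from awsglue.job import Job
  from awsglue.utils import getResolvedOptions
  from pyspark.context import SparkContext

  # When running locally in the container with spark-submit, pass --JOB_NAME local-test.
  args = getResolvedOptions(sys.argv, ["JOB_NAME"])

  sc = SparkContext.getOrCreate()
  glue_context = GlueContext(sc)
  spark = glue_context.spark_session

  job = Job(glue_context)
  job.init(args["JOB_NAME"], args)

  def transform(df):
      # Business logic lives in a plain function so it can be tested against a
      # small local sample before running on full datasets ("status" is a
      # hypothetical column).
      return df.filter(df["status"] == "ACTIVE")

  source_df = spark.read.json("s3://example-bucket/raw/orders/")  # placeholder path
  transform(source_df).write.mode("overwrite").parquet("s3://example-bucket/curated/orders/")

  job.commit()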

Use AWS Glue interactive sessions

AWS Glue interactive sessions provide a serverless Spark backend, coupled with an open-source Jupyter kernel that integrates with notebooks and IDEs such as PyCharm, IntelliJ, and VS Code. By using interactive sessions, you can test your code on real datasets with the AWS Glue Spark backend and the IDE of your choice. To get started, follow the steps in Getting started with AWS Glue interactive sessions.
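For example, in a notebook cell backed by an interactive session, you can pull a sample from a Data Catalog table and inspect it before committing the logic to a job script. The database and table names below are hypothetical.

  from awsglue.context import GlueContext
  from pyspark.context import SparkContext

  # In an interactive session, the kernel provisions the Spark backend on demand.
  sc = SparkContext.getOrCreate()
  glue_context = GlueContext(sc)

  # Read a Data Catalog table (names are hypothetical) and inspect a sample
  # interactively before wiring the logic into a full job script.
  dyf = glue_context.create_dynamic_frame.from_catalog(
      database="sales_db",
      table_name="orders",
  )
  dyf.printSchema()
  dyf.toDF().show(10)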

Use partitioning to query exactly what you need

Partitioning refers to dividing a large dataset into smaller partitions based on specific columns or keys. When data is partitioned, AWS Glue can perform selective scans on a subset of data that satisfies specific partitioning criteria, rather than scanning the entire dataset. This results in faster and more efficient query processing, especially when working with large datasets.

Partition data based on the queries that will be run against it. For example, if most queries filter on a particular column, partitioning on that column can greatly reduce query time. To learn more about partitioning data, see Work with partitioned data in AWS Glue.
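The sketch below, which assumes hypothetical database, table, column, and bucket names, writes output partitioned by year and month and then reads it back with a pushdown predicate so that only the matching partitions are scanned.

  from awsglue.context import GlueContext
  from pyspark.context import SparkContext

  glue_context = GlueContext(SparkContext.getOrCreate())

  # Read the source table (database and table names are hypothetical).
  orders_dyf = glue_context.create_dynamic_frame.from_catalog(
      database="sales_db", table_name="orders_raw"
  )

  # Write the output partitioned by the columns most queries filter on.
  glue_context.write_dynamic_frame.from_options(
      frame=orders_dyf,
      connection_type="s3",
      connection_options={
          "path": "s3://example-bucket/curated/orders/",
          "partitionKeys": ["year", "month"],
      },
      format="parquet",
  )

  # Subsequent reads can prune partitions with a pushdown predicate, so AWS Glue
  # scans only the matching S3 prefixes instead of the whole table.
  recent_dyf = glue_context.create_dynamic_frame.from_catalog(
      database="sales_db",
      table_name="orders",
      push_down_predicate="year == '2024' AND month == '06'",
  )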

Optimize memory management

Memory management is crucial when writing AWS Glue ETL jobs because they run on the Apache Spark engine, which is optimized for in-memory processing. The blog post Optimize memory management in AWS Glue provides details on the following memory management techniques; a brief sketch of the grouping and exclusion options follows the list:

  • Amazon S3 list implementation of AWS Glue

  • Grouping

  • Excluding irrelevant Amazon S3 paths and storage classes

  • Spark and AWS Glue read partitioning

  • Bulk inserts

  • Join optimizations

  • PySpark user-defined functions (UDFs)

  • Incremental processing
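As an illustration of the grouping and exclusion techniques in that list, the following sketch groups small files into larger input splits and skips Amazon S3 objects the job does not need. The paths, patterns, and sizes are illustrative, not prescriptive.

  from awsglue.context import GlueContext
  from pyspark.context import SparkContext

  glue_context = GlueContext(SparkContext.getOrCreate())

  # Group many small files into larger input splits so the driver does not
  # have to track one task per file (path and group size are illustrative).
  grouped_dyf = glue_context.create_dynamic_frame.from_options(
      connection_type="s3",
      connection_options={
          "paths": ["s3://example-bucket/raw/events/"],
          "recurse": True,
          "groupFiles": "inPartition",
          "groupSize": "10485760",  # target roughly 10 MB per group
      },
      format="json",
  )

  # Skip Amazon S3 paths that are irrelevant to the job by listing exclusion
  # patterns (the patterns shown are illustrative).
  catalog_dyf = glue_context.create_dynamic_frame.from_catalog(
      database="sales_db",
      table_name="events",
      additional_options={"exclusions": '["**.gz", "**.tmp"]'},
  )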

Use efficient data storage formats

When authoring ETL jobs, we recommend outputting transformed data in a columnar format. Columnar formats, such as Apache Parquet and ORC, are commonly used in big data storage and analytics. They are designed to minimize data movement and maximize compression, and they are splittable, so multiple readers can process the data in parallel during query processing.

Compressing data also helps reduce the amount of data stored, and it improves read/write operation performance. AWS Glue supports multiple compression formats natively. For more information about efficient data storage and compression, see Building a performance efficient data pipeline.
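For example, a job can write its output as Snappy-compressed Parquet; the table and path names below are hypothetical.

  from awsglue.context import GlueContext
  from pyspark.context import SparkContext

  glue_context = GlueContext(SparkContext.getOrCreate())

  # Read the source table (names are hypothetical) ...
  dyf = glue_context.create_dynamic_frame.from_catalog(
      database="sales_db", table_name="orders"
  )

  # ... and write it back out as Snappy-compressed Parquet, a splittable,
  # columnar format that analytics engines can read in parallel.
  glue_context.write_dynamic_frame.from_options(
      frame=dyf,
      connection_type="s3",
      connection_options={"path": "s3://example-bucket/curated/orders_parquet/"},
      format="glueparquet",
      format_options={"compression": "snappy"},
  )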

Use the appropriate type of scaling

Understanding when to scale horizontally (change the number of workers) or vertically (change the worker type) is important because the choice affects both the cost and the performance of your AWS Glue ETL jobs.

Generally, ETL jobs with complex transformations are more memory-intensive and require vertical scaling (for example, moving from the G.1X to the G.2X worker type). For compute-intensive ETL jobs with large volumes of data, we recommend horizontal scaling because AWS Glue is designed to process that data in parallel across multiple nodes.
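Worker type and worker count can also be set per run through the StartJobRun API, which is a convenient way to try out scaling options before changing the job definition. The job name and values below are illustrative.

  import boto3

  glue = boto3.client("glue")

  # Vertical scaling: run a memory-intensive job on larger workers.
  glue.start_job_run(
      JobName="my-etl-job",  # hypothetical job name
      WorkerType="G.2X",
      NumberOfWorkers=10,
  )

  # Horizontal scaling: keep the worker type but add workers for a
  # compute-intensive run over a large volume of data.
  glue.start_job_run(
      JobName="my-etl-job",
      WorkerType="G.1X",
      NumberOfWorkers=40,
  )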

Closely monitoring AWS Glue job metrics in Amazon CloudWatch helps you determine whether a performance bottleneck is caused by a lack of memory or compute. For more information about AWS Glue worker types and scaling, see Best practices to scale Apache Spark jobs and partition data with AWS Glue.
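As a starting point, you can retrieve a job's driver heap usage from CloudWatch with boto3. The metric name, namespace, and dimensions below follow the AWS Glue job metrics naming but are assumptions to verify against the metrics your jobs actually emit, and job metrics must be enabled on the job; the job name is hypothetical.

  from datetime import datetime, timedelta

  import boto3

  cloudwatch = boto3.client("cloudwatch")

  # Fetch driver JVM heap usage over the last few hours (metric name,
  # dimensions, and job name are assumptions to verify for your account).
  response = cloudwatch.get_metric_statistics(
      Namespace="Glue",
      MetricName="glue.driver.jvm.heap.usage",
      Dimensions=[
          {"Name": "JobName", "Value": "my-etl-job"},
          {"Name": "JobRunId", "Value": "ALL"},
          {"Name": "Type", "Value": "gauge"},
      ],
      StartTime=datetime.utcnow() - timedelta(hours=3),
      EndTime=datetime.utcnow(),
      Period=300,
      Statistics=["Average", "Maximum"],
  )

  for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
      print(point["Timestamp"], point["Average"], point["Maximum"])

Sustained high heap usage suggests scaling vertically to a larger worker type, whereas long run times with modest memory pressure suggest adding workers.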