Best practices

The following best practices, ranging from beginner to advanced, are drawn from the AWS Glue documentation and related blog posts.

Run locally first

To save cost and time while building your ETL jobs, start development locally so that you can test your code and business logic before running it in AWS Glue. This is a best practice even if you are experienced. For instructions on setting up a Docker container that helps you write AWS Glue ETL jobs both in a shell and in an integrated development environment (IDE), see the blog post Developing AWS Glue ETL jobs locally using a container.
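
The following is a minimal sketch of a Glue PySpark job that you could run inside such a local development container. The S3 path, the field name, and the filter condition are placeholders for your own data and business logic, not values from this guide.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import Filter
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job boilerplate; works the same locally and in AWS Glue.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read a small sample dataset (hypothetical bucket and prefix).
source = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-bucket/sample/"]},
    format="json",
)

# Placeholder business logic: keep only records whose (hypothetical)
# "status" field equals "active", then inspect the result.
filtered = Filter.apply(frame=source, f=lambda record: record["status"] == "active")
filtered.printSchema()

job.commit()
```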

Use a development endpoint

You can use a development endpoint with AWS Glue ETL for PySpark and Scala. A development endpoint lets you test your code against the real dataset; work with a sample of the original data to keep iteration fast. To get started, follow the steps in Adding a development endpoint.
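
As a sketch of this approach, the following snippet reads a cataloged table and works with a small random sample of it. The database and table names are hypothetical; adjust the sampling fraction to your dataset size.

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Load the real dataset from the Data Catalog (hypothetical names).
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="example_db",
    table_name="example_table",
)

# Iterate against a ~1% sample instead of the full dataset.
sample_df = dyf.toDF().sample(withReplacement=False, fraction=0.01, seed=42)
sample_df.show(20)
```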

Use partitioning to query exactly what you need

You can use partitioning to filter the data at loading time instead of loading the entire dataset. The blog post Work with partitioned data in AWS Glue illustrates partitioning data and the predicate pushdown feature of AWS Glue.
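
For example, a pushdown predicate on a partitioned catalog table restricts the read to matching partitions. This is a sketch; the database, table, and partition keys (year, month) are assumptions about your schema.

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Only partitions that satisfy the predicate are listed and loaded,
# instead of scanning the entire table.
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="example_db",                                # hypothetical database
    table_name="example_events",                          # hypothetical partitioned table
    push_down_predicate="year == '2025' and month == '01'",
)
print(dyf.count())
```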

Optimize memory management

One of the most important factors in writing AWS Glue jobs is memory management, because AWS Glue runs its ETL jobs on Apache Spark. For more information, see the blog post Optimize memory management in AWS Glue, which explains how to use the following techniques (a short sketch of grouping and path exclusions appears after this list):

  • Amazon S3 list implementation of AWS Glue

  • Grouping

  • Excluding Amazon S3 paths that are not required for the job

  • Filtering on Amazon S3 storage class

  • Spark read partitioning

  • AWS Glue read partitioning

  • Bulk inserts

  • JDBC optimizations

  • Join optimizations

  • PySpark user-defined functions (UDFs)

  • Incremental processing

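As a hedged illustration of two of these techniques, the following sketch groups many small Amazon S3 files into larger read tasks and excludes paths the job does not need. The bucket, group size, and exclusion patterns are placeholders for your own values.

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# groupFiles/groupSize coalesce many small files into fewer, larger tasks,
# and exclusions skip S3 objects that are not needed for the job.
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://example-bucket/input/"],      # hypothetical bucket
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": "134217728",                     # ~128 MB per group
        "exclusions": "[\"**.tmp\", \"**_SUCCESS\"]", # skip temp and marker files
    },
    format="json",
)
```
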
Use the appropriate type of scaling

Understanding how to scale is crucial to writing performant ETL jobs. Compute-intensive AWS Glue jobs that have a high degree of data parallelism can benefit from horizontal scaling (more Standard or G.1X workers). ETL jobs that need high memory or ample disk space to store intermediate shuffle output can benefit from vertical scaling (G.1X or G.2X workers). The blog post Best practices to scale Apache Spark jobs and partition data with AWS Glue discusses worker types and scaling in AWS Glue.
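
The following boto3 sketch creates a job with an explicit worker type and worker count; the job name, IAM role, and script location are placeholders, not values from this guide. Raising NumberOfWorkers scales horizontally, and switching WorkerType to G.2X scales vertically for memory- or shuffle-heavy jobs.

```python
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="example-etl-job",                                   # hypothetical job name
    Role="arn:aws:iam::123456789012:role/ExampleGlueJobRole", # hypothetical role
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://example-bucket/scripts/job.py",
        "PythonVersion": "3",
    },
    GlueVersion="3.0",
    WorkerType="G.1X",    # use "G.2X" for memory- or shuffle-intensive jobs
    NumberOfWorkers=10,   # increase for more data parallelism
)
```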