Getting started with serverless ETL on AWS Glue - AWS Prescriptive Guidance

Adnan Alvee, Associate Consultant, Amazon Web Services

January 2021

On the Amazon Web Services (AWS) Cloud, AWS Glue provides a fully managed serverless environment where you can extract, transform, and load (ETL) your data. AWS Glue makes it cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores and data streams.

AWS Glue is serverless, so you don’t need to set up any cluster or infrastructure, and you can scale up and down as you need. You pay only for the compute time that a workload uses. Instead of juggling instance types to match a workload, you choose among three worker types recommended for various workloads:

  • Standard

  • G.1X (for memory-intensive workloads)

  • G.2X (for workloads with ML transforms)
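
As a rough illustration of how a worker type is selected in practice, the sketch below builds the keyword arguments that could be passed to the boto3 Glue client's `create_job` call. The helper name `build_job_args`, the job name, the role ARN, and the script location are hypothetical placeholders, and the defaults are assumptions rather than recommendations from this guide. The sketch only constructs and validates the arguments; it does not contact AWS.

```python
from typing import Any, Dict

# Worker types listed in this guide.
WORKER_TYPES = ("Standard", "G.1X", "G.2X")


def build_job_args(
    name: str,
    role_arn: str,
    script_location: str,
    worker_type: str = "G.1X",
    number_of_workers: int = 2,
) -> Dict[str, Any]:
    """Build keyword arguments for a Glue create_job request (hypothetical helper)."""
    if worker_type not in WORKER_TYPES:
        raise ValueError(
            f"Unknown worker type {worker_type!r}; expected one of {WORKER_TYPES}"
        )
    if number_of_workers < 1:
        raise ValueError("number_of_workers must be a positive integer")

    return {
        "Name": name,
        "Role": role_arn,
        # "glueetl" identifies a Spark ETL job; the script is stored in Amazon S3.
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "WorkerType": worker_type,
        "NumberOfWorkers": number_of_workers,
    }


# Placeholder values for illustration only (assumptions, not real resources).
job_args = build_job_args(
    name="demo-etl-job",
    role_arn="arn:aws:iam::123456789012:role/DemoGlueRole",
    script_location="s3://example-bucket/scripts/demo_job.py",
    worker_type="G.1X",
    number_of_workers=2,
)

# With AWS credentials configured, the arguments could then be submitted:
#   import boto3
#   boto3.client("glue").create_job(**job_args)
```

Keeping the argument construction separate from the API call makes it possible to review or unit test the job definition before anything is created in an AWS account.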

AWS Glue consists of the following components:

  • AWS Glue ETL – AWS Glue ETL gives you batch and streaming options to author code and move, transform, and aggregate data from one source to another.

  • AWS Glue Data Catalog – Data Catalog provides a view of the metadata of all your data along with options to crawl specific data sources.

  • AWS Glue DataBrew – DataBrew is a visual data preparation tool that data analysts and data scientists can use to clean and normalize data. You can choose from more than 250 prebuilt transformations to automate data preparation tasks, all without the need to write any code.

This guide briefly introduces each AWS Glue component and the key concepts and features, such as logging, monitoring, and automation, that you should know before authoring ETL jobs. The Next steps section can get you up to speed with writing code in AWS Glue. If you have already used AWS Glue to some extent, you can use the Best practices section to fill any gaps.