AWS Glue ETL

You can use AWS Glue ETL to move, clean, and transform data from one source to another by using built-in transforms, Apache Spark, and Python. For more information about supported sources, see Connection types and options for ETL in AWS Glue.

Authoring in AWS Glue ETL

AWS Glue ETL code can be authored in the following ways:

  • Python shell – When the size of the data is very small, you can use Python shell to write Python scripts to manipulate data.

  • Spark – You can use Scala or Python to author ETL jobs with AWS Glue and Apache Spark, as shown in the sketch after this list.

  • Spark Streaming – To enrich, aggregate, and combine streaming data, you can use streaming ETL jobs. In AWS Glue, streaming ETL jobs run on the Apache Spark Structured Streaming engine.

  • AWS Glue Studio – If you are new to Apache Spark programming or are accustomed to ETL tools with boxes-and-arrows interfaces, get started by using AWS Glue Studio.
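
The following is a minimal sketch of a PySpark batch ETL script authored for AWS Glue. The database, table, column, and Amazon S3 path names are placeholders; a real job would read from and write to locations defined in your own AWS Glue Data Catalog and S3 buckets.

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Resolve the job name that AWS Glue passes in at run time.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the source table from the AWS Glue Data Catalog.
# "sales_db" and "raw_orders" are placeholder names.
source = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="raw_orders",
)

# Drop a field that is not needed downstream (placeholder column name).
cleaned = source.drop_fields(["internal_notes"])

# Write the result to Amazon S3 in Parquet format (placeholder bucket).
glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/orders/"},
    format="parquet",
)

job.commit()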

DPUs and worker types

AWS Glue measures processing capacity in data processing units (DPUs), which you allocate by choosing a worker type. You can allocate DPUs based on the data volume and velocity needs of a specific use case.

Using Python shell

A Python shell job can use either 1 DPU (16 GB of memory) or 0.0625 DPU (1 GB of memory). Because a Python shell job does not use the Apache Spark environment to run Python, it is not shown in the following table. Python shell is intended for jobs where the size of the data is small.
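
The following sketch shows how a Python shell job with the 0.0625 DPU allocation might be created programmatically by using the AWS SDK for Python (Boto3). The job name, role ARN, and script location are placeholders.

import boto3

glue = boto3.client("glue")

# Create a Python shell job that allocates 0.0625 DPU (1 GB of memory).
# The job name, role ARN, and script location are placeholders.
response = glue.create_job(
    Name="small-data-cleanup",
    Role="arn:aws:iam::123456789012:role/ExampleGlueRole",
    Command={
        "Name": "pythonshell",
        "ScriptLocation": "s3://example-bucket/scripts/cleanup.py",
        "PythonVersion": "3",
    },
    MaxCapacity=0.0625,
)
print(response["Name"])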

AWS Glue Python or Scala on Apache Spark

The following table shows the different AWS Glue worker types for the Apache Spark run environment for batch, streaming, and AWS Glue Studio workloads. Note that with AWS Glue Studio, you can use only G.1X and G.2X worker types.

                        Standard    G.1X     G.2X
vCPU                    4           4        8
Memory                  16 GB       16 GB    32 GB
Disk space              50 GB       64 GB    128 GB
Executors per worker    2           1        1
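
The following sketch shows how a worker type from the preceding table might be selected when you create a Spark ETL job by using the AWS SDK for Python (Boto3). The job name, role ARN, script location, AWS Glue version, and worker count are placeholders.

import boto3

glue = boto3.client("glue")

# Create a Spark ETL job on G.2X workers (8 vCPU and 32 GB of memory each).
# The job name, role ARN, script location, and worker count are placeholders.
response = glue.create_job(
    Name="orders-batch-etl",
    Role="arn:aws:iam::123456789012:role/ExampleGlueRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://example-bucket/scripts/orders_etl.py",
        "PythonVersion": "3",
    },
    GlueVersion="3.0",
    WorkerType="G.2X",
    NumberOfWorkers=10,
)
print(response["Name"])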