Adding Streaming ETL Jobs in AWS Glue

You can create streaming extract, transform, and load (ETL) jobs that run continuously, consume data from streaming sources like Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK), perform transforms, and load the results into Amazon S3 data lakes or JDBC data stores.

By default, AWS Glue processes and writes out data in 100-second windows. This allows data to be processed efficiently and permits aggregations to be performed on data arriving later than expected. You can modify this window size to increase timeliness or aggregation accuracy. AWS Glue uses checkpoints rather than job bookmarks to track the data that has been read.
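As a minimal sketch of how the window size might be changed in a script: AWS Glue streaming scripts generally pass an options dictionary containing `windowSize` and `checkpointLocation` to `GlueContext.forEachBatch`. The helper name `batch_options`, the 30-second value, and the S3 path below are hypothetical and used only for illustration.

```python
def batch_options(window_seconds, checkpoint_location):
    """Build the per-batch options dict (hypothetical helper, not part of AWS Glue)."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    if not checkpoint_location.startswith("s3://"):
        # Assumption: checkpoints are kept in Amazon S3.
        raise ValueError("checkpoint_location is assumed to be an s3:// URI")
    return {
        "windowSize": f"{window_seconds} seconds",
        "checkpointLocation": checkpoint_location,
    }


# Inside a job script, the dict would typically be passed like this
# (source_df and process_batch are placeholders defined elsewhere in the script):
#   glueContext.forEachBatch(frame=source_df,
#                            batch_function=process_batch,
#                            options=options)
options = batch_options(30, "s3://example-bucket/checkpoints/job1/")
print(options)
```

A shorter window delivers results sooner. A longer window gives late-arriving records more time to be included in an aggregation.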

Creating a streaming ETL job involves the following steps:

  1. For an Apache Kafka streaming source, create an AWS Glue connection to the Amazon MSK cluster.

  2. Manually create a Data Catalog table for the streaming source.

  3. Create an ETL job for the streaming data source. Define streaming-specific job properties, and supply your own script or optionally modify the generated script.

When creating a streaming ETL job for Amazon Kinesis Data Streams, you do not have to create an AWS Glue connection. However, if there is a connection attached to the AWS Glue streaming job that has Kinesis Data Streams as a source, then a virtual private cloud (VPC) endpoint to Kinesis is required. For more information, see Creating an interface endpoint in the Amazon VPC User Guide.
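The interface endpoint mentioned above could be created with the AWS SDK for Python (Boto3) along the following lines. This is a sketch, not a required procedure: the function names and all resource IDs are hypothetical, and the service name follows the `com.amazonaws.<region>.kinesis-streams` pattern used for Kinesis Data Streams interface endpoints.

```python
def kinesis_endpoint_request(region, vpc_id, subnet_ids, security_group_ids):
    """Build keyword arguments for ec2.create_vpc_endpoint (hypothetical helper)."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.kinesis-streams",
        "SubnetIds": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
        # Assumption: private DNS is wanted so the regular Kinesis URL resolves
        # to the endpoint from inside the VPC.
        "PrivateDnsEnabled": True,
    }


def create_kinesis_endpoint(region, request):
    """Send the request. Needs boto3 and AWS credentials, so it is not called here."""
    import boto3  # imported lazily so the builder above works without boto3

    return boto3.client("ec2", region_name=region).create_vpc_endpoint(**request)


request = kinesis_endpoint_request(
    "us-east-1", "vpc-0example", ["subnet-0example"], ["sg-0example"]
)
print(request["ServiceName"])
```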

Creating an AWS Glue Connection for an Apache Kafka Data Stream

To read from an Apache Kafka stream, you must create an AWS Glue connection.

To create an AWS Glue connection for a Kafka source (Console)

  1. Open the AWS Glue console at https://console.aws.amazon.com/glue/.

  2. In the navigation pane, under Data catalog, choose Connections.

  3. Choose Add connection, and on the Set up your connection’s properties page, enter a connection name.

  4. For Connection type, choose Kafka.

  5. For Kafka bootstrap servers URLs, enter the host and port number for the bootstrap brokers for an Amazon MSK cluster. For more information about getting the bootstrap broker information, see Getting the Bootstrap Brokers for an Amazon MSK Cluster in the Amazon Managed Streaming for Apache Kafka Developer Guide. Use only Transport Layer Security (TLS) endpoints for establishing the initial connection to the Kafka cluster. Plaintext endpoints are not supported.

    The following is an example list of host name and port number pairs for an Amazon MSK cluster.

    myserver1.kafka.us-east-1.amazonaws.com:9094,myserver2.kafka.us-east-1.amazonaws.com:9094,myserver3.kafka.us-east-1.amazonaws.com:9094

  6. Complete the connection wizard. You must specify the following for the Amazon MSK cluster: the virtual private cloud (VPC) ID, one of the subnets used by the cluster, and the security group ID that has access to the cluster.

    For more information about AWS Glue connections, see AWS Glue Connections.
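The console steps above have a rough equivalent in the AWS SDK for Python (Boto3), sketched below. The helper names, connection name, and resource IDs are hypothetical. The sketch assumes that port 9092 is the cluster's plaintext listener and rejects it, because only TLS endpoints are supported. The API call takes a subnet and security group rather than a VPC ID, and the optional Availability Zone argument is an assumption about what your setup needs.

```python
from typing import List, Optional, Tuple


def parse_bootstrap_servers(servers: str) -> List[Tuple[str, int]]:
    """Split a comma-separated 'host:port' list into (host, port) pairs."""
    pairs = []
    for entry in servers.split(","):
        host, sep, port = entry.strip().rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"expected host:port, got {entry!r}")
        pairs.append((host, int(port)))
    return pairs


def kafka_connection_input(name: str, servers: str, subnet_id: str,
                           security_group_id: str,
                           availability_zone: Optional[str] = None) -> dict:
    """Build the ConnectionInput argument for glue.create_connection."""
    for host, port in parse_bootstrap_servers(servers):
        if port == 9092:  # assumption: 9092 is the plaintext listener
            raise ValueError(f"{host}:{port} looks like a plaintext endpoint")
    requirements = {
        "SubnetId": subnet_id,
        "SecurityGroupIdList": [security_group_id],
    }
    if availability_zone:
        requirements["AvailabilityZone"] = availability_zone
    return {
        "Name": name,
        "ConnectionType": "KAFKA",
        "ConnectionProperties": {"KAFKA_BOOTSTRAP_SERVERS": servers},
        "PhysicalConnectionRequirements": requirements,
    }


servers = ("myserver1.kafka.us-east-1.amazonaws.com:9094,"
           "myserver2.kafka.us-east-1.amazonaws.com:9094")
connection_input = kafka_connection_input(
    "example-msk-connection", servers, "subnet-0example", "sg-0example"
)
# With boto3 installed and credentials configured, the request would be sent as:
#   boto3.client("glue").create_connection(ConnectionInput=connection_input)
print(connection_input["ConnectionType"])
```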

Creating a Data Catalog Table for a Streaming Source

Before creating a streaming ETL job, you must manually create a Data Catalog table that specifies source data stream properties. This table is used as the data source for the streaming ETL job.

Use the AWS Glue console, the AWS Command Line Interface (AWS CLI), or the AWS Glue API to create the table. For information about creating a table manually with the AWS Glue console, see Defining Tables in the AWS Glue Data Catalog.

Note

You can't use the AWS Lake Formation console to create the table; you must use the AWS Glue console.

When creating the table, set the following streaming-specific properties, and be sure to fully specify the table schema.

Type of Source

Kinesis or Kafka

For a Kinesis source:

Stream name

Stream name as described in Creating a Stream in the Amazon Kinesis Data Streams Developer Guide.

Kinesis source URL

Fully qualified URL of the Amazon Kinesis Data Streams service.

Example: https://kinesis.us-east-1.amazonaws.com

For a Kafka source:

Topic name

Topic name as specified in Kafka.

Connection

An AWS Glue connection that references a Kafka source, as described in Creating an AWS Glue Connection for an Apache Kafka Data Stream.
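For readers who script table creation through the AWS Glue API, the properties above might map onto a `TableInput` structure roughly as follows. This is a sketch under explicit assumptions: the parameter keys `typeOfData`, `streamName`, `endpointUrl`, `topicName`, and `connectionName` are not stated on this page and should be checked against a table created in the console. The helper names, table names, and columns are hypothetical.

```python
def kinesis_source_params(stream_name, endpoint_url):
    # Assumed keys; verify against a console-created table before relying on them.
    return {
        "typeOfData": "kinesis",
        "streamName": stream_name,
        "endpointUrl": endpoint_url,
    }


def kafka_source_params(topic_name, connection_name):
    # Assumed keys; verify against a console-created table before relying on them.
    return {
        "typeOfData": "kafka",
        "topicName": topic_name,
        "connectionName": connection_name,
    }


def streaming_table_input(name, columns, source_params, classification="json"):
    """Build a TableInput dict with a fully specified schema."""
    if not columns:
        raise ValueError("the table schema must be fully specified")
    return {
        "Name": name,
        "Parameters": {"classification": classification},
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "Parameters": dict(source_params),
        },
    }


table_input = streaming_table_input(
    "example_clicks",
    [("user_id", "string"), ("event_time", "timestamp")],
    kinesis_source_params("example-stream", "https://kinesis.us-east-1.amazonaws.com"),
)
# With boto3 and credentials available, this could be sent as:
#   boto3.client("glue").create_table(DatabaseName="example_db", TableInput=table_input)
print(table_input["StorageDescriptor"]["Parameters"])
```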

Defining Job Properties for a Streaming ETL Job

When you define a streaming ETL job on the AWS Glue console, provide the following streaming-specific properties. For descriptions of additional job properties, see Defining Job Properties. For more information about adding a job using the AWS Glue console, see Working with Jobs on the AWS Glue Console.

IAM role

Specify the AWS Identity and Access Management (IAM) role that is used for authorization to resources that are used to run the job, access streaming sources, and access target data stores.

For access to Amazon Kinesis Data Streams, attach the AmazonKinesisFullAccess AWS managed policy to the role, or attach a similar IAM policy that permits more fine-grained access. For sample policies, see Controlling Access to Amazon Kinesis Data Streams Resources Using IAM.

For more information about permissions for running jobs in AWS Glue, see Managing Access Permissions for AWS Glue Resources.

Type

Choose Spark streaming.

Glue version

Must be 1.0 or later.

Maximum capacity

Set a maximum capacity that is less than or equal to the number of shards (for Kinesis) or partitions (for Kafka) in the source stream. This setting prevents AWS Glue from allocating capacity that can't be used. The default is 5 DPUs.

Job timeout

Optionally enter a duration in minutes. If you leave this field blank, the job runs continuously.

Data source

Specify the table that you created in Creating a Data Catalog Table for a Streaming Source.

Data target

Do one of the following:

  • Choose Create tables in your data target and specify the following data target properties.

    Data store

    Choose Amazon S3 or JDBC.

    Format

    Choose any format. All are supported for streaming.

  • Choose Use tables in the data catalog and update your data target, and then choose a table for a JDBC data store.

Script

Optionally supply your own script or modify the generated script to perform operations that the Apache Spark Structured Streaming engine supports. For information on the available operations, see Operations on streaming DataFrames/Datasets.
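The console properties above correspond loosely to arguments of the AWS Glue `CreateJob` API. The sketch below builds such a request with hypothetical names, ARNs, and paths. It caps the requested capacity at the shard or partition count, as recommended under Maximum capacity. It assumes that omitting `Timeout` in the API behaves like leaving the console field blank, which this page does not confirm, and AWS Glue may also enforce its own minimum capacity.

```python
def streaming_job_request(name, role_arn, script_location, shard_count,
                          requested_dpus=5.0, timeout_minutes=None):
    """Build keyword arguments for glue.create_job (hypothetical helper)."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    request = {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "gluestreaming",  # job command name for Spark streaming jobs
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "GlueVersion": "1.0",  # must be 1.0 or later
        # Avoid allocating more DPUs than there are shards or partitions.
        "MaxCapacity": float(min(requested_dpus, shard_count)),
    }
    if timeout_minutes is not None:
        request["Timeout"] = int(timeout_minutes)
    return request


job_request = streaming_job_request(
    "example-streaming-job",
    "arn:aws:iam::123456789012:role/ExampleGlueRole",
    "s3://example-bucket/scripts/stream_job.py",
    shard_count=4,
)
# With boto3 and credentials available, this could be sent as:
#   boto3.client("glue").create_job(**job_request)
print(job_request["MaxCapacity"])
```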

Streaming ETL Notes and Restrictions

Keep in mind the following notes and restrictions:

  • Your ETL script can use AWS Glue’s built-in transforms and the transforms native to Apache Spark Structured Streaming. For more information, see Operations on streaming DataFrames/Datasets on the Apache Spark website or Built-In Transforms.

  • AWS Glue uses checkpoints to keep track of the data that has been read. Therefore, a stopped and restarted job picks up where it left off in the stream. If you want to reprocess data, you can delete the checkpoint folder referenced in the script.

  • Job bookmarks are not supported.

  • The Avro format is supported for data targets, but not for data sources.

  • You can't change the number of shards of an Amazon Kinesis data stream if an AWS Glue streaming job is running and consuming data from that stream. Stop the job first, modify the stream shards, and then restart the job.

  • Kinesis streams must be in the same account as the AWS Glue job.

  • Apache Kafka is currently supported only through Amazon MSK.

  • You cannot register a job as a consumer for the enhanced fan-out feature of Kinesis Data Streams.
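To illustrate the checkpoint note above, the following sketch deletes every object under a checkpoint folder in Amazon S3 so that a restarted job reprocesses data. The function names and the S3 path are hypothetical. Stop the job first, and double-check the prefix, because the deletion cannot be undone.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Return (bucket, prefix) for an s3:// URI, with a trailing slash on the prefix."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URI: {uri!r}")
    prefix = parsed.path.lstrip("/")
    if not prefix:
        # Safety check: never treat a whole bucket as a checkpoint folder.
        raise ValueError("refusing to target an entire bucket")
    if not prefix.endswith("/"):
        prefix += "/"
    return parsed.netloc, prefix


def delete_checkpoint(uri):
    """Delete all objects under the checkpoint prefix. Needs boto3 and credentials."""
    import boto3  # imported lazily so split_s3_uri works without boto3

    bucket, prefix = split_s3_uri(uri)
    boto3.resource("s3").Bucket(bucket).objects.filter(Prefix=prefix).delete()


bucket, prefix = split_s3_uri("s3://example-bucket/checkpoints/job1")
print(bucket, prefix)
```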