Adding Streaming ETL Jobs in AWS Glue

You can create streaming extract, transform, and load (ETL) jobs that run continuously and consume data from streaming sources such as Amazon Kinesis Data Streams, Apache Kafka, and Amazon Managed Streaming for Apache Kafka (Amazon MSK). The jobs cleanse and transform the data, and then load the results into Amazon S3 data lakes or JDBC data stores.

By default, AWS Glue processes and writes out data in 100-second windows. This allows data to be processed efficiently and permits aggregations to be performed on data arriving later than expected. You can modify this window size to increase timeliness or aggregation accuracy. AWS Glue streaming jobs use checkpoints rather than job bookmarks to track the data that has been read.

Creating a streaming ETL job involves the following steps:

  1. For an Apache Kafka streaming source, create an AWS Glue connection to the Kafka source or the Amazon MSK cluster.

  2. Manually create a Data Catalog table for the streaming source.

  3. Create an ETL job for the streaming data source. Define streaming-specific job properties, and supply your own script or optionally modify the generated script.

When creating a streaming ETL job for Amazon Kinesis Data Streams, you don't have to create an AWS Glue connection. However, if a connection that uses Kinesis Data Streams as a source is attached to the AWS Glue streaming ETL job, then a virtual private cloud (VPC) endpoint to Kinesis is required. For more information, see Creating an interface endpoint in the Amazon VPC User Guide.
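
One way to create such an interface endpoint is with the Amazon EC2 API, as in the following boto3 sketch. The VPC, subnet, and security group IDs are placeholders, and you should verify the Kinesis service name for your Region.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Create an interface VPC endpoint so a job running inside the VPC can reach
    # Kinesis Data Streams. All resource IDs below are hypothetical placeholders.
    response = ec2.create_vpc_endpoint(
        VpcEndpointType="Interface",
        VpcId="vpc-0abc1234",                                    # placeholder VPC
        ServiceName="com.amazonaws.us-east-1.kinesis-streams",   # verify for your Region
        SubnetIds=["subnet-0abc1234"],                           # placeholder subnet
        SecurityGroupIds=["sg-0abc1234"],                        # placeholder security group
        PrivateDnsEnabled=True,
    )
    print(response["VpcEndpoint"]["VpcEndpointId"])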

Note

Glue version 2.0 does not support streaming ETL jobs. For more information, see Running Spark ETL Jobs with Reduced Startup Times.

Creating an AWS Glue Connection for an Apache Kafka Data Stream

To read from an Apache Kafka stream, you must create an AWS Glue connection.

To create an AWS Glue connection for a Kafka source (Console)

  1. Open the AWS Glue console at https://console.aws.amazon.com/glue/.

  2. In the navigation pane, under Data catalog, choose Connections.

  3. Choose Add connection, and on the Set up your connection’s properties page, enter a connection name.

  4. For Connection type, choose Kafka.

  5. For Kafka bootstrap servers URLs, enter the host and port number for the bootstrap brokers for your Amazon MSK cluster or Apache Kafka cluster. Use only Transport Layer Security (TLS) endpoints for establishing the initial connection to the Kafka cluster. Plain text endpoints are not supported.

    The following is an example list of host name and port number pairs for an Amazon MSK cluster.

    myserver1.kafka.us-east-1.amazonaws.com:9094,myserver2.kafka.us-east-1.amazonaws.com:9094,myserver3.kafka.us-east-1.amazonaws.com:9094

    For more information about getting the bootstrap broker information, see Getting the Bootstrap Brokers for an Amazon MSK Cluster in the Amazon Managed Streaming for Apache Kafka Developer Guide.

  6. If you want a secure connection to the Kafka data source, select Require SSL connection, and for Kafka private CA certificate location, enter a valid Amazon S3 path to a custom SSL certificate.

    For an SSL connection to self-managed Kafka, the custom certificate is mandatory. It's optional for Amazon MSK.

    For more information about specifying a custom certificate for Kafka, see AWS Glue Connection SSL Properties.

  7. Optionally enter a description, and then choose Next.

  8. For an Amazon MSK cluster, specify its virtual private cloud (VPC), subnet, and security group. The VPC information is optional for self-managed Kafka.

  9. Choose Next to review all connection properties, and then choose Finish.

For more information about AWS Glue connections, see AWS Glue Connections.
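
If you prefer to script this setup, the following is a minimal boto3 sketch of an equivalent Kafka connection. The connection property keys (KAFKA_BOOTSTRAP_SERVERS, KAFKA_SSL_ENABLED, KAFKA_CUSTOM_CERT) and all resource identifiers are assumptions to verify against the AWS Glue Connection API for your environment.

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    glue.create_connection(
        ConnectionInput={
            "Name": "my-kafka-connection",        # hypothetical connection name
            "ConnectionType": "KAFKA",
            "ConnectionProperties": {
                # Use TLS bootstrap brokers only; plain text endpoints are not supported.
                "KAFKA_BOOTSTRAP_SERVERS": "myserver1.kafka.us-east-1.amazonaws.com:9094",
                "KAFKA_SSL_ENABLED": "true",
                # Custom certificate: required for self-managed Kafka, optional for Amazon MSK.
                "KAFKA_CUSTOM_CERT": "s3://my-bucket/kafka-ca-cert.pem",
            },
            # VPC details are required for an Amazon MSK cluster.
            "PhysicalConnectionRequirements": {
                "SubnetId": "subnet-0abc1234",
                "SecurityGroupIdList": ["sg-0abc1234"],
                "AvailabilityZone": "us-east-1a",
            },
        }
    )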

Creating a Data Catalog Table for a Streaming Source

Before creating a streaming ETL job, you must manually create a Data Catalog table that specifies source data stream properties, including the data schema. This table is used as the data source for the streaming ETL job. If you don't know the schema of the data in the source data stream, you can create the table without a schema. Then when you create the streaming ETL job, you can enable the AWS Glue schema detection function. AWS Glue then determines the schema from the streaming data.

Use the AWS Glue console, the AWS Command Line Interface (AWS CLI), or the AWS Glue API to create the table. For information about creating a table manually with the AWS Glue console, see Defining Tables in the AWS Glue Data Catalog.

Note

You can't use the AWS Lake Formation console to create the table; you must use the AWS Glue console.

When creating the table, set the following streaming ETL properties.

Type of Source

Kinesis or Kafka

For a Kinesis source:

Stream name

Stream name as described in Creating a Stream in the Amazon Kinesis Data Streams Developer Guide.

Kinesis source URL

Fully qualified URL of the Amazon Kinesis Data Streams service.

Example: https://kinesis.us-east-1.amazonaws.com

For a Kafka source:

Topic name

Topic name as specified in Kafka.

Connection

An AWS Glue connection that references a Kafka source, as described in Creating an AWS Glue Connection for an Apache Kafka Data Stream.
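
As an alternative to the console, you can also create the table programmatically. The following boto3 sketch creates a schemaless Data Catalog table for a Kinesis source so that schema detection can be enabled later in the job. The parameter keys (typeOfData, streamName, endpointUrl) mirror the console fields above, but their exact names and placement are assumptions; compare them against a table created from the console before relying on them.

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    glue.create_table(
        DatabaseName="my_streaming_db",          # hypothetical database
        TableInput={
            "Name": "my_kinesis_table",          # hypothetical table name
            "Parameters": {
                "classification": "json",
                # Assumed keys mirroring the console fields above; verify against
                # a console-created table.
                "typeOfData": "kinesis",
                "streamName": "my-stream",
                "endpointUrl": "https://kinesis.us-east-1.amazonaws.com",
            },
            "StorageDescriptor": {
                # Omit columns to create the table without a schema and rely on
                # schema detection in the streaming ETL job.
                "Columns": [],
                "Parameters": {},
            },
            "TableType": "EXTERNAL_TABLE",
        },
    )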

Notes and Restrictions for Avro Streaming Sources

The following notes and restrictions apply for streaming sources in the Avro format:

  • When schema detection is enabled, the Avro schema must be included in the payload. When disabled, the payload should contain only data.

  • Some Avro data types are not supported in dynamic frames. You can't specify these data types when defining the schema with the Define a schema page in the create table wizard in the AWS Glue console. During schema detection, unsupported types in the Avro schema are converted to supported types as follows:

    • EnumType => StringType

    • FixedType => BinaryType

    • UnionType => StructType

  • If you define the table schema using the Define a schema page in the console, the implied root element type for the schema is record. If you want a root element type other than record, for example array or map, you can't specify the schema using the Define a schema page. Instead you must skip that page and specify the schema either as a table property or within the ETL script.

    • To specify the schema in the table properties, complete the create table wizard, edit the table details, and add a new key/value pair under Table properties. Use the key avroSchema, and enter a schema JSON object for the value. For example, the Table properties would then contain the following key/value pairs:

      classification = avro
      avroSchema = {"type":"array","items":"string"}
    • To specify the schema in the ETL script, modify the datasource0 assignment statement and add the avroSchema key to the additional_options argument, as shown in the following Python and Scala examples.

      Python

      SCHEMA_STRING = '{"type":"array","items":"string"}'
      datasource0 = glueContext.create_data_frame.from_catalog(database = "database", table_name = "table_name", transformation_ctx = "datasource0", additional_options = {"startingPosition": "TRIM_HORIZON", "inferSchema": "false", "avroSchema": SCHEMA_STRING})

      Scala

      val SCHEMA_STRING = """{"type":"array","items":"string"}"""
      val datasource0 = glueContext.getCatalogSource(database = "database", tableName = "table_name", redshiftTmpDir = "", transformationContext = "datasource0", additionalOptions = JsonOptions(s"""{"startingPosition": "TRIM_HORIZON", "inferSchema": "false", "avroSchema":"$SCHEMA_STRING"}""")).getDataFrame()

Defining Job Properties for a Streaming ETL Job

When you define a streaming ETL job on the AWS Glue console, provide the following streams-specific properties. For descriptions of additional job properties, see Defining Job Properties. For more information about adding a job using the AWS Glue console, see Working with Jobs on the AWS Glue Console.

IAM role

Specify the AWS Identity and Access Management (IAM) role that authorizes access to the resources used to run the job, read the streaming sources, and write to the target data stores.

For access to Amazon Kinesis Data Streams, attach the AmazonKinesisFullAccess AWS managed policy to the role, or attach a similar IAM policy that permits more fine-grained access. For sample policies, see Controlling Access to Amazon Kinesis Data Streams Resources Using IAM.

For more information about permissions for running jobs in AWS Glue, see Managing Access Permissions for AWS Glue Resources.

Type

Choose Spark streaming.

Glue version

Must be 1.0. AWS Glue 2.0 currently does not support streaming ETL jobs.

Job timeout

Optionally enter a duration in minutes. If you leave this field blank, the job runs continuously.

Data source

Specify the table that you created in Creating a Data Catalog Table for a Streaming Source.

Data target

Do one of the following:

  • Choose Create tables in your data target and specify the following data target properties.

    Data store

    Choose Amazon S3 or JDBC.

    Format

    Choose any format. All are supported for streaming.

  • Choose Use tables in the data catalog and update your data target, and choose a table for a JDBC data store.

Output schema definition

Do one of the following:

  • Choose Automatically detect schema of each record to enable schema detection. AWS Glue determines the schema from the streaming data.

  • Choose Specify output schema for all records to use the Apply Mapping transform to define the output schema.

Script

Optionally supply your own script or modify the generated script to perform operations that the Apache Spark Structured Streaming engine supports. For information on the available operations, see Operations on streaming DataFrames/Datasets.
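
The following is a minimal Python sketch of such a streaming script for a Kinesis source that writes Parquet files to Amazon S3, following the glueContext.forEachBatch pattern that generated scripts use. The database, table, and S3 paths are placeholders, and you should verify the option names against a script that AWS Glue generates for your job.

    import sys
    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glueContext = GlueContext(SparkContext())
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # Read the streaming source through the Data Catalog table created earlier.
    data_frame = glueContext.create_data_frame.from_catalog(
        database="my_streaming_db",            # placeholder database
        table_name="my_kinesis_table",         # placeholder table
        transformation_ctx="datasource0",
        additional_options={"startingPosition": "TRIM_HORIZON", "inferSchema": "true"})

    # Called once per micro-batch: convert to a DynamicFrame and write to S3.
    def process_batch(batch_df, batch_id):
        if batch_df.count() > 0:
            dynamic_frame = DynamicFrame.fromDF(batch_df, glueContext, "from_batch")
            glueContext.write_dynamic_frame.from_options(
                frame=dynamic_frame,
                connection_type="s3",
                connection_options={"path": "s3://my-bucket/streaming-output/"},
                format="parquet")

    # windowSize matches the 100-second default described earlier; checkpointLocation
    # is the folder you would delete to reprocess the stream from the beginning.
    glueContext.forEachBatch(
        frame=data_frame,
        batch_function=process_batch,
        options={
            "windowSize": "100 seconds",
            "checkpointLocation": "s3://my-bucket/checkpoint/",
        })

    job.commit()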

Streaming ETL Notes and Restrictions

Keep in mind the following notes and restrictions:

  • When using schema detection, you cannot perform joins of streaming data.

  • Your ETL script can use AWS Glue’s built-in transforms and the transforms native to Apache Spark Structured Streaming. For more information, see Operations on streaming DataFrames/Datasets on the Apache Spark website or Built-In Transforms.

  • AWS Glue streaming ETL jobs use checkpoints to keep track of the data that has been read. Therefore, a stopped and restarted job picks up where it left off in the stream. If you want to reprocess data, you can delete the checkpoint folder referenced in the script.

  • Job bookmarks aren't supported.

  • You can't change the number of shards of an Amazon Kinesis data stream if an AWS Glue streaming job is running and consuming data from that stream. Stop the job first, modify the stream shards, and then restart the job.

  • Kinesis streams must be in the same account as the AWS Glue job.

  • You cannot register a job as a consumer for the enhanced fan-out feature of Kinesis Data Streams.