Streaming ingestion - Amazon Redshift

Streaming ingestion

Previously, loading data from a streaming service like Amazon Kinesis into Amazon Redshift included several steps. These included connecting the stream to Amazon Kinesis Data Firehose and waiting for Kinesis Data Firehose to stage the data in Amazon S3, using various-sized batches at varying-length buffer intervals. After this, Kinesis Data Firehose initiated a COPY command to load the data from Amazon S3 to a table in Redshift.

Rather than staging in Amazon S3, streaming ingestion provides low-latency, high-speed ingestion of stream data from Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka into an Amazon Redshift materialized view. Thus, it lowers the time it takes to access data and it reduces storage cost. You can configure streaming ingestion for your Amazon Redshift cluster or for Amazon Redshift Serverless and create a materialized view, using SQL statements, as described in Creating materialized views in Amazon Redshift. After that, using materialized view refresh, you can ingest hundreds of megabytes of data per second. This results in fast access to external data that is quickly refreshed.

Data flow

An Amazon Redshift provisioned cluster is the stream consumer. A materialized view is the landing area for data read from the stream, which is processed as it arrives. For instance, JSON values can be consumed and mapped to the materialized view's data columns, using familiar SQL. When the materialized view is refreshed, Amazon Redshift compute nodes allocate each Kinesis data shard or Kafka partition to a compute slice. Each slice consumes data from the allocated shards until the view reaches parity with the SEQUENCE_NUMBER for the Kinesis stream or last Offset for the Kafka topic. Subsequent materialized view refreshes read data from the last SEQUENCE_NUMBER of the previous refresh until it reaches parity with the stream or topic data.

Streaming ingestion use cases

Use cases for Amazon Redshift streaming ingestion involve working with data that is generated continually (streamed) and must be processed within a short period (latency) of its generation. This is called near real-time analytics. Sources of data can vary, and include IoT devices, system telemetry data, or clickstream data from a busy website or application.

Streaming ingestion considerations

The following are important considerations and best practices for performance and billing as you set up your streaming ingestion environment.

  • Auto refresh usage - Auto refresh queries for a materialized view or views are treated as any other user workload. Auto refresh loads data from the stream as it arrives.

  • Streaming ingestion and Amazon Redshift Serverless - The same setup and configuration instructions that apply to Amazon Redshift streaming ingestion on a provisioned cluster also apply to streaming ingestion on Amazon Redshift Serverless. It's important to size Amazon Redshift Serverless with the necessary level of RPUs to support streaming ingestion with auto refresh and other workloads. For more information, see Billing for Amazon Redshift Serverless.

  • Amazon Redshift nodes in a different availability zone than the Amazon MSK cluster - When you configure streaming ingestion, Amazon Redshift attempts to connect to an Amazon MSK cluster in the same Availability Zone, if rack awareness is enabled for Amazon MSK. If all of your nodes are in different Availability Zones than your Amazon Redshift cluster, you can incur cross Availability Zone data-transfer cost. To avoid this, keep at least one Amazon MSK broker cluster node in the same AZ as your Amazon Redshift cluster.

  • Refresh start location - After creating a materialized view, its initial refresh starts from the TRIM_HORIZON of a Kinesis stream, or from offset 0 of an Amazon MSK topic.

  • Data formats - Supported data formats are limited to those that can be converted from VARBYTE. For more information, see VARBYTE type and VARBYTE operators.

  • Streaming to multiple materialized views - In Amazon Redshift, we recommend in most cases that you land the data for each stream in a single materialized view. However, it is possible to ingest a stream and land the data in multiple materialized views. For instance, a use case where you ingest a stream containing sports data, but you organize data for each sport into a separate materialized views.

    Note that when you ingest data into and refresh multiple materialized views, there can be higher egress costs, specifically for reading data from the streaming provider. Additionally, higher resource use for reading into more than one materialized view can impact other workloads. Also note bandwidth, throughput and performance limitations for your streaming provider. For more information about pricing for data streams, see Kinesis Data Streams pricing and Amazon Managed Streaming for Apache Kafka pricing.

Limitations

Feature or behavior Description
Kafka topic length limit

It isn't possible to use a Kafka topic with a name longer than 128 characters (not including quotation marks). For more information, see Names and identifiers.

Incremental refreshes and JOINs on a materialized view

The materialized view must be incrementally maintainable. Full recompute is not possible for Kinesis or Amazon MSK because they don't preserve stream or topic history past 24 hours or 7 days, by default. You can set longer data retention periods in Kinesis or Amazon MSK. However, this can result in more maintenance and cost. Additionally, JOINs are not currently supported on materialized views created on a Kinesis stream, or on an Amazon MSK topic. After creating a materialized view on your stream or topic, you can create another materialized view in order to join your streaming materialized view to other materialized views, tables, or views.

For more information, see REFRESH MATERIALIZED VIEW.

Record parsing

Amazon Redshift streaming ingestion doesn't support parsing records that have been aggregated by the Kinesis Producer Library (KPL Key Concepts - Aggregation). The aggregated records are ingested, but are stored as binary protocol buffer data. (See Protocol buffers for more information.) Depending on how you push data to Kinesis, you may need to turn off this feature.

Decompression

VARBYTE does not currently support any decompression methods. Because of this, records containing compressed data can't be queried inside Amazon Redshift. Decompress your data before pushing it into the Kinesis stream or Amazon MSK topic.

Maximum record size

The maximum size of any record field Amazon Redshift can ingest from Kinesis or Amazon MSK is slightly less than 1MB. The following points detail the behavior:

  • Maximum VARBYTE length - The VARBYTE type supports data to a maximum length of 1,024,000 bytes. Because Kinesis limits payloads to 1MB, after Base64 encoding, all Kinesis data can be ingested by Amazon Redshift.

  • Message limits - Default Amazon MSK configuration limits messages to 1MB. Additionally, if a message includes headers, the amount of data is limited to 1,048,470 bytes. With default settings, there are no problems with ingestion. However, you can change the maximum message size for Kafka, and therefore Amazon MSK, to a larger value. In this case, it might be possible for the key/value field of a Kafka record, or the header, to exceed the size limit. These records can cause an error and are not ingested.

Error records

In each case where a record can't be ingested to Amazon Redshift because the size of the data exceeds the maximum size, that record is skipped. Materialized view refresh still succeeds, in this case, and a segment of each error record is written to the SYS_STREAM_SCAN_ERRORS system table. Errors that result from business logic, such as an error in a calculation or an error resulting from a type conversion, are not skipped. Test the logic carefully, before you add logic to your materialized view definition, to avoid these.