Streaming ingestion
Streaming ingestion provides low-latency, high-speed ingestion of stream data from Amazon Kinesis Data Streams or Amazon Managed Streaming for Apache Kafka (Amazon MSK) into an Amazon Redshift provisioned cluster or an Amazon Redshift Serverless workgroup.
Data flow
An Amazon Redshift provisioned cluster or an Amazon Redshift Serverless workgroup is the stream consumer. A materialized view is the landing area for data read from the stream, which is processed as it arrives. For instance, JSON values can be consumed and mapped to the materialized view's data columns, using familiar SQL. When the materialized view is refreshed, Amazon Redshift compute nodes allocate each Kinesis data shard or Kafka partition to a compute slice. Each slice consumes data from the allocated shards until the view reaches parity with the SEQUENCE_NUMBER for the Kinesis stream or the last offset for the Kafka topic. Subsequent materialized view refreshes read data from the last SEQUENCE_NUMBER of the previous refresh until they reach parity with the stream or topic data.
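As an illustration of this flow, the following is a minimal sketch for a Kinesis source. The schema name kds, the stream name orders_stream, the IAM role ARN, and the extracted JSON fields are all hypothetical:

```sql
-- Map the Kinesis stream into Redshift (schema, role, and stream names are hypothetical).
CREATE EXTERNAL SCHEMA kds
FROM KINESIS
IAM_ROLE 'arn:aws:iam::111122223333:role/redshift-streaming-role';

-- Land the stream in a materialized view, mapping JSON fields to columns with SQL.
CREATE MATERIALIZED VIEW mv_orders AS
SELECT approximate_arrival_timestamp,
       partition_key,
       shard_id,
       sequence_number,
       JSON_EXTRACT_PATH_TEXT(FROM_VARBYTE(kinesis_data, 'utf-8'), 'order_id')::VARCHAR(36)    AS order_id,
       JSON_EXTRACT_PATH_TEXT(FROM_VARBYTE(kinesis_data, 'utf-8'), 'customer_id')::VARCHAR(36) AS customer_id,
       JSON_EXTRACT_PATH_TEXT(FROM_VARBYTE(kinesis_data, 'utf-8'), 'amount')::DECIMAL(12,2)    AS amount
FROM kds."orders_stream";
```

Each subsequent REFRESH MATERIALIZED VIEW mv_orders; then reads forward from the SEQUENCE_NUMBER recorded by the previous refresh, as described above.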
Streaming ingestion use cases
Use cases for Amazon Redshift streaming ingestion involve working with data that's generated continually (streamed) and must be processed within a short period (latency) of its generation. This is called near real-time analytics. Sources of data can vary, and include IoT devices, system telemetry data, or clickstream data from a busy website or application.
Streaming ingestion considerations
The following are important considerations and best practices for performance and billing as you set up your streaming ingestion environment.
Auto refresh usage and activation - Auto refresh queries for a materialized view or views are treated as any other user workload. Auto refresh loads data from the stream as it arrives. Auto refresh can be turned on explicitly for a materialized view created for streaming ingestion. To do this, specify AUTO REFRESH in the materialized view definition. Manual refresh is the default. To specify auto refresh for an existing materialized view for streaming ingestion, you can run ALTER MATERIALIZED VIEW to turn it on. For more information, see CREATE MATERIALIZED VIEW or ALTER MATERIALIZED VIEW.
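Continuing the hypothetical mv_orders sketch from earlier, auto refresh can be requested at creation time or added afterward:

```sql
-- Request auto refresh when the streaming materialized view is created
-- (manual refresh is the default).
CREATE MATERIALIZED VIEW mv_orders AUTO REFRESH YES AS
SELECT approximate_arrival_timestamp, kinesis_data
FROM kds."orders_stream";

-- Or turn auto refresh on for an existing streaming materialized view.
ALTER MATERIALIZED VIEW mv_orders AUTO REFRESH YES;
```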
Streaming ingestion and Amazon Redshift Serverless - The same setup and configuration instructions that apply to Amazon Redshift streaming ingestion on a provisioned cluster also apply to streaming ingestion on Amazon Redshift Serverless. It's important to size Amazon Redshift Serverless with the necessary level of RPUs to support streaming ingestion with auto refresh and other workloads. For more information, see Billing for Amazon Redshift Serverless.
Amazon Redshift nodes in a different Availability Zone than the Amazon MSK cluster - When you configure streaming ingestion, Amazon Redshift attempts to connect to an Amazon MSK cluster in the same Availability Zone, if rack awareness is enabled for Amazon MSK. If all of your Amazon MSK broker nodes are in different Availability Zones from your Amazon Redshift cluster, you can incur cross-Availability Zone data-transfer costs. To avoid this, keep at least one Amazon MSK broker cluster node in the same AZ as your Amazon Redshift cluster.
Refresh start location - After creating a materialized view, its initial refresh starts from the TRIM_HORIZON of a Kinesis stream, or from offset 0 of an Amazon MSK topic.
Data formats - Supported data formats are limited to those that can be converted from VARBYTE. For more information, see VARBYTE type and VARBYTE operators.
Streaming to multiple materialized views - In Amazon Redshift, we recommend in most cases that you land the data for each stream in a single materialized view. However, it is possible to ingest a stream and land the data in multiple materialized views. For instance, consider a use case where you ingest a stream containing sports data, but organize the data for each sport into a separate materialized view.
Note that when you ingest data into and refresh multiple materialized views, there can be higher egress costs, specifically for reading data from the streaming provider. Additionally, higher resource use for reading into more than one materialized view can impact other workloads. Also note bandwidth, throughput, and performance limitations for your streaming provider. For more information about pricing for data streams, see Kinesis Data Streams pricing and Amazon Managed Streaming for Apache Kafka pricing.
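A minimal sketch of the multiple-view pattern described above, assuming a hypothetical sports_stream in the same kds schema with a 'sport' field in each JSON record:

```sql
-- One stream, two landing views, each filtering on the hypothetical 'sport' field.
CREATE MATERIALIZED VIEW mv_football AS
SELECT approximate_arrival_timestamp,
       JSON_PARSE(kinesis_data) AS event
FROM kds."sports_stream"
WHERE JSON_EXTRACT_PATH_TEXT(FROM_VARBYTE(kinesis_data, 'utf-8'), 'sport') = 'football';

CREATE MATERIALIZED VIEW mv_tennis AS
SELECT approximate_arrival_timestamp,
       JSON_PARSE(kinesis_data) AS event
FROM kds."sports_stream"
WHERE JSON_EXTRACT_PATH_TEXT(FROM_VARBYTE(kinesis_data, 'utf-8'), 'sport') = 'tennis';
```

Each view's refresh reads from the stream independently, which is where the extra egress and resource use noted above come from.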
Using streaming ingestion compared with staging data in Amazon S3
There are several options for streaming data to Amazon Redshift or to Amazon Redshift Serverless. Two well-known options are streaming ingestion, as described in this topic, or setting up a delivery stream to Amazon S3 with Kinesis Data Firehose. The following list describes each method:
Streaming ingestion from Kinesis Data Streams or Amazon Managed Streaming for Apache Kafka to Amazon Redshift or Amazon Redshift Serverless involves configuring a materialized view to receive the data.
Delivering data into Amazon Redshift using Kinesis Data Streams and streaming through Kinesis Data Firehose involves connecting the source stream to Amazon Kinesis Data Firehose and waiting for Kinesis Data Firehose to stage the data in Amazon S3. This process makes use of various-sized batches at varying-length buffer intervals. After streaming to Amazon S3, Kinesis Data Firehose initiates a COPY command to load the data.
With streaming ingestion, you bypass several steps that are required for the second process:
You don't have to send data to an Amazon Kinesis Data Firehose delivery stream, because with streaming ingestion, data can be sent directly from Kinesis Data Streams to a materialized view in a Redshift database.
You don't have to land streamed data in Amazon S3, because streaming ingestion data goes directly to the Redshift materialized view.
You don't have to write and run COPY commands because the data in the materialized view is refreshed directly from the stream. Loading data from Amazon S3 to Redshift isn't part of the process.
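For comparison, the staged path ends in a COPY from Amazon S3 along these lines; the table, bucket, prefix, and role here are hypothetical, and with streaming ingestion this step disappears entirely:

```sql
-- Load Firehose-staged JSON batches from S3 into a table (all names hypothetical).
COPY public.orders
FROM 's3://amzn-s3-demo-bucket/firehose/orders/'
IAM_ROLE 'arn:aws:iam::111122223333:role/redshift-copy-role'
JSON 'auto';
```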
Note that streaming ingestion is limited to streams from Amazon Kinesis Data Streams and topics from Amazon MSK. For streaming from Kinesis Data Streams to destinations other than Amazon Redshift, it's likely that you need a Kinesis Data Firehose delivery stream. For more information, see Sending Data to an Amazon Kinesis Data Firehose Delivery Stream.
Limitations
Feature or behavior | Description |
---|---|
Kafka topic length limit | It isn't possible to use a Kafka topic with a name longer than 128 characters (not including quotation marks). For more information, see Names and identifiers. |
Incremental refreshes and JOINs on a materialized view | The materialized view must be incrementally maintainable. Full recompute is not possible for Kinesis or Amazon MSK because, by default, they don't preserve stream or topic history past 24 hours or 7 days, respectively. You can set longer data retention periods in Kinesis or Amazon MSK. However, this can result in more maintenance and cost. Additionally, JOINs are not currently supported on materialized views created on a Kinesis stream or on an Amazon MSK topic. After creating a materialized view on your stream or topic, you can create another materialized view to join your streaming materialized view to other materialized views, tables, or views (see the first sketch following this table). For more information, see REFRESH MATERIALIZED VIEW. |
Record parsing | Amazon Redshift streaming ingestion doesn't support parsing records that have been aggregated by the Kinesis Producer Library (KPL Key Concepts - Aggregation). The aggregated records are ingested, but are stored as binary protocol buffer data. (See Protocol buffers for more information.) |
Decompression | VARBYTE doesn't support decompression. Because of this, records containing compressed data can't be queried in Redshift. Decompress your data before pushing it into the Kinesis stream or Amazon MSK topic. |
Maximum record size | The maximum size of any record field Amazon Redshift can ingest from Kinesis or Amazon MSK is slightly less than 1 MB. |
Error records | In each case where a record can't be ingested to Redshift because the size of the data exceeds the maximum size, that record is skipped. The materialized view refresh still succeeds in this case, and a segment of each error record is written to the SYS_STREAM_SCAN_ERRORS system table (see the second sketch following this table). Errors that result from business logic, such as an error in a calculation or a type conversion, are not skipped. Test your logic carefully before adding it to your materialized view definition, to avoid these errors. |
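As the incremental-refresh row above notes, joins belong in a second materialized view layered on the streaming view. A minimal sketch, reusing the hypothetical mv_orders view and assuming a customers dimension table:

```sql
-- Joins aren't supported in the streaming view itself, so join one level up.
CREATE MATERIALIZED VIEW mv_orders_enriched AS
SELECT o.order_id,
       o.amount,
       c.customer_name   -- from the hypothetical customers table
FROM mv_orders o
JOIN customers c ON c.customer_id = o.customer_id;
```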
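And to inspect records that were skipped for exceeding the maximum size, you might query the system table named in the error-records row:

```sql
-- Review recent stream scan errors; each row includes a segment of the
-- offending record, as described above.
SELECT *
FROM sys_stream_scan_errors
LIMIT 100;
```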