Streaming ingestion to a materialized view - Amazon Redshift

Streaming ingestion to a materialized view

This topic describes how to use materialized views for fast access to streaming data.

Streaming ingestion provides low-latency, high-speed data ingestion from Amazon Kinesis Data Streams or Amazon Managed Streaming for Apache Kafka to an Amazon Redshift provisioned or Amazon Redshift Serverless database. The data lands in a Redshift materialized view that's configured for the purpose. This results in fast access to external data. Streaming ingestion lowers data-access time and reduces storage cost. You can configure it for your Amazon Redshift cluster or for your Amazon Redshift Serverless workgroup, using a small collection of SQL commands. After it's set up, each materialized-view refresh can ingest hundreds of megabytes of data per second.

How data flows from a streaming service to Redshift

It helps to understand how streaming ingestion works and the database objects utilized in the process. Data flows directly from a data-stream provider to an Amazon Redshift provisioned cluster or to an Amazon Redshift Serverless workgroup. There isn't a temporary landing area, such as an Amazon S3 bucket. The provisioned cluster or workgroup is the stream consumer. In the Redshift database, the data read from the stream lands in a materialized view. The data is processed as it arrives. For instance, JSON values can be consumed and mapped to a materialized view's data columns, using SQL. When the materialized view is refreshed, Redshift consumes data from allocated Kinesis data shards or Kafka partitions until the view is brought up to date with the stream.

Use cases for Amazon Redshift streaming ingestion involve data that's generated continually and must be processed within a short period, or latency, from its origination. This is commonly called near real-time analytics. Sources can include IT devices, system-telemetry devices, and click-stream data from a busy website or application.

Data parsing best practices for improving performance

When you configure streaming ingestion, there are options in how you can parse the incoming data. Practices can include performing business logic or formatting as data arrives. We recommend the following best practices to avoid errors or data loss. These derived from internal testing and helping customers troublehoot configuration and parsing issues.

  • Extracting values from streamed data – If you use the JSON_EXTRACT_PATH_TEXT function in your materialized view definition to parse or shred streamed JSON, it can significantly impact performance and latency. To explain, for each column extracted using JSON_EXTRACT_PATH_TEXT, the incoming JSON is re-parsed. After this, data-type conversion, filtering, and business-logic calculations occur. This means, for example, that if you extract 10 columns from JSON data, each JSON record is parsed 10 times, which includes additional logic. This results in higher ingestion latency. An alternative approach we recommend is to use the JSON_PARSE function to convert JSON records to Redshift's SUPER data type. After the streamed data lands in the materialized view, use PartiQL to extract individual strings from the SUPER representation of the JSON data. For more information, see Querying semistructured data.

    Additionally, note that JSON_EXTRACT_PATH_TEXT has a 64KB data-size maximum. Thus, if any JSON record is larger than 64KB, processing it with JSON_EXTRACT_PATH_TEXT results in an error.

  • Mapping an Amazon Kinesis Data Streams stream or Amazon MSK topic to multiple materialized views – We don't recommend creating multiple materialized views to ingest data from a single stream or topic. This is because each materialized view creates a consumer for each shard in the Kinesis Data Streams stream or partition in the Kafka topic. This can result in throttling or exceeding the throughput of the stream or topic. It also can result in higher cost, because you ingest the same data multiple times. When you configure streaming ingestion, we recommend you create one materialized view for each stream or topic.

    If your use case requires that you ingest data from one KDS stream or MSK topic into multiple materialized views, consult the AWS Big Data blog, specifically Best practices to implement near-real-time analytics using Amazon Redshift Streaming Ingestion with Amazon MSK, before you do so.

Streaming ingestion behavior and data types

The following table describes technical behavior details and size limits for various data types. We recommend being familiar with these prior to configuring a materialized view for streaming ingestion.

Feature or behavior Description
Kafka topic length limit

It isn't possible to use a Kafka topic with a name longer than 128 characters (not including quotation marks). For more information, see Names and identifiers.

Incremental refreshes and JOINs on a materialized view

The materialized view must be incrementally maintainable. Full recompute is not possible for Kinesis or Amazon MSK because they don't preserve stream or topic history past 24 hours or 7 days, by default. You can set longer data retention periods in Kinesis or Amazon MSK. However, this can result in more maintenance and cost. Additionally, JOINs are not currently supported on materialized views created on a Kinesis stream, or on an Amazon MSK topic. After creating a materialized view on your stream or topic, you can create another materialized view in order to join your streaming materialized view to other materialized views, tables, or views.

For more information, see REFRESH MATERIALIZED VIEW.

Record parsing

Amazon Redshift streaming ingestion doesn't support parsing records that have been aggregated by the Kinesis Producer Library (KPL Key Concepts - Aggregation). The aggregated records are ingested, but are stored as binary protocol buffer data. (See Protocol buffers for more information.) Depending on how you push data to Kinesis, you may need to turn off this feature.

Decompression

VARBYTE doesn't support decompression. Because of this, records containing compressed data can't be queried in Redshift. Decompress your data before adding it to the Kinesis stream or Amazon MSK topic.

Maximum record size

The maximum size of any record Amazon Redshift can ingest from Kinesis or Amazon MSK is 16,777,216 bytes (16 MiB), the maximum size supported by the VARBYTE datatype in Amazon Redshift. By default, Amazon Redshift streaming materialized views created on a Kinesis data stream or Amazon MSK topic will set the size of the data column to 1,048,576 bytes (1 MiB) and 16,777,216 bytes (16 MiB) respectively.

Note

1MiB is the current maximum size of any record that can be put in a Kinesis data stream. For more information on Kinesis size limits, go to Quotas and limits in the Amazon Kinesis Data Streams Developer Guide.

Error records

In each case where a record can't be ingested to Redshift because the size of the data exceeds the maximum , that record is skipped. Materialized view refresh still succeeds, in this case, and a segment of each error record is written to the SYS_STREAM_SCAN_ERRORS system table. Errors that result from business logic, such as an error in a calculation or an error resulting from a type conversion, are not skipped. Test the logic carefully before you add it to your materialized view definition.

Amazon MSK Multi-VPC private connectivity

Amazon MSK multi-VPC private connectivity isn't currently supported for Redshift streaming ingestion. Alternatively, you can use VPC peering to connect VPCs or AWS Transit Gateway to connect VPCs and on-premises networks through a central hub. Either of these can enable Redshift to communicate with an Amazon MSK cluster or with Amazon MSK Serverless in another VPC.

Auto refresh usage and activation

Auto refresh queries for a materialized view or views are treated as any other user workload. Auto refresh loads data from the stream as it arrives.

Auto refresh can be turned on explicitly for a materialized view created for streaming ingestion. To do this, specify AUTO REFRESH in the materialized view definition. Manual refresh is the default. To specify auto refresh for an existing materialized view for streaming ingestion, you can run ALTER MATERIALIZED VIEW to turn it on. For more information, see CREATE MATERIALIZED VIEW or ALTER MATERIALIZED VIEW.

Streaming ingestion and Amazon Redshift Serverless

The setup and configuration instructions that apply to Amazon Redshift streaming ingestion on a provisioned cluster also apply to streaming ingestion on Amazon Redshift Serverless. It's important to specify the necessary level of RPUs to support streaming ingestion with auto refresh and other workloads. For more information, see Billing for Amazon Redshift Serverless.

Amazon Redshift nodes in a different availability zone than the Amazon MSK cluster

When you configure streaming ingestion, Amazon Redshift attempts to connect to an Amazon MSK cluster in the same availability one, if rack awareness is enabled for Amazon MSK. If all of your nodes are in different availability zones than your Amazon Redshift cluster, you can incur cross availability zone data-transfer cost. To avoid this, keep at least one Amazon MSK broker cluster node in the same AZ as your Redshift provisioned cluster or workgroup.

Refresh start location

After creating a materialized view, its initial refresh starts from the TRIM_HORIZON of a Kinesis stream, or from offset 0 of an Amazon MSK topic.

Data formats

Supported data formats are limited to those that can be converted from VARBYTE. For more information, see VARBYTE type and VARBYTE operators.

Appending records to a table

You can run ALTER TABLE APPEND to append rows to a target table from an existing source materialized view. This works only if the materialized view is configured for streaming ingestion. For more information, see ALTER TABLE APPEND.

Running TRUNCATE or DELETE

You can remove records from a materialized view that's used for streaming ingestion, using the following:

  • TRUNCATE – This deletes all of the rows from a materialized view that's configured for streaming ingestion. It doesn't do a table scan. For more information, see TRUNCATE.

  • DELETE – This deletes all of the rows from a materialized view that's configured for streaming ingestion. For more information, see DELETE.