Streaming ingestion to a materialized view
This topic describes how to use materialized views for fast access to streaming data.
Streaming ingestion provides
low-latency, high-speed data ingestion from Amazon Kinesis Data Streams
How data flows from a streaming service to Redshift
It helps to understand how streaming ingestion works and the database objects utilized in the process. Data flows directly from a data-stream provider to an Amazon Redshift provisioned cluster or to an Amazon Redshift Serverless workgroup. There isn't a temporary landing area, such as an Amazon S3 bucket. The provisioned cluster or workgroup is the stream consumer. In the Redshift database, the data read from the stream lands in a materialized view. The data is processed as it arrives. For instance, JSON values can be consumed and mapped to a materialized view's data columns, using SQL. When the materialized view is refreshed, Redshift consumes data from allocated Kinesis data shards or Kafka partitions until the view is brought up to date with the stream.
Use cases for Amazon Redshift streaming ingestion involve data that's generated continually and must be processed within a short period, or latency, from its origination. This is commonly called near real-time analytics. Sources can include IT devices, system-telemetry devices, and click-stream data from a busy website or application.
Data parsing best practices for improving performance
When you configure streaming ingestion, there are options in how you can parse the incoming data. Practices can include performing business logic or formatting as data arrives. We recommend the following best practices to avoid errors or data loss. These derived from internal testing and helping customers troublehoot configuration and parsing issues.
Extracting values from streamed data – If you use the JSON_EXTRACT_PATH_TEXT function in your materialized view definition to parse or shred streamed JSON, it can significantly impact performance and latency. To explain, for each column extracted using JSON_EXTRACT_PATH_TEXT, the incoming JSON is re-parsed. After this, data-type conversion, filtering, and business-logic calculations occur. This means, for example, that if you extract 10 columns from JSON data, each JSON record is parsed 10 times, which includes additional logic. This results in higher ingestion latency. An alternative approach we recommend is to use the JSON_PARSE function to convert JSON records to Redshift's SUPER data type. After the streamed data lands in the materialized view, use PartiQL to extract individual strings from the SUPER representation of the JSON data. For more information, see Querying semistructured data.
Additionally, note that JSON_EXTRACT_PATH_TEXT has a 64KB data-size maximum. Thus, if any JSON record is larger than 64KB, processing it with JSON_EXTRACT_PATH_TEXT results in an error.
Mapping an Amazon Kinesis Data Streams stream or Amazon MSK topic to multiple materialized views – We don't recommend creating multiple materialized views to ingest data from a single stream or topic. This is because each materialized view creates a consumer for each shard in the Kinesis Data Streams stream or partition in the Kafka topic. This can result in throttling or exceeding the throughput of the stream or topic. It also can result in higher cost, because you ingest the same data multiple times. When you configure streaming ingestion, we recommend you create one materialized view for each stream or topic.
If your use case requires that you ingest data from one KDS stream or MSK topic into multiple materialized views, consult the AWS Big Data blog
, specifically Best practices to implement near-real-time analytics using Amazon Redshift Streaming Ingestion with Amazon MSK , before you do so.
Streaming ingestion behavior and data types
The following table describes technical behavior details and size limits for various data types. We recommend being familiar with these prior to configuring a materialized view for streaming ingestion.
Feature or behavior | Description |
---|---|
Kafka topic length limit | It isn't possible to use a Kafka topic with a name longer than 128 characters (not including quotation marks). For more information, see Names and identifiers. |
Incremental refreshes and JOINs on a materialized view | The materialized view must be incrementally maintainable. Full recompute is not possible for Kinesis or Amazon MSK because they don't preserve stream or topic history past 24 hours or 7 days, by default. You can set longer data retention periods in Kinesis or Amazon MSK. However, this can result in more maintenance and cost. Additionally, JOINs are not currently supported on materialized views created on a Kinesis stream, or on an Amazon MSK topic. After creating a materialized view on your stream or topic, you can create another materialized view in order to join your streaming materialized view to other materialized views, tables, or views. For more information, see REFRESH MATERIALIZED VIEW. |
Record parsing | Amazon Redshift streaming ingestion doesn't support parsing records that have been aggregated by the Kinesis
Producer Library (KPL Key Concepts - Aggregation). The aggregated
records are ingested, but are stored as binary protocol buffer
data. (See Protocol buffers |
Decompression |
|
Maximum record size | The maximum size of any record field Amazon Redshift can ingest from Kinesis or Amazon MSK is slightly less than 1MB. The following points detail the behavior:
NoteAmazon Redshift supports a maximum size of 1,024,000 bytes for streaming ingestion from Kinesis or Amazon MSK, even though Amazon Redshift supports a maximum size of 16 MB for the |
Error records | In each case where a record can't be ingested to Redshift because the size of the data exceeds the maximum , that record is skipped. Materialized view refresh still succeeds, in this case, and a segment of each error record is written to the SYS_STREAM_SCAN_ERRORS system table. Errors that result from business logic, such as an error in a calculation or an error resulting from a type conversion, are not skipped. Test the logic carefully before you add it to your materialized view definition. |
Amazon MSK Multi-VPC private connectivity | Amazon MSK multi-VPC private connectivity isn't currently supported for Redshift streaming ingestion. Alternatively, you can use VPC peering to connect VPCs or AWS Transit Gateway to connect VPCs and on-premises networks through a central hub. Either of these can enable Redshift to communicate with an Amazon MSK cluster or with Amazon MSK Serverless in another VPC. |
Auto refresh usage and activation | Auto refresh queries for a materialized view or views are treated as any other user workload. Auto refresh loads data from the stream as it arrives. Auto refresh can be turned on explicitly for a materialized view created for streaming
ingestion. To do this, specify |
Streaming ingestion and Amazon Redshift Serverless | The setup and configuration instructions that apply to Amazon Redshift streaming ingestion on a provisioned cluster also apply to streaming ingestion on Amazon Redshift Serverless. It's important to specify the necessary level of RPUs to support streaming ingestion with auto refresh and other workloads. For more information, see Billing for Amazon Redshift Serverless. |
Amazon Redshift nodes in a different availability zone than the Amazon MSK cluster | When you configure streaming ingestion, Amazon Redshift attempts to connect to an Amazon MSK cluster in the same availability one, if rack awareness is enabled for Amazon MSK. If all of your nodes are in different availability zones than your Amazon Redshift cluster, you can incur cross availability zone data-transfer cost. To avoid this, keep at least one Amazon MSK broker cluster node in the same AZ as your Redshift provisioned cluster or workgroup. |
Refresh start location | After creating a materialized view, its initial refresh starts from
the |
Data formats |
Supported data formats are limited to those that can be converted from |
Appending records to a table | You can run |
Running TRUNCATE or DELETE | You can remove records from a materialized view that's used for streaming ingestion, using the following:
|