Tiered storage - Amazon Managed Streaming for Apache Kafka

Tiered storage

Tiered storage is a low-cost storage tier for Amazon MSK that scales to virtually unlimited storage, making it cost-effective to build streaming data applications.

You can create an Amazon MSK cluster configured with tiered storage that balances performance and cost. Amazon MSK stores streaming data in a performance-optimized primary storage tier until the data reaches the Apache Kafka topic retention limits. Then, Amazon MSK automatically moves the data into the low-cost storage tier.

When your application starts reading data from the tiered storage, you can expect an increase in read latency for the first few bytes. As you start reading the remaining data sequentially from the low-cost tier, you can expect latencies that are similar to the primary storage tier. You don't need to provision any storage for the low-cost tiered storage or manage the infrastructure. You can store any amount of data and pay only for what you use. This feature is compatible with the APIs introduced in KIP-405: Kafka Tiered Storage.
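As a rough illustration of how the two tiers relate at the topic level, the following sketch builds the topic settings named in KIP-405 (remote.storage.enable, local.retention.ms, retention.ms). The helper names and the example retention values are assumptions chosen for illustration, not Amazon MSK defaults.

```python
# Sketch only: helper names and example values are hypothetical.
DAY_MS = 24 * 60 * 60 * 1000
HOUR_MS = 60 * 60 * 1000


def tiered_topic_config(local_retention_ms, total_retention_ms):
    """Return KIP-405 topic settings for a tiered-storage topic.

    local_retention_ms: how long data stays in primary storage.
    total_retention_ms: how long data is kept overall (both tiers).
    """
    if local_retention_ms > total_retention_ms:
        raise ValueError("local retention cannot exceed total retention")
    return {
        "remote.storage.enable": "true",
        "local.retention.ms": str(local_retention_ms),
        "retention.ms": str(total_retention_ms),
    }


def to_cli_flags(config):
    """Turn a config dict into repeated --config key=value arguments."""
    flags = []
    for key, value in config.items():
        flags += ["--config", f"{key}={value}"]
    return flags


# Example: keep 4 hours in primary storage and 7 days overall.
example = tiered_topic_config(4 * HOUR_MS, 7 * DAY_MS)
example_flags = to_cli_flags(example)
```

The resulting flags could be appended to a kafka-topics.sh --create command when the topic is created.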

Here are some of the features of tiered storage:

  • You can scale to virtually unlimited storage. You don't have to guess how to scale your Apache Kafka infrastructure.

  • You can retain data longer in your Apache Kafka topics, or increase your topic storage, without the need to increase the number of brokers.

  • It provides a longer duration safety buffer to handle unexpected delays in processing.

  • You can reprocess old data in its exact production order with your existing stream processing code and Kafka APIs.

  • Partitions rebalance faster because data on secondary storage doesn't require replication across broker disks.

  • It best serves applications that must retain data for longer than a day, or that store more than 1-2 TB per broker.

  • Data between brokers and the tiered storage moves within the VPC and doesn't travel through the internet.

  • You can use tiered storage at the cluster level, but this doesn't automatically enable tiered storage for all topics.

  • A client machine connects to a cluster with tiered storage enabled by using the same process as for a cluster without tiered storage. See Create a client machine.
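To make the per-broker sizing guidance above more concrete, here is a back-of-the-envelope sketch of how much primary storage each broker would hold for a given retention window. The formula (ingress rate × retention × replication factor ÷ broker count) is a common rule of thumb and an assumption here, not an Amazon MSK sizing method. It ignores compression, indexes, and uneven partition placement, and the workload numbers are invented.

```python
# Sketch only: a simplified estimate with made-up workload numbers.
def primary_storage_per_broker_gb(ingress_mb_per_s, retention_hours,
                                  replication_factor, brokers):
    """Estimate GB held on each broker's primary storage."""
    total_mb = ingress_mb_per_s * 3600 * retention_hours * replication_factor
    return total_mb / 1000 / brokers


# Hypothetical workload: 20 MB/s ingress, replication factor 3, 6 brokers.
# Keeping all 72 hours on broker disks:
all_on_primary = primary_storage_per_broker_gb(20, 72, 3, 6)

# Keeping only 4 hours on broker disks, with the rest in the low-cost tier:
with_tiering = primary_storage_per_broker_gb(20, 4, 3, 6)
```

Under these assumptions, the first case comes to roughly 2.6 TB per broker and the second to about 144 GB, which is the kind of gap that lets retention grow without adding brokers.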

Tiered storage requirements

  • You must use Apache Kafka client version 3.0.0 or higher to create a new topic with tiered storage enabled. To transition an existing topic to tiered storage, you can instead reconfigure a client machine that uses a Kafka client version lower than 3.0.0. The minimum supported Apache Kafka version is 2.8.2.tiered. See Step 3: Create a topic.

  • The Amazon MSK cluster with tiered storage enabled must use Apache Kafka version 2.8.2.tiered.
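The version requirements above can be expressed as a small pre-flight check. This is a minimal sketch: the function names are hypothetical, and the parsing simply keeps the leading numeric components of a version string (so "2.8.2.tiered" is read as 2.8.2).

```python
# Sketch only: function names are hypothetical.
def parse_version(version):
    """Return the leading numeric components, padded to three parts."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break  # stop at a non-numeric suffix such as "tiered"
        parts.append(int(piece))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def client_can_create_tiered_topic(client_version):
    """New tiered-storage topics need a Kafka client of 3.0.0 or higher."""
    return parse_version(client_version) >= (3, 0, 0)


def cluster_supports_tiered_storage(cluster_version):
    """Per the requirement above, the cluster must run 2.8.2.tiered."""
    return cluster_version == "2.8.2.tiered"
```

For example, client_can_create_tiered_topic("2.8.2") is False, so that client could only be used to reconfigure an existing topic.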

Tiered storage constraints and limitations

Tiered storage has the following constraints and limitations:

  • Tiered storage applies only to provisioned mode clusters.

  • Tiered storage doesn’t support broker type t3.small.

  • The minimum retention period in low-cost storage is 3 days. There is no minimum retention period for primary storage.

  • Tiered storage doesn’t support multiple log directories on a broker (JBOD-related features).

  • Compacted topics can’t use the tiered storage feature.

  • You can’t re-enable tiered storage at a topic level after you have disabled it.

  • You can’t edit the storage mode for a cluster that uses tiered storage. Amazon MSK only supports editing the cluster storage mode when a cluster uses EBS storage.

  • The kafka-log-dirs tool can't report tiered storage data size. The tool only reports the size of the log segments in primary storage.
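Several of the constraints above can be checked before a topic is created. The sketch below is one way to do that. The TopicPlan fields and the function name are hypothetical. It also assumes that the 3-day minimum is checked against the topic's retention.ms value and that a negative value means unlimited retention, which is an interpretation of the list rather than documented behavior.

```python
# Sketch only: TopicPlan and its fields are hypothetical.
from dataclasses import dataclass
from typing import List

THREE_DAYS_MS = 3 * 24 * 60 * 60 * 1000


@dataclass
class TopicPlan:
    cluster_mode: str    # "provisioned" or "serverless"
    broker_type: str     # for example "kafka.m5.large"
    cleanup_policy: str  # "delete", "compact", or "compact,delete"
    retention_ms: int    # assumed: negative means unlimited retention


def tiered_storage_violations(plan: TopicPlan) -> List[str]:
    """Return the listed constraints that the plan would violate."""
    problems = []
    if plan.cluster_mode != "provisioned":
        problems.append("tiered storage applies only to provisioned mode clusters")
    if plan.broker_type.endswith("t3.small"):
        problems.append("broker type t3.small is not supported")
    if "compact" in plan.cleanup_policy:
        problems.append("compacted topics can't use tiered storage")
    # Assumption: the 3-day minimum is compared with retention.ms.
    if 0 <= plan.retention_ms < THREE_DAYS_MS:
        problems.append("retention is below the 3-day minimum for low-cost storage")
    return problems


ok_plan = TopicPlan("provisioned", "kafka.m5.large", "delete", 7 * 24 * 60 * 60 * 1000)
bad_plan = TopicPlan("serverless", "kafka.t3.small", "compact", 60 * 60 * 1000)
```

An empty list from tiered_storage_violations(ok_plan) means none of these checks failed. It does not cover the other constraints, such as JBOD or storage mode changes.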