Best Practices - Amazon Managed Streaming for Apache Kafka

Best Practices

This topic outlines some best practices to follow when using Amazon MSK.

Right-size your cluster

When you create an MSK cluster, you specify the type and number of brokers.

Number of partitions per broker

The following table shows the recommended maximum number of partitions (including leader and follower replicas) per broker. However, the number of partitions per broker is affected by use case and configuration. We also recommend that you perform your own testing to determine the right type for your brokers. For more information about the different broker types, see Broker types.

Broker type Maximum number of partitions (including leader and follower replicas) per broker
kafka.t3.small 300
kafka.m5.large or kafka.m5.xlarge 1000
kafka.m5.2xlarge 2000
kafka.m5.4xlarge, kafka.m5.8xlarge, kafka.m5.12xlarge, kafka.m5.16xlarge, or kafka.m5.24xlarge 4000

For guidance on choosing the number of partitions, see Apache Kafka Supports 200K Partitions Per Cluster.

Number of brokers per cluster

To determine the right number of brokers for your MSK cluster and understand costs, see the MSK Sizing and Pricing spreadsheet. This spreadsheet provides an estimate for sizing an MSK cluster and the associated costs of Amazon MSK compared to a similar, self-managed, EC2-based Apache Kafka cluster. For more information about the input parameters in the spreadsheet, hover over the parameter descriptions. This spreadsheet was the result of running a test workload with three producers and three consumers, and ensuring that P99 write latencies were below 100 ms. This might not reflect your workload or performance expectations. Therefore, we recommend that you test your workloads after provisioning the cluster.

Build highly available clusters

Use the following recommendations so that your MSK cluster can be highly available during an update (such as when you're updating the broker type or Apache Kafka version, for example) or when Amazon MSK is replacing a broker.

  • Ensure that the replication factor (RF) is at least 2 for two-AZ clusters and at least 3 for three-AZ clusters. An RF of 1 can lead to offline partitions during a rolling update.

  • Set minimum in-sync replicas (minISR) to at most RF - 1. A minISR that is equal to the RF can prevent producing to the cluster during a rolling update. A minISR of 2 allows three-way replicated topics to be available when one replica is offline.

  • Ensure client connection strings include multiple brokers. Having multiple brokers in a client’s connection string allows for failover when a specific broker is offline for an update. For information about how to get a connection string with multiple brokers, see Getting the bootstrap brokers for an Amazon MSK Cluster.

Monitor CPU usage

Amazon MSK strongly recommends that you maintain the total CPU utilization for your brokers under 60%. Total CPU utilization is the sum of the CpuUser and CpuSystem metrics. When you have at least 40% of your cluster’s total CPU available, Apache Kafka can redistribute CPU load across brokers in the cluster when necessary. One example of when this is necessary is when Amazon MSK detects and recovers from a broker fault; in this case, Amazon MSK performs automatic maintenance, like patching. Another example is when a user requests a broker-type change or version upgrade; in these two cases, Amazon MSK deploys rolling workflows that take one broker offline at a time. When brokers with lead partitions go offline, Apache Kafka reassigns partition leadership to redistribute work to other brokers in the cluster. By following this best practice you can ensure you have enough CPU headroom in your cluster to tolerate operational events like these.

You can use Amazon CloudWatch metric math to create a composite CPU metric that is the sum of CpuUser and CpuSystem. Set an alarm that gets triggered when the composite CPU metric reaches a P95 of 60%. When the alarm is triggered, scale the cluster using one of the following options:

  • Option 1 (recommended): Update your broker type to the next larger type. For example, if the current type is kafka.m5.large, update the cluster to use kafka.m5.xlarge. Keep in mind that when you update the broker type in the cluster, Amazon MSK takes brokers offline in a rolling fashion and temporarily reassigns partition leadership to other brokers. A size update typically takes 10-15 minutes per broker.

  • Option 2: If there are topics with all messages ingested from producers that use round-robin writes (in other words, messages aren't keyed and ordering isn’t important to consumers), expand your cluster by adding brokers. Also add partitions to existing topics with the highest throughput. Next, use kafka-topics.sh --describe to ensure that newly added partitions are assigned to the new brokers. The main benefit of this option compared to the previous one is that you can manage resources and costs more granularly. Additionally, you can use this option if CPU load significantly exceeds 60% because this form of scaling doesn't typically result in increased load on existing brokers.

  • Option 3: Expand your cluster by adding brokers, then reassign existing partitions by using the partition reassignment tool named kafka-reassign-partitions.sh. However, if you use this option, the cluster will need to spend resources to replicate data from broker to broker after partitions are reassigned. Compared to the two previous options, this can significantly increase the load on the cluster at first. As a result, Amazon MSK doesn't recommend using this option when CPU utilization is above 70% because replication causes additional CPU load and network traffic. Amazon MSK only recommends using this option if the two previous options aren't feasible.

Other recommendations:

  • Monitor total CPU utilization per broker as a proxy for load distribution. If brokers have consistently uneven CPU utilization it might be a sign that load isn't evenly distributed within the cluster. Amazon MSK recommends using Cruise Control to continuously manage load distribution via partition assignment.

  • Monitor produce and consume latency. Produce and consume latency can increase linearly with CPU utilization.

Monitor disk space

To avoid running out of disk space for messages, create a CloudWatch alarm that watches the KafkaDataLogsDiskUsed metric. When the value of this metric reaches or exceeds 85%, perform one or more of the following actions:

For information on how to set up and use alarms, see Using Amazon CloudWatch Alarms. For a full list of Amazon MSK metrics, see Monitoring an Amazon MSK Cluster.

Adjust data retention parameters

Consuming messages doesn't remove them from the log. To free up disk space regularly, you can explicitly specify a retention time period, which is how long messages stay in the log. You can also specify a retention log size. When either the retention time period or the retention log size are reached, Apache Kafka starts removing inactive segments from the log.

To specify a retention policy at the cluster level, set one or more of the following parameters: log.retention.hours, log.retention.minutes, log.retention.ms, or log.retention.bytes. For more information, see Custom MSK Configurations.

You can also specify retention parameters at the topic level:

  • To specify a retention time period per topic, use the following command.

    kafka-configs.sh --zookeeper ZooKeeperConnectionString --alter --entity-type topics --entity-name TopicName --add-config retention.ms=DesiredRetentionTimePeriod
  • To specify a retention log size per topic, use the following command.

    kafka-configs.sh --zookeeper ZooKeeperConnectionString --alter --entity-type topics --entity-name TopicName --add-config retention.bytes=DesiredRetentionLogSize

The retention parameters that you specify at the topic level take precedence over cluster-level parameters.

Don't add non-MSK brokers

If you use Apache ZooKeeper commands to add brokers, these brokers don't get added to your MSK cluster, and your Apache ZooKeeper will contain incorrect information about the cluster. This might result in data loss. For supported cluster operations, see Amazon MSK: How It Works.

Enable in-transit encryption

For information about encryption in transit and how to enable it, see Encryption in Transit.

Reassign partitions

To move partitions to different brokers on the same cluster, you can use the partition reassignment tool named kafka-reassign-partitions.sh. For example, after you add new brokers to expand a cluster, you can rebalance that cluster by reassigning partitions to the new brokers. For information about how to add brokers to a cluster, see Expanding an Amazon MSK Cluster. For information about the partition reassignment tool, see Expanding your cluster in the Apache Kafka documentation.