Amazon Managed Streaming for Apache Kafka (Amazon MSK) - Amazon MSK Migration Guide

Infrastructure

Before you can create an Amazon MSK cluster, you need to have an Amazon Virtual Private Cloud (Amazon VPC) and subnets set up within that VPC. In most AWS Regions where Amazon MSK is available, you can specify either two or three subnets. The subnets must all be located in different Availability Zones. When you create a cluster, Amazon MSK distributes the broker nodes evenly across the subnets you specify. The following reference architecture shows an example of an Amazon MSK cluster infrastructure.

Figure: Amazon MSK cluster infrastructure

In this reference architecture, the brokers and Apache ZooKeeper nodes are deployed into an AWS service-managed VPC dedicated to each cluster. You can limit access to the Apache ZooKeeper nodes by assigning them a separate security group. The brokers in the cluster are made accessible to clients in the VPC through elastic network interfaces that appear in the customer account. Traffic between clients and brokers is private by default, meaning it doesn't traverse the public internet. The security groups on the network interfaces enable protocol- and port-based traffic filtering for the brokers.

IP addresses from the customer VPC are attached to these network interfaces, and by default all network traffic stays within the AWS network; connections between clients and an Amazon MSK cluster are not accessible from the internet.

Amazon MSK provides control-plane operations for creating, updating, and deleting clusters. Customers use Apache Kafka data-plane operations for producing and consuming data. For a complete list of operations, see the Amazon MSK API Reference Guide. For more information on cluster provisioning, see Getting Started Using Amazon MSK.

At the time of this publication, Amazon MSK supports instance types from the M and T instance families. For the most up-to-date list of supported instance types, see Broker types.

Reliability

The following reliability best practices are recommended when running Amazon MSK:

  • Deploy Amazon MSK clusters across three Availability Zones to ensure high availability. This is the default cluster provisioning configuration.

  • Ensure a replication factor (RF) of at least 2, and preferably 3, for each topic. An RF of 1 can lead to offline partitions during a rolling update or a broker failure.

  • Set minimum in-sync replicas (minISR) to at most RF - 1. A minISR that is equal to the RF can prevent producing messages to the cluster during a rolling update. With a minISR of 2, three-way replicated topics remain available when one replica is offline.

  • Upgrade to the recommended version of Apache Kafka on Amazon MSK. Upgrading your cluster software ensures that you are using the latest stable features of the open source project, including:

    • Performance enhancements that help applications run faster.

    • Apache Kafka bug fixes, security fixes, and feature enhancements.

  • Depending on the scale of your workload, evaluate the use of multiple clusters. Amazon MSK uses a rolling broker update process for many of its operations, including version updates and updating broker types. Operations are carried out one broker at a time and can vary in cycle time. By using multiple clusters, you can parallelize these operations, reducing overall time to completion per cluster. In addition, you can reduce the impact of a cluster-level issue such as an improper configuration by segmenting the workload into multiple clusters.
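
The RF/minISR guidance above can be sanity-checked with a few lines of arithmetic (an illustrative sketch; the helper names are hypothetical, not an Amazon MSK or Apache Kafka API):

```python
def safe_min_isr(replication_factor: int) -> int:
    """Largest minISR that still tolerates one offline replica.

    Setting minISR equal to RF means one broker going offline during a
    rolling update blocks producers that use acks=all.
    """
    return max(1, replication_factor - 1)

def tolerates_one_offline_replica(replication_factor: int, min_isr: int) -> bool:
    # With one replica offline, RF - 1 replicas remain in sync; producing
    # with acks=all succeeds only if that many still satisfy minISR.
    return replication_factor - 1 >= min_isr

# A three-way replicated topic with minISR=2 stays writable with one broker down.
print(safe_min_isr(3))                      # 2
print(tolerates_one_offline_replica(3, 2))  # True
print(tolerates_one_offline_replica(3, 3))  # False
```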

Security

Amazon MSK has a range of tools and methods to help secure your data processing in the AWS Cloud.

Network Security

By default, clients can privately connect to an Amazon MSK cluster within their Amazon VPC, with traffic traversing only within the AWS network.

  • Enable Transport Layer Security (TLS) encryption between brokers and clients.

  • Use Security Group rules to control the protocols and port numbers allowed on your brokers.

  • Use Apache Kafka APIs to configure your cluster properties and restrict Apache ZooKeeper access with dedicated security groups.

Authentication and Authorization

Amazon MSK provides various methods for access management. AWS recommends using AWS Identity and Access Management (IAM) for managing client authentication and authorization, based on its security, ease of use, and low cost. For more information, see IAM Access Control in the Amazon MSK Developer Guide.

Alternatively, you can use mutual Transport Layer Security (mTLS) with ACM Private CA or SASL/SCRAM authentication with Apache Kafka Access Control List (ACL) authorization.
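
For illustration, a Java client configured for IAM access control typically carries properties like the following, which assume the open source aws-msk-iam-auth library is on the client classpath (verify the exact settings against the Amazon MSK Developer Guide):

```properties
# client.properties for IAM access control
security.protocol=SASL_SSL
sasl.mechanism=AWS_MSK_IAM
sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler
```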

Logging and Auditing

Amazon MSK makes it easy for you to troubleshoot clients and analyze communication by streaming Apache Kafka broker logs to Amazon CloudWatch, Amazon Simple Storage Service (Amazon S3), or Amazon Data Firehose. AWS recommends enabling logging. Amazon MSK service API actions are logged in AWS CloudTrail where you can view and search for recent events.

Encryption

  • AWS recommends enabling encryption in transit.
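
Encryption settings correspond to the EncryptionInfo section of the cluster definition in the Amazon MSK CreateCluster API. The sketch below shows the general shape, with the KMS key ID left as a placeholder:

```json
"EncryptionInfo": {
    "EncryptionAtRest": { "DataVolumeKMSKeyId": "<kms-key-id>" },
    "EncryptionInTransit": { "ClientBroker": "TLS", "InCluster": true }
}
```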

Cost

With Amazon MSK, there are no minimum fees or upfront commitments. You do not pay for the Apache ZooKeeper nodes that Amazon MSK provisions for you, or for in-cluster data transfer between brokers and nodes.

You pay for the time your broker instances run and for the storage you use each month. In addition to Amazon MSK charges, standard AWS data transfer charges can apply.

Note

Amazon MSK now supports Apache Kafka version 2.4.1 for new clusters so that consumers can fetch from the closest replica, which reduces latency and potential cross-AZ data transfer.

  • Use the Amazon MSK sizing and pricing spreadsheet to correctly size the cluster.

  • Configure the message retention period or log size to minimize storage costs.

  • Track Amazon MSK cluster costs with custom tagging.

  • Use T-family broker instance types for low-cost development, for testing small-to-medium streaming workloads, or for low-throughput streaming workloads that experience temporary spikes in throughput. M5 brokers have higher baseline throughput performance than T3 brokers and are recommended for production workloads.
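
As a back-of-the-envelope illustration of the billing model (broker instance-hours plus GB-months of provisioned storage), with placeholder prices — actual rates vary by Region and instance type, so check the Amazon MSK pricing page:

```python
# Hypothetical example rates; look up the real prices for your Region.
BROKER_HOURLY_RATE = 0.21      # USD per broker-hour (placeholder)
STORAGE_GB_MONTH_RATE = 0.10   # USD per GB-month (placeholder)
HOURS_PER_MONTH = 730

def monthly_cost(brokers: int, storage_gb_per_broker: int) -> float:
    """Estimate monthly cost: broker instance-hours plus provisioned storage.

    ZooKeeper nodes and in-cluster replication traffic are not billed.
    """
    compute = brokers * HOURS_PER_MONTH * BROKER_HOURLY_RATE
    storage = brokers * storage_gb_per_broker * STORAGE_GB_MONTH_RATE
    return compute + storage

# Three brokers with 1,000 GB of storage each under the placeholder rates.
print(round(monthly_cost(3, 1000), 2))
```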

Performance

Amazon MSK is designed to provide performance that is identical to an Apache Kafka cluster running on Amazon Elastic Compute Cloud (Amazon EC2).

  • Use the Amazon MSK Sizing and Pricing spreadsheet to right-size your cluster for performance.

  • Test your workloads after provisioning the cluster.

  • Use M5 broker instance types for production workloads.

  • Follow the partitions-per-broker guidance to get the expected performance; for recommended partition counts, see the Kafka documentation. A high number of partitions increases CPU usage and can delay metrics availability or cause cluster downtime. Decrease the number of partitions by deleting unused topics, or migrate to a larger instance type for your brokers.

  • Set minimum in-sync replicas (minISR) to at most RF - 1. A minISR that is equal to the RF can prevent producing to the cluster during a rolling update. With a minISR of 2, three-way replicated topics are available when one replica is offline.

Capacity

When building out capacity for an Amazon MSK cluster, consider the following two key factors.

Instance type for the broker

When creating an Amazon MSK cluster, you specify the instance type of the brokers at cluster launch. You can start with a few brokers within an Amazon MSK cluster. Then, using the AWS Management Console or the AWS Command Line Interface (AWS CLI), you can scale up to 30 brokers per cluster and 90 brokers per account. These are soft limits that can be raised by requesting a quota increase. Alternatively, you can scale your clusters by changing the size or family of your Apache Kafka brokers, which is a best practice because you don't need to reassign Apache Kafka partitions in order to scale up or down.

The M5 brokers have higher baseline throughput performance and support more partitions per broker than the T3 brokers, and are therefore recommended for production workloads. For more information about M5 instance types, see Amazon EC2 M5 Instances.

T3 brokers have the ability to use CPU credits to temporarily burst performance. Use T3 brokers for low-cost development if you are testing small to medium streaming workloads, or if you have low-throughput streaming workloads that experience temporary spikes in throughput. For more information about T3 instance types, see Amazon EC2 T3 Instances.

The broker instance type affects the maximum number of partitions (including replica partitions) that each broker can support. Perform testing to determine the right instance type for your brokers.

Number of brokers per cluster

Use the Amazon MSK Sizing and Pricing spreadsheet to determine the correct number of brokers for your Amazon MSK cluster. This spreadsheet provides an estimate for sizing an Amazon MSK cluster and the associated costs of Amazon MSK compared to a similar, self-managed, EC2-based Apache Kafka cluster.

Input parameters in the spreadsheet

Basic parameters:

  • Ingestion rate (MB/s): number of messages per second multiplied by the average message size going into the cluster.

  • Replication factor: the replication factor setting for Apache Kafka topics.

  • Total data out (MB/s): similar to the ingestion rate, but for messages leaving the cluster.

  • Retention: how long data is retained in the cluster, in hours. Choose an average retention period across topics for the sizing exercise.

  • Encryption in transit: whether TLS encryption in transit is enabled. TLS encryption consumes additional CPU.

  • EBS disk utilization: the target percentage usage of EBS volumes. Leave some headroom to scale; you can also use the automatic storage expansion feature to set a target disk utilization and a maximum scaling capacity.

Advanced parameters:

  • Number of AZs: AWS recommends three AZs.

  • Nearest replica fetching (Apache Kafka 2.4.x and later): lets consumers fetch from the closest replica, reducing latency and cross-AZ data transfer.

  • Lagging consumer percentage.

  • Broker Safety Factor: Number of brokers that can go offline without affecting performance.

  • RAM data cached, in seconds.

Based on these inputs, the following outputs are returned:

  • Storage load (required, provisioned storage and Total IOPS)

  • Networking (Read/Write throughput, MB/s)

  • Data Transfer (Cross-AZ and inter-broker replication traffic)

  • Memory (RAM, GB)

Based on these outputs, the instance type selected in the summary section, whether the instances are for production or test, and the number of partitions, the spreadsheet provides a recommended number of brokers. The Sizing and Pricing spreadsheet is based on a test workload with three consumers, sized so that P99 write latencies stay below 100 ms; this may not reflect your workload or performance expectations. Use the spreadsheet as a starting point and run performance tests for your workloads after provisioning the cluster.
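
The storage-load output can be approximated from the basic inputs above. The following is a simplified sketch of the spreadsheet's arithmetic, not the spreadsheet itself:

```python
def required_storage_gb(ingestion_mb_per_s: float,
                        retention_hours: float,
                        replication_factor: int,
                        ebs_utilization: float = 0.85) -> float:
    """Provisioned storage needed across the cluster, in GB.

    Every byte ingested is stored replication_factor times for the full
    retention period; dividing by the target EBS utilization leaves headroom.
    """
    raw_gb = ingestion_mb_per_s * retention_hours * 3600 / 1024
    return raw_gb * replication_factor / ebs_utilization

# 10 MB/s ingested, 24 h retention, RF 3, 85% target disk utilization.
print(round(required_storage_gb(10, 24, 3), 1))
```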

It is possible to change the number of brokers in the Amazon MSK cluster after creation. After resizing, the cluster will need to be rebalanced by reassigning partitions to the new brokers.
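
For example, the broker count can be changed with the AWS CLI's update-broker-count command (the cluster ARN and version string below are placeholders; the current version is returned by describe-cluster). Afterward, reassign partitions to the new brokers, for example with kafka-reassign-partitions.sh:

```shell
# Increase the cluster to 6 brokers.
aws kafka update-broker-count \
    --cluster-arn <cluster-arn> \
    --current-version <current-cluster-version> \
    --target-number-of-broker-nodes 6
```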

Monitoring

There are two monitoring options: Amazon CloudWatch and Open Monitoring with Prometheus. If you already ingest Apache Kafka metrics into a monitoring solution such as Datadog or New Relic, AWS recommends using Open Monitoring with Prometheus to maintain compatibility throughout your migration, and supplementing those metrics with the ones offered in Amazon CloudWatch. Open Monitoring with Prometheus also integrates with Cruise Control for managing Apache Kafka partitions.

With Amazon CloudWatch, you can set the monitoring level for an Amazon MSK cluster to one of the following:

  • DEFAULT

  • PER_BROKER

  • PER_TOPIC_PER_BROKER

  • PER_TOPIC_PER_PARTITION

Note

The DEFAULT level is included with the Amazon MSK cluster whereas there is a cost associated with the other monitoring levels. The PER_TOPIC_PER_PARTITION level is only relevant for consumer lag.

Consumer-lag metrics quantify the difference between the latest data written to your topics and the data read by your applications. Amazon MSK provides the following consumer-lag metrics, which you can get through Amazon CloudWatch or through Open Monitoring with Prometheus: 

  • OffsetLag: Partition-level consumer lag in number of offsets.

  • MaxOffsetLag: The maximum offset lag across all partitions in a topic.

  • EstimatedMaxTimeLag: Time estimate (in seconds) to drain MaxOffsetLag.

  • EstimatedTimeLag: Time estimate (in seconds) to drain the partition offset lag.

  • SumOffsetLag: The aggregated offset lag for all the partitions in a topic.

In addition to the consumer-lag metrics, AWS recommends monitoring the following cluster health metrics:

  • KafkaDataLogsDiskUsed: The cluster should not reach or exceed 85% disk space usage; high disk usage can impact producing and consuming. Configure an Amazon CloudWatch alarm that monitors the KafkaDataLogsDiskUsed metric and alerts you when disk usage is too high.

    To mitigate storage issues, you can do one of the following:

    • Seamlessly scale up the amount of storage provisioned per broker

    • Create an auto scaling policy to automatically expand your storage

    • Change the retention period of messages and logs.

    • Delete unused topics.

  • UnderMinIsrPartitionCount: The number of partitions whose count of in-sync replicas (ISRs) is below the configured minimum. The expected value is zero; set an alarm on any non-zero value.

  • ActiveControllerCount: Exactly one controller should be active at a time. Set an alarm that alerts you when the ActiveControllerCount (Maximum) statistic differs from 1.

  • OfflinePartitionsCount: The number of partitions that don't have an active leader and are therefore neither writable nor readable. Configure an alarm on any value greater than 0.

  • UnderReplicatedPartitions: The number of in-sync replicas should equal the total number of replicas. As a general best practice, set an alarm on any value greater than 0.

  • CPU: Amazon MSK strongly recommends that you maintain the total CPU utilization for your brokers under 60%.

  • RequestTime: The average time, in milliseconds, spent in broker network and I/O threads processing requests from client groups.

  • BytesInPerSec/BytesOutPerSec: The number of bytes per second received from clients and sent to clients.

For more information, see Amazon MSK Metrics for Monitoring with CloudWatch.
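
Conceptually, the lag metrics are derived from producer and consumer offsets. The following is a minimal sketch of the arithmetic, not the service's implementation:

```python
def offset_lag(latest_offset, committed_offset):
    """OffsetLag for one partition: messages written but not yet consumed."""
    return max(0, latest_offset - committed_offset)

def sum_offset_lag(partitions):
    """SumOffsetLag: aggregated offset lag across a topic's partitions.

    partitions is a list of (latest_offset, committed_offset) pairs.
    """
    return sum(offset_lag(latest, committed) for latest, committed in partitions)

def estimated_time_lag(lag, consume_rate_per_s):
    """EstimatedTimeLag: seconds to drain the lag at the current consume rate."""
    return lag / consume_rate_per_s if consume_rate_per_s > 0 else float("inf")

# Two partitions, as (latest, committed) offset pairs.
parts = [(1000, 800), (500, 500)]
print(sum_offset_lag(parts))          # 200
print(estimated_time_lag(200, 50.0))  # 4.0
```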

Testing

To minimize risk when migrating from a self-managed Apache Kafka cluster to Amazon MSK, AWS recommends migrating between identical Apache Kafka versions and then updating the version on Amazon MSK. Measure sustained performance rather than peak performance by running tests for an extended period of time: smaller brokers in the M5 family can temporarily achieve higher throughput and lower latencies by leveraging Amazon Elastic Block Store (Amazon EBS)-optimized networking and Amazon EC2 network burst credits, so short tests can overstate their sustained capability.

When running performance testing on Apache Kafka, consider the following:

  • Producer performance (throughput, latency)

  • Consumer performance (throughput, latency)

Monitor the following metrics:

  • Throughput by data volume (MB/sec).

  • Throughput by message count (messages/sec).

  • Total data transferred.

  • Total messages processed.

To test performance, download Apache Kafka into a test environment. The package includes the shell scripts kafka-producer-perf-test.sh and kafka-consumer-perf-test.sh, which you can use to test the performance of your producers and consumers.
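
For illustration, typical invocations of these scripts look like the following (the bootstrap broker string and topic name are placeholders; check each script's --help output for the flags supported by your Apache Kafka version):

```shell
# Produce 1 million 1 KB records as fast as possible (-1 = no throttle)
# and report throughput and latency percentiles.
bin/kafka-producer-perf-test.sh \
    --topic perf-test \
    --num-records 1000000 \
    --record-size 1024 \
    --throughput -1 \
    --producer-props bootstrap.servers=<bootstrap-brokers> acks=all

# Consume the same records and report consumer throughput.
bin/kafka-consumer-perf-test.sh \
    --topic perf-test \
    --messages 1000000 \
    --bootstrap-server <bootstrap-brokers>
```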