Amazon Managed Streaming for Apache Kafka (Amazon MSK)
Infrastructure
Before you can create an Amazon MSK cluster, you need to have an Amazon Virtual Private Cloud (Amazon VPC).
In this reference architecture, the brokers and Apache ZooKeeper nodes are deployed into an AWS service-managed VPC, dedicated to each cluster. You can limit access to the Apache ZooKeeper nodes by assigning them a separate security group. The brokers in the cluster are made accessible to clients in the VPC through elastic network interfaces that appear in the customer account. Traffic between clients and brokers is private by default, meaning it doesn't transfer over the public internet. The Security Groups on the network interfaces activate protocol and port-based traffic filtering on the brokers.
The IP addresses from the customer VPC are attached to the network interfaces, and by default all network traffic stays within the AWS network and is not accessible to the internet. Connections between clients and an Amazon MSK cluster stay private.
Amazon MSK provides control-plane operations for creating, updating, and deleting clusters. Customers use Apache Kafka data-plane operations for producing and consuming data. For a complete list of operations, see the Amazon MSK API Reference Guide. For more information on cluster provisioning, see Getting Started Using Amazon MSK.
At the time of this publication, Amazon MSK supports instance types from the M and T instance families. For the most up-to-date list of supported instance types, see Broker types.
Reliability
The following reliability best practices are recommended when running Amazon MSK:
-
Deploy Amazon MSK clusters across three Availability Zones to ensure high availability. This is the default cluster provisioning configuration.
-
Ensure a replication factor (RF) of 3, or at least 2, for each topic. An RF of 1 can lead to offline partitions during a rolling update or a broker failure.
-
Set minimum in-sync replicas (minISR) to at most RF - 1. A minISR that is equal to the RF can prevent producing messages to the cluster during a rolling update. With a minISR of 2, three-way replicated topics are available when one replica is offline.
-
Upgrade to the recommended version of Apache Kafka on Amazon MSK. Upgrading your cluster software ensures that you are using the latest stable features from open source applications, including:
-
Performance enhancements that help applications run faster.
-
Apache Kafka bug fixes, security fixes, and feature enhancements.
-
Depending on the scale of your workload, evaluate the use of multiple clusters. Amazon MSK uses a rolling broker update process for many of its operations, including version updates and updating broker types. Operations are carried out one broker at a time and can vary in cycle time. By using multiple clusters, you can parallelize these operations, reducing overall time to completion per cluster. In addition, you can reduce the impact of a cluster-level issue such as an improper configuration by segmenting the workload into multiple clusters.
Security
Amazon MSK has a range of tools and methods to help secure your data processing in the AWS Cloud.
Network Security
By default, clients can privately connect to an Amazon MSK cluster within their Amazon VPC, with traffic traversing only within the AWS network.
-
Enable Transport Layer Security (TLS) encryption between brokers and clients.
-
Use Security Group rules to control the protocols and port numbers allowed on your brokers.
-
Use Apache Kafka APIs to configure your cluster properties and restrict Apache ZooKeeper access with dedicated security groups.
Authentication and Authorization
Amazon MSK provides various methods for access management. AWS recommends using AWS Identity and Access Management (IAM).
Alternatively, you can use mutual Transport Layer Security (mTLS) with ACM Private CA or SASL/SCRAM authentication with Apache Kafka Access Control List (ACL) authorization.
Logging and Auditing
Amazon MSK makes it easy for you to troubleshoot clients and analyze communication by streaming Apache Kafka broker logs to Amazon CloudWatch.
Encryption
-
Encrypt data at rest using a customer-managed CMK.
-
Amazon MSK recommends enabling encryption-in-transit.
Cost
With Amazon MSK, there are no minimum fees.
You pay for the time your broker instances run and the storage you use monthly. Additionally, outside of Amazon MSK costs, there could be data transfer charges applied.
Note
Amazon MSK now supports Apache Kafka version 2.4.1
-
Use the Amazon MSK Sizing and Pricing spreadsheet to correctly size the cluster.
-
Configure the message retention period or log size to minimize storage costs.
-
Track Amazon MSK cluster costs with custom tagging.
-
Use T-family broker instance types for low-cost development, for testing small-to-medium streaming workloads, or for low-throughput streaming workloads that experience temporary spikes in throughput. M5 brokers have higher baseline throughput performance than T3 brokers and are recommended for production workloads.
Performance
Amazon MSK is designed to provide performance that is identical to an Apache Kafka cluster running on Amazon Elastic Compute Cloud (Amazon EC2).
-
Use the Amazon MSK Sizing and Pricing spreadsheet to right-size your cluster for performance.
-
Test your workloads after provisioning the cluster.
-
Use M5 broker instance types for production workloads.
-
Follow the Number of partitions per broker guide to get the expected performance. For guidance on partition counts, see the Kafka documentation. A high number of partitions leads to higher CPU usage and can delay metrics availability or cause cluster downtime. To mitigate this, reduce the number of partitions by deleting unused topics, or migrate to a larger instance type for your brokers.
-
Set minimum in-sync replicas (minISR) to at most RF - 1. A minISR that is equal to the RF can prevent producing to the cluster during a rolling update. With a minISR of 2, three-way replicated topics are available when one replica is offline.
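The minISR rule above can be illustrated with a small sketch. The function names are hypothetical, and the availability check assumes producers that require acknowledgement from all in-sync replicas (`acks=all`), which is the case where the minISR setting can block writes.

```python
def recommended_min_isr(replication_factor: int) -> int:
    """Return the largest minISR that still tolerates one offline replica."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    # At most RF - 1, but never below 1.
    return max(1, replication_factor - 1)


def writable_with_offline(replication_factor: int, min_isr: int, offline: int) -> bool:
    """True if enough replicas remain in sync to satisfy minISR."""
    return replication_factor - offline >= min_isr


print(recommended_min_isr(3))          # 2
print(writable_with_offline(3, 2, 1))  # True: one broker can restart
print(writable_with_offline(3, 3, 1))  # False: minISR equal to RF blocks writes
```

With an RF of 3 and a minISR of 2, one broker can be taken offline during a rolling update while writes continue. With a minISR of 3, the same update would block producers.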
Capacity
When building out capacity for an Amazon MSK cluster, consider the following two key factors.
Instance type for the broker
When creating an Amazon MSK cluster, you can specify the instance type of the brokers at cluster launch. You can start with a few brokers within an Amazon MSK cluster. Then, using the AWS Management Console or the AWS Command Line Interface (AWS CLI), you can scale up to 90 brokers per account and 30 brokers per cluster. These are soft limits, and can be adjusted by requesting a quota increase. Alternatively, you can scale your clusters by changing the size or family of your Apache Kafka brokers, which is a best practice because you don't need to reassign Apache Kafka partitions in order to scale up or down.
The M5 brokers have higher baseline throughput performance and
support more partitions per broker than the T3 brokers, and are
therefore recommended for production workloads. For more
information about M5 instance types, see
Amazon EC2 M5 Instances.
T3 brokers have the ability to use CPU credits to temporarily
burst performance. Use T3 brokers for low-cost development if
you are testing small to medium streaming workloads, or if you
have low-throughput streaming workloads that experience
temporary spikes in throughput. For more information about T3
instance types, see
Amazon EC2 T3 Instances.
The broker instance type affects the maximum number of partitions that each broker can support. The Number of partitions per broker guide gives the recommended maximum number of partitions (including replica partitions) per broker. Perform testing to determine the right instance type for your brokers.
Number of brokers per cluster
Use the Amazon MSK Sizing and Pricing spreadsheet to determine the number of brokers for your cluster.
Input parameters in the spreadsheet
Basic parameters:
-
Ingestion rate (MB/s): number of messages per second multiplied by the average message size going into the cluster.
-
Replication factor: replication factor setting for Apache Kafka.
-
Total data out (MB/s): similar to ingestion rate, but for messages leaving the cluster.
-
Retention: how long data is retained in the cluster, in hours. It is recommended to choose an average retention period across topics for the sizing exercise.
-
Encryption in-transit: TLS encryption in-transit activated. TLS encryption consumes additional CPU.
-
EBS disk utilization: Desired percentage usage of EBS volumes. It is good to have some headroom to scale. You can also leverage the auto expansion feature to set target disk utilization and maximum scaling capacity.
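As a rough illustration of how the basic parameters relate to each other, the sketch below derives an ingestion rate and a storage estimate. The formula is a simplified back-of-the-envelope assumption (decimal megabytes, no compression, no index overhead) and is not the calculation used inside the Sizing and Pricing spreadsheet. All function names and sample values are hypothetical.

```python
def ingestion_rate_mb_s(messages_per_sec: float, avg_message_bytes: float) -> float:
    """Ingestion rate in MB/s: message rate multiplied by average message size."""
    return messages_per_sec * avg_message_bytes / 1_000_000


def required_storage_gb(ingest_mb_s: float, replication_factor: int,
                        retention_hours: float, target_utilization: float) -> float:
    """Estimate cluster-wide provisioned storage, leaving headroom on the volumes."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    # Every ingested byte is stored once per replica for the whole retention window.
    retained_mb = ingest_mb_s * replication_factor * retention_hours * 3600
    return retained_mb / 1000 / target_utilization


rate = ingestion_rate_mb_s(messages_per_sec=10_000, avg_message_bytes=1_000)
storage = required_storage_gb(rate, replication_factor=3,
                              retention_hours=24, target_utilization=0.8)
print(f"{rate:.1f} MB/s ingest, about {storage:.0f} GB provisioned storage")
```

Treat the result as a starting point only, and confirm it against the spreadsheet and against tests on a provisioned cluster.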
Advanced parameters:
-
Number of AZs: AWS recommends three AZs.
-
Nearest replica fetching (Apache Kafka 2.4.x): reduce latency and cross-AZ data transfer.
-
Lagging consumer percentage.
-
Broker Safety Factor: Number of brokers that can go offline without affecting performance.
-
RAM data cached, in seconds.
Based on these inputs, the following outputs are returned:
-
Storage load (required, provisioned storage and Total IOPS)
-
Networking (Read/Write throughput, MB/s)
-
Data Transfer (Cross-AZ and inter-broker replication traffic)
-
Memory (RAM, GB)
Based on these outputs and the instance type selection in the summary section of the Sizing and Pricing spreadsheet, depending on production/test instances and number of partitions, the spreadsheet will provide the recommended number of brokers. The Sizing and Pricing spreadsheet is the result of running a test workload with three consumers, and ensuring that P99 write latencies are below 100 ms. This may not necessarily reflect your workload or performance expectations. Use this spreadsheet as a starting point and run performance tests for your workloads after provisioning the cluster.
It is possible to change the number of brokers in the Amazon MSK cluster after creation. After resizing, the cluster will need to be rebalanced by reassigning partitions to the new brokers.
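A minimal sketch of preparing such a reassignment follows. It builds a plan in the JSON layout accepted by the Apache Kafka `kafka-reassign-partitions.sh` tool. The round-robin placement, the topic name `orders`, and the broker IDs are illustrative assumptions; the sketch ignores Availability Zone placement and existing leader distribution, which a tool such as Cruise Control would take into account.

```python
import json


def round_robin_plan(topic: str, partitions: int, replication_factor: int,
                     broker_ids: list) -> dict:
    """Spread each partition's replicas across brokers in round-robin order."""
    n = len(broker_ids)
    if replication_factor > n:
        raise ValueError("replication factor cannot exceed the number of brokers")
    entries = []
    for p in range(partitions):
        # Consecutive brokers, starting one position later for each partition.
        replicas = [broker_ids[(p + i) % n] for i in range(replication_factor)]
        entries.append({"topic": topic, "partition": p, "replicas": replicas})
    return {"version": 1, "partitions": entries}


# Hypothetical cluster that grew from three to four brokers.
plan = round_robin_plan("orders", partitions=4, replication_factor=3,
                        broker_ids=[1, 2, 3, 4])
print(json.dumps(plan, indent=2))
```

The printed document could be saved to a file and reviewed before it is passed to the reassignment tool.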
Monitoring
There are two monitoring options: Amazon CloudWatch and Open Monitoring with Prometheus. If you already ingest metrics from Apache Kafka into a metrics solution such as DataDog or New Relic, it is recommended to use Open Monitoring with Prometheus to maintain compatibility throughout your migration, and to supplement these metrics with those offered in Amazon CloudWatch. The Open Monitoring with Prometheus option also provides an integration with Cruise Control to manage Apache Kafka partitions.
With Amazon CloudWatch, you can set the monitoring level for an Amazon MSK cluster to one of the following:
-
DEFAULT
-
PER_BROKER
-
PER_TOPIC_PER_BROKER
-
PER_TOPIC_PER_PARTITION
Note
The DEFAULT level is included with the Amazon MSK cluster, whereas there is a cost associated with the other monitoring levels. The PER_TOPIC_PER_PARTITION level is only relevant for consumer lag.
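The sketch below validates a monitoring level and assembles request parameters for changing it. The parameter names follow the Amazon MSK `UpdateMonitoring` operation as the author of this sketch understands it, so verify them against the Amazon MSK API Reference Guide before use. The ARN and version string are placeholders, and the function name is hypothetical.

```python
MONITORING_LEVELS = (
    "DEFAULT",
    "PER_BROKER",
    "PER_TOPIC_PER_BROKER",
    "PER_TOPIC_PER_PARTITION",
)


def monitoring_update_params(cluster_arn: str, current_version: str, level: str) -> dict:
    """Build request parameters for changing a cluster's monitoring level."""
    if level not in MONITORING_LEVELS:
        raise ValueError(f"unknown monitoring level: {level!r}")
    return {
        "ClusterArn": cluster_arn,
        "CurrentVersion": current_version,
        "EnhancedMonitoring": level,
    }


# Placeholder values, not a real cluster.
params = monitoring_update_params(
    "arn:aws:kafka:us-east-1:111122223333:cluster/example/abcd1234",
    "K3AEGXETSR30VB",
    "PER_BROKER",
)
print(params)
# Assumption: with boto3 installed and credentials configured, these could be
# passed as boto3.client("kafka").update_monitoring(**params).
```

Validating the level locally avoids a failed API call, and keeps the paid levels an explicit choice in code review.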
Consumer-lag metrics quantify the difference between the latest data written to your topics and the data read by your applications. Amazon MSK provides the following consumer-lag metrics, which you can get through Amazon CloudWatch or through Open Monitoring with Prometheus:
-
OffsetLag: Partition-level consumer lag in number of offsets.
-
MaxOffsetLag: The maximum offset lag across all partitions in a topic.
-
EstimatedMaxTimeLag: Time estimate (in seconds) to drain MaxOffsetLag.
-
EstimatedTimeLag: Time estimate (in seconds) to drain the partition offset lag.
-
SumOffsetLag: The aggregated offset lag for all the partitions in a topic.
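To make the offset-based definitions concrete, the following sketch computes them for one topic from per-partition offsets. It illustrates the definitions only and is not how Amazon MSK derives the metrics; the offsets are made-up sample values, and the time-based estimates are omitted because their calculation is not described here.

```python
def offset_lags(end_offsets: dict, committed_offsets: dict) -> dict:
    """Per-partition lag: latest written offset minus the consumer's committed offset."""
    return {
        partition: max(0, end_offsets[partition] - committed_offsets[partition])
        for partition in end_offsets
    }


# Hypothetical offsets for a three-partition topic and one consumer group.
end_offsets = {0: 1500, 1: 900, 2: 1200}
committed_offsets = {0: 1450, 1: 900, 2: 1000}

lags = offset_lags(end_offsets, committed_offsets)
max_offset_lag = max(lags.values())
sum_offset_lag = sum(lags.values())

print(lags)            # {0: 50, 1: 0, 2: 200}
print(max_offset_lag)  # 200
print(sum_offset_lag)  # 250
```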
Recommended metrics to monitor with Amazon CloudWatch
-
KafkaDataLogsDiskUsed: The cluster should not reach or exceed 85% disk space usage. Higher disk space usage can impact receiving and transmitting data. Configure an Amazon CloudWatch alarm that monitors the KafkaDataLogsDiskUsed metric and alerts you if the disk space used is too high. To mitigate storage issues, you can do one of the following:
-
Seamlessly scale up the amount of storage provisioned per broker.
-
Create an auto scaling policy to automatically expand your storage.
-
Change the retention period of messages and logs.
-
Delete unused topics.
-
UnderMinIsrPartitionCount: The in-sync replica (ISR) set contains the replicas that are up to date with the leader. This metric counts the partitions whose ISR count is below the configured minimum. The expected value is zero, so set an alert on any non-zero value.
-
ActiveControllerCount: Monitor ActiveControllerCount and set an alarm that alerts you if the ActiveControllerCount (Maximum) statistic is greater than 1.
-
OfflinePartitionsCount: The number of partitions that don’t have an active leader and are therefore not writable or readable. Configure an alert on any value greater than 0.
-
UnderReplicatedPartitions: The number of in-sync replicas (ISRs) should equal the total number of replicas. As a general best practice, set an alert on any value greater than 0.
-
CPU: Amazon MSK strongly recommends that you maintain the total CPU utilization for your brokers under 60%.
-
RequestTime: The average time, in milliseconds, spent in broker network and I/O threads to process requests from a client group.
-
BytesInPerSec/BytesOutPerSec: The number of bytes per second received from clients and sent to clients.
For more information, see Amazon MSK Metrics for Monitoring with CloudWatch.
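The thresholds recommended above can be captured as a small rule table, sketched here. This is an illustrative check over already-collected metric values, not a CloudWatch alarm definition. The key `CPU` mirrors the label used in this section for total broker CPU utilization, and the sample readings are invented.

```python
# Each rule returns True when the value breaches the recommended threshold.
ALERT_RULES = {
    "KafkaDataLogsDiskUsed": lambda v: v >= 85,     # percent of disk used
    "UnderMinIsrPartitionCount": lambda v: v > 0,
    "ActiveControllerCount": lambda v: v > 1,       # Maximum statistic
    "OfflinePartitionsCount": lambda v: v > 0,
    "UnderReplicatedPartitions": lambda v: v > 0,
    "CPU": lambda v: v >= 60,                       # total CPU utilization, percent
}


def breached_metrics(samples: dict) -> list:
    """Return the names of sampled metrics that breach their rule, sorted."""
    return sorted(
        name for name, value in samples.items()
        if name in ALERT_RULES and ALERT_RULES[name](value)
    )


# Hypothetical readings from one broker.
samples = {
    "KafkaDataLogsDiskUsed": 90,
    "OfflinePartitionsCount": 0,
    "UnderReplicatedPartitions": 2,
    "CPU": 45,
}
print(breached_metrics(samples))  # ['KafkaDataLogsDiskUsed', 'UnderReplicatedPartitions']
```

Keeping the thresholds in one table makes it easier to review them alongside the alarms configured in Amazon CloudWatch.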
Testing
To minimize risks when migrating from a self-managed Apache Kafka
cluster to Amazon MSK, AWS recommends migrating between identical
Apache Kafka versions and then updating the version on Amazon MSK.
It is also recommended to measure
sustained performance, not peak performance, by running tests for
an extended period of time. Smaller brokers in the M5 family can
achieve higher throughput and lower latencies by leveraging
Amazon Elastic Block Store (Amazon EBS).
When running performance testing on Apache Kafka, consider the following:
-
Producer performance (throughput, latency)
-
Consumer performance (throughput, latency)
Monitor the following metrics:
-
Throughput (MB/sec) for size of data.
-
Throughput (messages/sec) for number of messages.
-
Total data.
-
Total messages.
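The throughput figures listed above follow directly from the totals and the length of the test run, as in this sketch. The function name and the sample totals are hypothetical, and megabytes are taken as decimal (1,000,000 bytes).

```python
def throughput(total_messages: int, total_bytes: int, elapsed_seconds: float) -> dict:
    """Derive sustained throughput from the totals of a completed test run."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return {
        "messages_per_sec": total_messages / elapsed_seconds,
        "mb_per_sec": total_bytes / 1_000_000 / elapsed_seconds,
    }


# Hypothetical 10-minute run: 6 million messages of 1,000 bytes each.
result = throughput(total_messages=6_000_000,
                    total_bytes=6_000_000_000,
                    elapsed_seconds=600)
print(result)  # {'messages_per_sec': 10000.0, 'mb_per_sec': 10.0}
```

Computing the rate over the full run, rather than over a short window, reflects sustained rather than peak performance.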
To test performance, download Apache Kafka, which includes kafka-producer-perf-test.sh and kafka-consumer-perf-test.sh. These scripts can be used to test the performance