Monitoring an Amazon MSK Cluster

Amazon MSK gathers Apache Kafka metrics and sends them to Amazon CloudWatch where you can view them. For more information about Apache Kafka metrics, including the ones that Amazon MSK surfaces, see Monitoring in the Apache Kafka documentation.

You can also monitor your MSK cluster with Prometheus, an open-source monitoring application. For information about Prometheus, see Overview in the Prometheus documentation. To learn how to monitor your cluster with Prometheus, see Open Monitoring with Prometheus.

Amazon MSK Monitoring Levels for CloudWatch Metrics

When creating an Amazon MSK cluster, you can set the enhancedMonitoring property to one of three monitoring levels: DEFAULT, PER_BROKER, or PER_TOPIC_PER_BROKER. The tables in the following section show all the metrics that Amazon MSK makes available starting at each monitoring level.
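
For example, the following AWS CLI sketch creates a cluster with PER_BROKER monitoring. The cluster name, broker-node-group file, and Apache Kafka version are placeholder values for illustration, not requirements of this guide:

    aws kafka create-cluster \
        --cluster-name my-cluster \
        --broker-node-group-info file://brokernodegroupinfo.json \
        --kafka-version "2.2.1" \
        --number-of-broker-nodes 3 \
        --enhanced-monitoring PER_BROKER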

Amazon MSK Metrics for Monitoring with CloudWatch

Amazon MSK integrates with Amazon CloudWatch metrics so that you can collect, view, and analyze CloudWatch metrics for your Amazon MSK cluster. The metrics that you configure for your MSK cluster are automatically collected and pushed to CloudWatch. The following three tables show the metrics that become available at each of the three monitoring levels.

DEFAULT Level Monitoring

The metrics described in the following table are available at the DEFAULT monitoring level. They are free.

Metrics available at the DEFAULT monitoring level
Name When Visible Dimensions Description
ActiveControllerCount After the cluster gets to the ACTIVE state. Cluster Name The number of active controllers in the cluster. Only one controller per cluster should be active at any given time.
CpuIdle After the cluster gets to the ACTIVE state. Cluster Name, Broker ID The percentage of CPU idle time.
CpuSystem After the cluster gets to the ACTIVE state. Cluster Name, Broker ID The percentage of CPU in kernel space.
CpuUser After the cluster gets to the ACTIVE state. Cluster Name, Broker ID The percentage of CPU in user space.
GlobalPartitionCount After the cluster gets to the ACTIVE state. Cluster Name Total number of partitions across all brokers in the cluster.
GlobalTopicCount After the cluster gets to the ACTIVE state. Cluster Name Total number of topics across all brokers in the cluster.
KafkaAppLogsDiskUsed After the cluster gets to the ACTIVE state. Cluster Name, Broker ID The percentage of disk space used for application logs.
KafkaDataLogsDiskUsed After the cluster gets to the ACTIVE state. Cluster Name, Broker ID The percentage of disk space used for data logs.
MemoryBuffered After the cluster gets to the ACTIVE state. Cluster Name, Broker ID The size in bytes of buffered memory for the broker.
MemoryCached After the cluster gets to the ACTIVE state. Cluster Name, Broker ID The size in bytes of cached memory for the broker.
MemoryFree After the cluster gets to the ACTIVE state. Cluster Name, Broker ID The size in bytes of memory that is free and available for the broker.
MemoryUsed After the cluster gets to the ACTIVE state. Cluster Name, Broker ID The size in bytes of memory that is in use for the broker.
NetworkRxDropped After the cluster gets to the ACTIVE state. Cluster Name, Broker ID The number of dropped receive packets.
NetworkRxErrors After the cluster gets to the ACTIVE state. Cluster Name, Broker ID The number of network receive errors for the broker.
NetworkRxPackets After the cluster gets to the ACTIVE state. Cluster Name, Broker ID The number of packets received by the broker.
NetworkTxDropped After the cluster gets to the ACTIVE state. Cluster Name, Broker ID The number of dropped transmit packets.
NetworkTxErrors After the cluster gets to the ACTIVE state. Cluster Name, Broker ID The number of network transmit errors for the broker.
NetworkTxPackets After the cluster gets to the ACTIVE state. Cluster Name, Broker ID The number of packets transmitted by the broker.
OfflinePartitionsCount After the cluster gets to the ACTIVE state. Cluster Name Total number of partitions that are offline in the cluster.
RootDiskUsed After the cluster gets to the ACTIVE state. Cluster Name, Broker ID The percentage of the root disk used by the broker.
SwapFree After the cluster gets to the ACTIVE state. Cluster Name, Broker ID The size in bytes of swap memory that is available for the broker.

SwapUsed After the cluster gets to the ACTIVE state. Cluster Name, Broker ID The size in bytes of swap memory that is in use for the broker.
ZooKeeperRequestLatencyMsMean After the cluster gets to the ACTIVE state. Cluster Name, Broker ID Mean latency in milliseconds for ZooKeeper requests from the broker.
ZooKeeperSessionState After the cluster gets to the ACTIVE state. Cluster Name, Broker ID Connection status of the broker's ZooKeeper session, which may be one of the following: NOT_CONNECTED: '0.0', ASSOCIATING: '0.1', CONNECTING: '0.5', CONNECTEDREADONLY: '0.8', CONNECTED: '1.0', CLOSED: '5.0', AUTH_FAILED: '10.0'.

PER_BROKER Level Monitoring

When you set the monitoring level to PER_BROKER, you get the metrics described in the following table in addition to all the DEFAULT level metrics. You pay for the metrics in the following table, whereas the DEFAULT level metrics continue to be free. The metrics in this table have the following dimensions: Cluster Name, Broker ID.

Additional metrics that are available starting at the PER_BROKER monitoring level
Name When Visible Description
BytesInPerSec After you create a topic. The number of bytes per second received from clients.
BytesOutPerSec After you create a topic. The number of bytes per second sent to clients.
FetchConsumerLocalTimeMsMean After there's a producer/consumer. The mean time in milliseconds that the consumer request is processed at the leader.
FetchConsumerRequestQueueTimeMsMean After there's a producer/consumer. The mean time in milliseconds that the consumer request waits in the request queue.
FetchConsumerResponseQueueTimeMsMean After there's a producer/consumer. The mean time in milliseconds that the consumer request waits in the response queue.
FetchConsumerResponseSendTimeMsMean After there's a producer/consumer. The mean time in milliseconds for the consumer to send a response.
FetchConsumerTotalTimeMsMean After there's a producer/consumer. The mean total time in milliseconds that consumers spend on fetching data from the broker.
FetchFollowerLocalTimeMsMean After there's a producer/consumer. The mean time in milliseconds that the follower request is processed at the leader.
FetchFollowerRequestQueueTimeMsMean After there's a producer/consumer. The mean time in milliseconds that the follower request waits in the request queue.
FetchFollowerResponseQueueTimeMsMean After there's a producer/consumer. The mean time in milliseconds that the follower request waits in the response queue.
FetchFollowerResponseSendTimeMsMean After there's a producer/consumer. The mean time in milliseconds for the follower to send a response.
FetchFollowerTotalTimeMsMean After there's a producer/consumer. The mean total time in milliseconds that followers spend on fetching data from the broker.
FetchMessageConversionsPerSec After you create a topic. The number of fetch message conversions per second for the broker.
FetchThrottleByteRate After bandwidth throttling is applied. The number of throttled bytes per second.
FetchThrottleQueueSize After bandwidth throttling is applied. The number of messages in the throttle queue.
FetchThrottleTime After bandwidth throttling is applied. The average fetch throttle time in milliseconds.
LeaderCount After the cluster gets to the ACTIVE state. The number of leader replicas.
MessagesInPerSec After the cluster gets to the ACTIVE state. The number of incoming messages per second for the broker.
NetworkProcessorAvgIdlePercent After the cluster gets to the ACTIVE state. The average percentage of the time the network processors are idle.
PartitionCount After the cluster gets to the ACTIVE state. The number of partitions for the broker.
ProduceLocalTimeMsMean After the cluster gets to the ACTIVE state. The mean time in milliseconds that the produce request is processed at the leader.
ProduceMessageConversionsPerSec After you create a topic. The number of produce message conversions per second for the broker.
ProduceMessageConversionsTimeMsMean After the cluster gets to the ACTIVE state. The mean time in milliseconds spent on message format conversions.
ProduceRequestQueueTimeMsMean After the cluster gets to the ACTIVE state. The mean time in milliseconds that request messages spend in the queue.
ProduceResponseQueueTimeMsMean After the cluster gets to the ACTIVE state. The mean time in milliseconds that response messages spend in the queue.
ProduceResponseSendTimeMsMean After the cluster gets to the ACTIVE state. The mean time in milliseconds spent on sending response messages.
ProduceThrottleByteRate After bandwidth throttling is applied. The number of throttled bytes per second.
ProduceThrottleQueueSize After bandwidth throttling is applied. The number of messages in the throttle queue.
ProduceThrottleTime After bandwidth throttling is applied. The average produce throttle time in milliseconds.
ProduceTotalTimeMsMean After the cluster gets to the ACTIVE state. The mean produce time in milliseconds.
RequestBytesMean After the cluster gets to the ACTIVE state. The mean number of request bytes for the broker.
RequestExemptFromThrottleTime After request throttling is applied. The average time in milliseconds spent in broker network and I/O threads to process requests that are exempt from throttling.
RequestHandlerAvgIdlePercent After the cluster gets to the ACTIVE state. The average percentage of the time the request handler threads are idle.
RequestThrottleQueueSize After request throttling is applied. The number of messages in the throttle queue.
RequestThrottleTime After request throttling is applied. The average request throttle time in milliseconds.
RequestTime After request throttling is applied. The average time in milliseconds spent in broker network and I/O threads to process requests.
UnderMinIsrPartitionCount After the cluster gets to the ACTIVE state. The number of under minIsr partitions for the broker.
UnderReplicatedPartitions After the cluster gets to the ACTIVE state. The number of under-replicated partitions for the broker.

PER_TOPIC_PER_BROKER Level Monitoring

When you set the monitoring level to PER_TOPIC_PER_BROKER, you get the metrics described in the following table, in addition to all the metrics from the PER_BROKER and DEFAULT levels. Only the DEFAULT level metrics are free. The metrics in this table have the following dimensions: Cluster Name, Broker ID, Topic.

Important

For an Amazon MSK cluster that uses Apache Kafka 2.4.1 or a newer version, the metrics in the following table appear only after their values become nonzero for the first time. For example, to see BytesInPerSec, one or more producers must first send data to the cluster.

Additional metrics that are available starting at the PER_TOPIC_PER_BROKER monitoring level
Name When Visible Description
BytesInPerSec After you create a topic. The number of bytes received per second.
BytesOutPerSec After you create a topic. The number of bytes sent per second.
FetchMessageConversionsPerSec After you create a topic. The number of fetched messages converted per second.
MessagesInPerSec After you create a topic. The number of messages received per second.
ProduceMessageConversionsPerSec After you create a topic. The number of conversions per second for produced messages.

Viewing Amazon MSK Metrics Using CloudWatch

You can monitor metrics for Amazon MSK using the CloudWatch console, the command line, or the CloudWatch API. The following procedures show you how to access metrics using these different methods.

To access metrics using the CloudWatch console

Sign in to the AWS Management Console and open the CloudWatch console at https://console.aws.amazon.com/cloudwatch/.

  1. In the navigation pane, choose Metrics.

  2. Choose the All metrics tab, and then choose AWS/Kafka.

  3. To view topic-level metrics, choose Topic, Broker ID, Cluster Name; for broker-level metrics, choose Broker ID, Cluster Name; and for cluster-level metrics, choose Cluster Name.

  4. (Optional) In the graph pane, select a statistic and a time period, and then create a CloudWatch alarm using these settings.
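
You can also create an alarm from the command line. The following put-metric-alarm sketch uses placeholder cluster, alarm, and SNS topic names; it notifies you when any partition in the cluster goes offline:

    aws cloudwatch put-metric-alarm \
        --alarm-name msk-offline-partitions \
        --namespace AWS/Kafka \
        --metric-name OfflinePartitionsCount \
        --dimensions Name="Cluster Name",Value=my-cluster \
        --statistic Maximum \
        --period 300 \
        --evaluation-periods 1 \
        --threshold 0 \
        --comparison-operator GreaterThanThreshold \
        --alarm-actions arn:aws:sns:us-east-1:111122223333:my-alerts-topic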

To access metrics using the AWS CLI

Use the list-metrics and get-metric-statistics commands.
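
For example, the following sketch lists the metrics that Amazon MSK has published and then retrieves five-minute averages of CpuUser for one broker. The cluster name, broker ID, and time range are placeholder values:

    # List all metrics in the AWS/Kafka namespace.
    aws cloudwatch list-metrics --namespace AWS/Kafka

    # Get 5-minute averages of CpuUser for broker 1 of "my-cluster".
    aws cloudwatch get-metric-statistics \
        --namespace AWS/Kafka \
        --metric-name CpuUser \
        --dimensions Name="Cluster Name",Value=my-cluster Name="Broker ID",Value=1 \
        --start-time 2019-07-01T00:00:00Z \
        --end-time 2019-07-01T01:00:00Z \
        --period 300 \
        --statistics Average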

To access metrics using the CloudWatch CLI

Use the mon-list-metrics and mon-get-stats commands.

To access metrics using the CloudWatch API

Use the ListMetrics and GetMetricStatistics operations.

Consumer-Lag Checking with Burrow

Burrow is a monitoring companion for Apache Kafka that provides consumer-lag checking. Burrow has a modular design that includes the following subsystems:

  • The clusters subsystem runs an Apache Kafka client that periodically updates topic lists and the current HEAD offset (the most recent offset) for every partition.

  • The consumers subsystem fetches information about consumer groups from a repository. This repository can be an Apache Kafka cluster (consuming the __consumer_offsets topic), ZooKeeper, or some other repository.

  • The storage subsystem stores all of this information in Burrow.

  • The evaluator subsystem retrieves information from the storage subsystem for a specific consumer group and calculates the status of that group. This follows the consumer lag evaluation rules.

  • The notifier subsystem requests status on consumer groups according to a configured interval and sends out notifications (Email, HTTP, or some other method) for groups that meet the configured criteria.

  • The HTTP Server subsystem provides an API interface to Burrow for fetching information about clusters and consumers.

For more information about Burrow, see Burrow - Kafka Consumer Lag Checking.

Important

Make sure that Burrow is compatible with the version of Apache Kafka that you are using for your MSK cluster.

To set up and use Burrow with Amazon MSK

Follow this procedure if you use plaintext communication. If you use TLS, also complete the additional steps in the procedure that follows this one.

  1. Create an MSK cluster and launch a client machine in the same VPC as the cluster. For example, you can follow the instructions at Getting Started Using Amazon MSK.

  2. Run the following command on the EC2 instance that serves as your client machine.

    sudo yum install go
  3. Run the following command on the client machine to get the Burrow project.

    go get github.com/linkedin/Burrow
  4. Run the following command to install dep. It installs the dep binary at /home/ec2-user/go/bin/dep.

    curl https://raw.githubusercontent.com/golang/dep/master/install.sh | sh
  5. Go to the /home/ec2-user/go/src/github.com/linkedin/Burrow folder and run the following command.

    /home/ec2-user/go/bin/dep ensure
  6. Run the following command in the same folder.

    go install
  7. Open the /home/ec2-user/go/src/github.com/linkedin/Burrow/config/burrow.toml configuration file for editing. In the following sections of the configuration file, replace the placeholders with the name of your MSK cluster, the host:port pairs for your ZooKeeper servers, and your bootstrap brokers.

    To get your ZooKeeper host:port pairs, describe your MSK cluster and look for the value of ZookeeperConnectString. See Getting the Apache ZooKeeper Connection String for an Amazon MSK Cluster.

    To get your bootstrap brokers, see Getting the Bootstrap Brokers for an Amazon MSK Cluster.

    Follow the formatting shown below when you edit the configuration file.

    [zookeeper]
    servers=[ "ZooKeeper-host-port-pair-1", "ZooKeeper-host-port-pair-2", "ZooKeeper-host-port-pair-3" ]
    timeout=6
    root-path="/burrow"

    [client-profile.test]
    client-id="burrow-test"
    kafka-version="0.10.0"

    [cluster.MSK-cluster-name]
    class-name="kafka"
    servers=[ "bootstrap-broker-host-port-pair-1", "bootstrap-broker-host-port-pair-2", "bootstrap-broker-host-port-pair-3" ]
    client-profile="test"
    topic-refresh=120
    offset-refresh=30

    [consumer.MSK-cluster-name]
    class-name="kafka"
    cluster="MSK-cluster-name"
    servers=[ "bootstrap-broker-host-port-pair-1", "bootstrap-broker-host-port-pair-2", "bootstrap-broker-host-port-pair-3" ]
    client-profile="test"
    group-blacklist="^(console-consumer-|python-kafka-consumer-|quick-).*$"
    group-whitelist=""
  8. In the go/bin folder, run the following command.

    ./Burrow --config-dir /home/ec2-user/go/src/github.com/linkedin/Burrow/config
  9. Check for errors in the bin/log/burrow.log file.

  10. You can use the following command to test your setup.

    curl -XGET 'http://your-localhost-ip:8000/v3/kafka'
  11. For all of the supported HTTP requests and links, see Burrow HTTP Endpoint.
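
Once Burrow is running, requests like the following exercise the consumer-lag API. The paths come from the Burrow v3 HTTP endpoint documentation; the cluster and consumer-group names are placeholders that must match your burrow.toml:

    # List the consumer groups that Burrow found in the cluster.
    curl -XGET 'http://your-localhost-ip:8000/v3/kafka/MSK-cluster-name/consumer'

    # Get the evaluated lag status for one consumer group.
    curl -XGET 'http://your-localhost-ip:8000/v3/kafka/MSK-cluster-name/consumer/your-consumer-group/lag'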

To use Burrow with TLS

In addition to the previous procedure, see the following steps if you are using TLS communication.

  1. Run the following command.

    sudo yum install java-1.8.0-openjdk-devel -y
  2. Run the following command after you adjust the paths as necessary.

    find /usr/lib/jvm/ -name "cacerts" -exec cp {} /tmp/kafka.client.truststore.jks \;
  3. In the next step you use the keytool command, which asks for a password. The default password is changeit. We recommend that you run the following command to change the password before you proceed to the next step.

    keytool -keystore /tmp/kafka.client.truststore.jks -storepass changeit -storepasswd -new Password
  4. Run the following command.

    keytool -list -rfc -keystore /tmp/kafka.client.truststore.jks >/tmp/truststore.pem

    You need truststore.pem for the burrow.toml file that's described later in this procedure.

  5. To generate the certfile and the keyfile, use the code at Managing Client Certificates for Mutual Authentication with Amazon MSK. You need the pem flag.
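
    If your client certificate and private key are in a PKCS12 keystore instead, one possible way to produce the two PEM files is with openssl. The keystore path here is an assumption for illustration, not a value from this guide:

    # Extract the private key as unencrypted PEM.
    openssl pkcs12 -in /tmp/client.p12 -nocerts -nodes -out /tmp/private_key.pem

    # Extract the client certificate as PEM.
    openssl pkcs12 -in /tmp/client.p12 -nokeys -out /tmp/client_cert.pem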

  6. Set up your burrow.toml file like the following example. You can add multiple cluster and consumer sections to monitor multiple MSK clusters with one Burrow deployment. You can also adjust the Apache Kafka version under client-profile, which represents the Apache Kafka client version to support. For more information, see Client Profile on the Burrow GitHub.

    [general]
    pidfile="burrow.pid"
    stdout-logfile="burrow.out"

    [logging]
    filename="/tmp/burrow.log"
    level="info"
    maxsize=100
    maxbackups=30
    maxage=10
    use-localtime=false
    use-compression=true

    [zookeeper]
    servers=[ "ZooKeeperConnectionString" ]
    timeout=6
    root-path="/burrow"

    [client-profile.msk1-client]
    client-id="burrow-test"
    tls="msk-mTLS"
    kafka-version="2.0.0"

    [cluster.msk1]
    class-name="kafka"
    servers=[ "BootstrapBrokerString" ]
    client-profile="msk1-client"
    topic-refresh=120
    offset-refresh=30

    [consumer.msk1-cons]
    class-name="kafka"
    cluster="msk1"
    servers=[ "BootstrapBrokerString" ]
    client-profile="msk1-client"
    group-blacklist="^(console-consumer-|python-kafka-consumer-|quick-).*$"
    group-whitelist=""

    [httpserver.default]
    address=":8000"

    [storage.default]
    class-name="inmemory"
    workers=20
    intervals=15
    expire-group=604800
    min-distance=1

    [tls.msk-mTLS]
    certfile="/tmp/client_cert.pem"
    keyfile="/tmp/private_key.pem"
    cafile="/tmp/truststore.pem"
    noverify=false

Open Monitoring with Prometheus

You can monitor your MSK cluster with Prometheus, an open-source monitoring system for time-series metric data. You can also use tools that are compatible with Prometheus-formatted metrics or tools that integrate with Amazon MSK Open Monitoring, like Datadog, Lenses, New Relic, and Sumo Logic. Open monitoring is available for free, but charges apply for the transfer of data across Availability Zones. For information about Prometheus, see the Prometheus documentation.

Creating an Amazon MSK Cluster with Open Monitoring Enabled

Using the AWS Management Console

  1. Sign in to the AWS Management Console, and open the Amazon MSK console at https://console.aws.amazon.com/msk/home?region=us-east-1#/home/.

  2. In the Monitoring section, select the check box next to Enable open monitoring with Prometheus.

  3. Provide the required information in all the sections of the page, and review all the available options.

  4. Choose Create cluster.

Using the AWS CLI

  • Invoke the create-cluster command and specify its open-monitoring option. Enable the JmxExporter, the NodeExporter, or both. If you specify open-monitoring, you can't disable both exporters at the same time. See the example below.
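
The following AWS CLI sketch enables both exporters at cluster creation. The cluster name, broker-node-group file, and Apache Kafka version are placeholder values for illustration:

    aws kafka create-cluster \
        --cluster-name my-cluster \
        --broker-node-group-info file://brokernodegroupinfo.json \
        --kafka-version "2.2.1" \
        --number-of-broker-nodes 3 \
        --open-monitoring '{"Prometheus":{"JmxExporter":{"EnabledInBroker":true},"NodeExporter":{"EnabledInBroker":true}}}'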

Using the API

  • Invoke the CreateCluster operation and specify OpenMonitoring. Enable the jmxExporter, the nodeExporter, or both. If you specify OpenMonitoring, you can't disable both exporters at the same time.

Enabling Open Monitoring for an Existing Amazon MSK Cluster

To enable open monitoring, make sure that the cluster is in the ACTIVE state.

Using the AWS Management Console

  1. Sign in to the AWS Management Console, and open the Amazon MSK console at https://console.aws.amazon.com/msk/home?region=us-east-1#/home/.

  2. Choose the name of the cluster that you want to update. This takes you to the Details page for the cluster.

  3. On the Details tab, scroll down to find the Monitoring section.

  4. Choose Edit.

  5. Select the check box next to Enable open monitoring with Prometheus.

  6. Choose Save changes.

Using the AWS CLI

  • Invoke the update-monitoring command and specify its open-monitoring option. Enable the JmxExporter, the NodeExporter, or both. If you specify open-monitoring, you can't disable both exporters at the same time. See the example below.
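
The following AWS CLI sketch enables both exporters on an existing cluster. The cluster ARN and current cluster version are placeholders; you can retrieve both from the describe-cluster command:

    aws kafka update-monitoring \
        --cluster-arn ClusterArn \
        --current-version Current-Cluster-Version \
        --open-monitoring '{"Prometheus":{"JmxExporter":{"EnabledInBroker":true},"NodeExporter":{"EnabledInBroker":true}}}'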

Using the API

  • Invoke the UpdateMonitoring operation and specify OpenMonitoring. Enable the jmxExporter, the nodeExporter, or both. If you specify OpenMonitoring, you can't disable both exporters at the same time.

Setting Up a Prometheus Host on an Amazon EC2 Instance

  1. Download the Prometheus server from https://prometheus.io/download/#prometheus to your Amazon EC2 instance.

  2. Extract the downloaded file to a directory and go to that directory.

  3. Create a file with the following contents and name it prometheus.yml.

    # file: prometheus.yml
    # my global config
    global:
      scrape_interval: 10s

    # A scrape configuration containing exactly one endpoint to scrape:
    # Here it's Prometheus itself.
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: 'prometheus'
        static_configs:
          # 9090 is the prometheus server port
          - targets: ['localhost:9090']
      - job_name: 'broker'
        file_sd_configs:
          - files:
              - 'targets.json'
  4. Use the ListNodes operation to get a list of your cluster's brokers.
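
    For example, with the AWS CLI equivalent, list-nodes, the following sketch prints the broker DNS endpoints. The cluster ARN is a placeholder, and the --query path reflects the BrokerNodeInfo shape of the list-nodes response:

    aws kafka list-nodes \
        --cluster-arn ClusterArn \
        --query 'NodeInfoList[*].BrokerNodeInfo.Endpoints[]' \
        --output text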

  5. Create a file named targets.json with the following JSON. Replace broker_dns_1, broker_dns_2, and the rest of the broker DNS names with the DNS names you obtained for your brokers in the previous step. Include all of the brokers you obtained in the previous step. Amazon MSK uses port 11001 for the JMX Exporter and port 11002 for the Node Exporter.

    [ { "labels": { "job": "jmx" }, "targets": [ "broker_dns_1:11001", "broker_dns_2:11001", . . . "broker_dns_N:11001" ] }, { "labels": { "job": "node" }, "targets": [ "broker_dns_1:11002", "broker_dns_2:11002", . . . "broker_dns_N:11002" ] } ]
  6. To start the Prometheus server on your Amazon EC2 instance, run the following command in the directory where you extracted the Prometheus files and saved prometheus.yml and targets.json.

    ./prometheus
  7. Find the IPv4 public IP address of the Amazon EC2 instance where you ran Prometheus in the previous step. You need this public IP address in the following step.

  8. To access the Prometheus web UI, open a browser that can access your Amazon EC2 instance, and go to Prometheus-Instance-Public-IP:9090, where Prometheus-Instance-Public-IP is the public IP address you got in the previous step.
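
You can also check scraping from the command line through the standard Prometheus HTTP API. For example, the following request (with the same placeholder public IP) queries the built-in up metric, which reports 1 for every target that Prometheus can scrape:

    curl 'http://Prometheus-Instance-Public-IP:9090/api/v1/query?query=up'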

Prometheus Metrics

All metrics emitted by Apache Kafka to JMX are accessible using open monitoring with Prometheus. For information about Apache Kafka metrics, see Monitoring in the Apache Kafka documentation. In addition, you can use the Prometheus Node Exporter to get CPU and disk metrics for the broker nodes.
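
To verify that the exporters are reachable from your Prometheus host, you can query them directly with curl. This sketch assumes the default Open Monitoring ports described earlier and the conventional /metrics path that Prometheus exporters serve:

    # JMX Exporter: Apache Kafka metrics published over JMX.
    curl broker_dns_1:11001/metrics

    # Node Exporter: CPU, memory, disk, and network metrics for the broker host.
    curl broker_dns_1:11002/metrics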