Troubleshooting MSK Replicator - Amazon Managed Streaming for Apache Kafka

Troubleshooting MSK Replicator

The following information can help you troubleshoot problems that you might have with MSK Replicator. You can also post your issue to AWS re:Post.

MSK Replicator state goes from CREATING to FAILED

Here are some common causes for MSK Replicator creation failure.

  1. Verify that the security groups you provided for the Replicator creation in the Target cluster section have outbound rules to allow traffic to your target cluster's security groups. Also verify that your target cluster's security groups have inbound rules that accept traffic from the security groups you provide for the Replicator creation in the Target cluster section. See Choose your target cluster.

  2. If you are creating Replicator for cross-region replication, verify that your source cluster has multi-VPC connectivity turned on for IAM Access Control authentication method. See Amazon MSK multi-VPC private connectivity in a single Region. Also verify that the cluster policy is setup on the source cluster so that the MSK Replicator can connect to the source cluster. See Step 1: Prepare the Amazon MSK source cluster.

  3. Verify that the IAM role that you provided during MSK Replicator creation has the permissions required to read and write to your source and target clusters. Also, verify that the IAM role has permissions to write to topics. See Configure replicator settings and permissions

  4. Verify that your network ACLs are not blocking the connection between the MSK Replicator and your source and target clusters.

  5. It's possible that source or target clusters are not fully available when the MSK Replicator tried to connect to them. This might be due to excessive load, disk usage or CPU usage, which causes the Replicator to be unable to connect to the brokers. Fix the issue with the brokers and retry Replicator creation.

After you have performed the validations above, create the MSK Replicator again.

MSK Replicator appears stuck in the CREATING state

Sometimes MSK Replicator creation can take up to 30 minutes. Wait for 30 minutes and check the state of the Replicator again.

MSK Replicator is not replicating data or replicating only partial data

Follow these steps to troubleshoot data replication problems.

  1. Verify that your Replicator is not running into any authentication errors using the AuthError metric provided by MSK Replicator in CloudWatch. If this metric is above 0, check if the policy of the IAM role you provided for the replicator is valid and there aren't deny permissions set for the cluster permissions. Based on clusterAlias dimension, you can identify if the source or target cluster is experiencing authentication errors.

  2. Verify that your source and target clusters are not experiencing any issues. It is possible that the Replicator is not able to connect to your source or target cluster. This might happen due to too many connections, disk at full capacity or high CPU usage.

  3. Verify that your source and target clusters are reachable from MSK Replicator using the KafkaClusterPingSuccessCount metric in CloudWatch. Based on clusterAlias dimension, you can identify if the source or target cluster is experiencing auth errors. If this metric is 0 or has no datapoint, the connection is unhealthy. You should check network and IAM role permissions that MSK Replicator is using to connect to your clusters.

  4. Verify that your Replicator is not running into failures due to missing topic-level permissions using the ReplicatorFailure metric in CloudWatch. If this metric is above 0, check the IAM role you provided for topic-level permissions.

  5. Verify that the regular expression you provided in the allow list while creating the Replicator matches the names of the topics you want to replicate. Also, verify that the topics are not being excluded from replication due to a regular expression in the deny list.

  6. Note that it may take up to 30 seconds for the Replicator to detect and create the new topics or topic partitions on the target cluster. Any messages produced to the source topic before the topic has been created on the target cluster will not be replicated if replicator starting position is latest (default). Alternatively, you can start replication from the earliest offset in the source cluster topic partitions if you want to replicate existing messages on your topics on the target cluster. See Configure replicator settings and permissions.

Message offsets in the target cluster are different than the source cluster

As part of replicating data, MSK Replicator consumes messages from the source cluster and produces them to the target cluster. This can lead to messages having different offsets on your source and target clusters. However, if you have turned on consumer groups offsets syncing during Replicator creation, MSK Replicator will automatically translate the offsets while copying the metadata so that after failing over to the target cluster, your consumers can resume processing from near where they left off in the source cluster.

MSK Replicator is not syncing consumer groups offsets or consumer group does not exist on target cluster

Follow these steps to troubleshoot metadata replication problems.

  1. Verify that your data replication is working as expected. If not, see MSK Replicator is not replicating data or replicating only partial data.

  2. Verify that the regular expression you provided in the allow list while creating the Replicator matches the names of the consumer groups you want to replicate. Also, verify that the consumer groups are not being excluded from replication due to a regular expression in the deny list.

  3. Verify that MSK Replicator has created the topic on the target cluster. It may take up to 30 seconds for the Replicator to detect and create the new topics or topic partitions on the target cluster. Any messages produced to the source topic before the topic has been created on the target cluster will not be replicated if the replicator starting position is latest (default). If your consumer group on the source cluster has only consumed the mesages that have not been replicated by MSK Replicator, the consumer group will not be replicated to the target cluster. After the topic is successfuly created on the target cluster, MSK Replicator will start replicating newly written messages on the source cluster to the target. Once your consumer group starts reading these messages from the source, MSK Replicator will automatically replicate the consumer group to the target cluster. Alternatively, you can start replication from the earliest offset in the source cluster topic partitions if you want to replicate existing messages on your topics on the target cluster. See Configure replicator settings and permissions.

Note

MSK Replicator optimizes consumer groups offset syncing for your consumers on the source cluster which are reading from a position closer to the end of the topic partition. If your consumer groups are lagging on the source cluster, you may see higher lag for those consumer groups on the target as compared to the source. This means after failover to the target cluster, your consumers will reprocess more duplicate messages. To reduce this lag, your consumers on the source cluster would need to catch up and start consuming from the tip of the stream (end of the topic partition). As your consumers catch up, MSK Replicator will automatically reduce the lag.

Replication latency is high or keeps increasing

Here are some common causes for high replication latency.

  1. Verify that you have the right number of partitions on your source and target MSK clusters. Having too few or too many partitions can impact performance. For guidance on choosing the number of partitions, see Best practices for using MSK Replicator. The following table shows the recommended minimum number of partitions for getting the throughput you want with MSK Replicator.

    Throughput and recommended minimum number of partitions
    Throughput (MB/s) Minimum number of partitions required
    50 167
    100 334
    250 833
    500 1666
    1000 3333
  2. Verify that you have enough read and write capacity in your source and target MSK clusters to support the replication traffic. MSK Replicator acts as a consumer for your source cluster (egress) and as a producer for your target cluster (ingress). Therefore, you should provision cluster capacity to support the replication traffic in addition to other traffic on your clusters. See Best practices for using MSK Replicator for guidance on sizing your MSK clusters.

  3. Replication latency might vary for MSK clusters in different source and destination AWS Region pairs, depending on how geographically far apart the clusters are from each other. For example, Replication latency is typically lower when replicating between clusters in the Europe (Ireland) and Europe (London) Regions compared to replication between clusters in the Europe (Ireland) and Asia Pacific (Sydney) Regions.

  4. Verify that your Replicator is not getting throttled due to overly aggressive quotas set on your source or target clusters. You can use the ThrottleTime metric provided by MSK Replicator in CloudWatch to see the average time in milliseconds a request was throttled by brokers on your source/target cluster. If this metric is above 0, you should adjust Kafka quotas to reduce throttling so that Replicator can catch-up. See Managing MSK Replicator throughput using Kafka quotas for information on managing Kafka quotas for the Replicator.

  5. ReplicationLatency and MessageLag might increase when an AWS Region becomes degraded. Use the AWS Service Health Dashboard to check for an MSK service event in the Region where your primary MSK cluster is located. If there's a service event, you can temporarily redirect your application reads and writes to the other Region.