Cross-Region data replication metrics in Amazon CloudWatch - Amazon MQ

Cross-Region data replication metrics in Amazon CloudWatch

The Amazon MQ for ActiveMQ cross-Region data replication (CRDR) feature offers metrics for maintaining the reliability, availability, and performance of your primary and replica brokers. During the replication process, a replica broker in a secondary Region receives asynchronously replicated data from the primary broker in the primary Region. If the primary broker in the primary Region fails, you can promote the replica broker in the secondary Region to primary by initiating a switchover or failover. For instructions on viewing metrics in Amazon CloudWatch, see Accessing CloudWatch metrics for Amazon MQ.

CRDR timestamps

The following timestamps describe how the metrics found in Amazon CloudWatch are calculated. There are five timestamps in the data replication process:

  • Time of current observation (TCO): The current instant in time.

  • Time of creation (TC): The instant in time an event was created on the replication queue by the primary broker. Available on both primary and replica brokers.

  • Time of delivery (TD): The instant in time an event was successfully delivered to the replica broker. Only available on replica brokers.

  • Time of processing (TP): The instant in time an event was successfully processed by the replica broker. Only available on replica brokers.

  • Time of acknowledgement (TA): The instant in time an event was successfully acknowledged by the primary broker. Only available on primary brokers.

Estimate switchover/failover performance with CRDR CloudWatch metrics

Amazon MQ enables metrics for your broker by default. You can view your broker metrics by accessing the Amazon CloudWatch console, or by using the CloudWatch API. The following metrics are useful for understanding the replication and switchover/failover performance of your CRDR brokers:

Amazon MQ CloudWatch metric   Reason for CRDR use
---------------------------   -------------------
TotalReplicationLag           The estimated time between TA and TC of the last acknowledged event on the primary broker.
ReplicationLag                The estimated time between TP and TC of the last processed event on the replica broker.
PrimaryWaitTime               The estimated time between TCO and TC of the last processed event on the primary broker.
ReplicaWaitTime               The estimated time between TCO and TP of the last processed event on the replica broker.
QueueSize                     The total number of unacknowledged events in the replication queue on the primary broker.
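
The relationship between the timestamps and the time-based metrics can be sketched in Python. This is only an illustration of the differences named in the table, not how Amazon MQ computes the metrics internally. The `EventTimestamps` class, its field names, and the use of milliseconds are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EventTimestamps:
    """Hypothetical per-event timestamps, in milliseconds."""
    tc: int                    # time of creation (primary and replica)
    td: Optional[int] = None   # time of delivery (replica only)
    tp: Optional[int] = None   # time of processing (replica only)
    ta: Optional[int] = None   # time of acknowledgement (primary only)


def total_replication_lag(event: EventTimestamps) -> int:
    """TA - TC, as observed on the primary broker."""
    if event.ta is None:
        raise ValueError("event has no acknowledgement timestamp")
    return event.ta - event.tc


def replication_lag(event: EventTimestamps) -> int:
    """TP - TC, as observed on the replica broker."""
    if event.tp is None:
        raise ValueError("event has no processing timestamp")
    return event.tp - event.tc


def primary_wait_time(event: EventTimestamps, tco: int) -> int:
    """TCO - TC, as observed on the primary broker."""
    return tco - event.tc


def replica_wait_time(event: EventTimestamps, tco: int) -> int:
    """TCO - TP, as observed on the replica broker."""
    if event.tp is None:
        raise ValueError("event has no processing timestamp")
    return tco - event.tp


# Made-up timestamps for a single replicated event.
event = EventTimestamps(tc=1_000, td=1_200, tp=1_350, ta=1_500)
print(total_replication_lag(event))         # 500
print(replication_lag(event))               # 350
print(primary_wait_time(event, tco=2_000))  # 1000
print(replica_wait_time(event, tco=2_000))  # 650
```

Because the wait-time metrics are measured against TCO, they keep increasing while no new event is created or processed, whereas the lag metrics only change when a new event is acknowledged or processed.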

TotalReplicationLag and ReplicationLag describe the replication delay between the primary and replica brokers. You can also use the two metrics to estimate the time until an ongoing switchover or failover operation completes.
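
The lag metrics can also be read through the CloudWatch API. The following is a minimal sketch using the boto3 `get_metric_statistics` call. It assumes the metrics are published in the `AWS/AmazonMQ` namespace with a `Broker` dimension; the helper names, the broker name, the Region, and the 30-minute window are hypothetical choices for the example.

```python
from datetime import datetime, timedelta, timezone


def build_lag_query(broker_name, metric_name="TotalReplicationLag", minutes=30):
    """Build keyword arguments for CloudWatch GetMetricStatistics."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/AmazonMQ",
        "MetricName": metric_name,
        # Assumption: CRDR metrics use the broker name as the "Broker" dimension.
        "Dimensions": [{"Name": "Broker", "Value": broker_name}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 60,                # one datapoint per minute
        "Statistics": ["Maximum"],   # worst-case lag in each period
    }


def fetch_lag(broker_name, region_name, metric_name="TotalReplicationLag"):
    """Return the datapoints for one lag metric, oldest first."""
    import boto3  # imported here so build_lag_query works without boto3

    cloudwatch = boto3.client("cloudwatch", region_name=region_name)
    response = cloudwatch.get_metric_statistics(
        **build_lag_query(broker_name, metric_name)
    )
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])


# Example (requires AWS credentials; names are placeholders):
# for point in fetch_lag("my-primary-broker", "us-east-1"):
#     print(point["Timestamp"], point["Maximum"])
```

TotalReplicationLag would be queried in the primary broker's Region and ReplicationLag in the replica broker's Region, since each metric is reported by the broker that observes it.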

PrimaryWaitTime and ReplicaWaitTime can be used to identify ongoing issues with the replication process. If the value of either metric grows continuously, the replication process might be degraded or paused. Slow replication can result from issues such as network partitioning, broker starts, and long recovery.
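
One simple way to spot a continuously growing wait time is to check whether the most recent samples are strictly increasing. The sketch below is an illustrative heuristic, not an Amazon MQ or CloudWatch feature; the function name and the `min_points` threshold of five samples are assumptions, and the sample values are made up.

```python
def is_steadily_growing(values, min_points=5):
    """Return True if the last `min_points` samples are strictly increasing."""
    if len(values) < min_points:
        return False  # not enough data to judge a trend
    recent = values[-min_points:]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


# Hypothetical PrimaryWaitTime samples, oldest first.
healthy = [100, 300, 120, 110, 130, 90, 95]
degraded = [100, 90, 120, 180, 260, 400, 650]

print(is_steadily_growing(healthy))   # False
print(is_steadily_growing(degraded))  # True
```

A strict-increase check is sensitive to a single dip, so a longer window or a comparison of averages may suit noisy metrics better.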