Common replication errors
This section describes common replication errors, their possible causes, and potential mitigations.
Replication errors
Agent not seen
This error indicates that the AWS Elastic Disaster Recovery service has lost communication with the AWS Replication Agent. Use the following steps to diagnose the issue.
Not converging
This error message (NOT_CONVERGING) could indicate an inadequate replication speed.
-
Follow the instructions on calculating the required bandwidth.
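The sketch below shows one rough way to do this calculation. It converts a source server's average disk write rate into the minimum sustained bandwidth that replication over TCP port 1500 would need. It assumes that replication can converge only if the link carries at least the average write throughput. The `headroom` multiplier is a hypothetical safety margin, not an AWS-published value.

```python
def required_bandwidth_mbps(avg_write_mb_per_s: float, headroom: float = 1.0) -> float:
    """Return the minimum bandwidth, in megabits per second, needed to keep up with writes.

    avg_write_mb_per_s: average write throughput on the source server, in megabytes per second.
    headroom: hypothetical safety-margin multiplier (1.0 means no margin).
    """
    if avg_write_mb_per_s < 0 or headroom < 1.0:
        raise ValueError("write rate must be non-negative and headroom must be >= 1.0")
    # Decimal units: 1 megabyte = 8 megabits.
    return avg_write_mb_per_s * 8 * headroom


def can_converge(avg_write_mb_per_s: float, available_mbps: float, headroom: float = 1.0) -> bool:
    """Return True if the available bandwidth meets the estimated requirement."""
    return available_mbps >= required_bandwidth_mbps(avg_write_mb_per_s, headroom)
```

For example, a source server that writes an average of 12.5 MB/s needs at least 100 Mbps, so `can_converge(12.5, available_mbps=50)` returns False. Measure the write rate over a representative period. Short bursts can exceed the average, and the lag then grows until the burst is flushed.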
Failback client not seen
This error message (FAILBACK_CLIENT_NOT_SEEN) could indicate that there’s a network connectivity issue and that the Failback Client is unable to communicate with the AWS DRS endpoint. Check network connectivity.
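One basic connectivity check is to try a TCP connection to the service endpoint on port 443. The sketch below is a minimal example that uses only Python's `socket` module. The endpoint name format `drs.<region>.amazonaws.com` is an assumption; confirm the regional endpoint for your Region before you rely on the result.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Try a TCP connection to host:port and report whether the handshake completed."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS resolution failures, refused connections, and timeouts.
        return False
```

For example, `can_reach("drs.us-east-1.amazonaws.com", 443)` run from the network where the Failback Client boots shows whether outbound TCP 443 to that (assumed) endpoint gets through. A successful handshake does not prove that a proxy or inspection appliance leaves the HTTPS traffic intact.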
Snapshot failure
This error message (SNAPSHOTS_FAILURE) indicates that the service is unable to take a consistent snapshot.
This can be caused by:
-
Inadequate IAM permissions – Ensure that you have the required IAM permissions (attached to the required IAM roles).
-
API throttling – Check whether you have activated throttling. If throttling is not activated, check your AWS CloudTrail logs for throttling errors.
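When you check CloudTrail logs for throttling errors, you can scan downloaded log files for records whose `errorCode` indicates throttling. The sketch below assumes the standard CloudTrail log file layout (a JSON object with a `Records` array, often gzip-compressed). The set of throttling error codes is an assumption, because services report throttling under different codes. Extend it as needed.

```python
import gzip
import json
from collections import Counter
from pathlib import Path

# Assumption: error codes treated as throttling. Different AWS services use different codes.
THROTTLING_CODES = {
    "ThrottlingException",
    "Throttling",
    "RequestLimitExceeded",
    "TooManyRequestsException",
}


def load_records(path: str) -> list:
    """Read a CloudTrail log file (plain or .gz JSON) and return its Records list."""
    p = Path(path)
    opener = gzip.open if p.suffix == ".gz" else open
    with opener(p, "rt", encoding="utf-8") as f:
        return json.load(f).get("Records", [])


def count_throttling(records: list) -> Counter:
    """Count throttled API calls, keyed by (eventSource, eventName)."""
    hits = Counter()
    for rec in records:
        if rec.get("errorCode") in THROTTLING_CODES:
            hits[(rec.get("eventSource"), rec.get("eventName"))] += 1
    return hits
```

Grouping by event source and name shows which API (for example, an EC2 snapshot call) was throttled, which narrows down where to look next.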
Unstable network
This error message (UNSTABLE_NETWORK) may indicate that there are network issues. Check your connectivity, then run the network bandwidth test.
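A single successful connection can hide an intermittent problem. As a rough complement to the bandwidth test, the sketch below opens repeated TCP connections and summarizes the failures and connect latency. It measures only connection setup, not throughput. The attempt count and interval defaults are arbitrary choices.

```python
import socket
import time


def probe(host: str, port: int, attempts: int = 10, interval: float = 1.0, timeout: float = 3.0) -> dict:
    """Open repeated TCP connections and summarize failure rate and connect latency."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    latencies = []
    failures = 0
    for i in range(attempts):
        start = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                latencies.append(time.monotonic() - start)
        except OSError:
            failures += 1
        if i < attempts - 1:
            time.sleep(interval)  # Space out probes to sample the network over time.
    return {
        "attempts": attempts,
        "failures": failures,
        "failure_rate": failures / attempts,
        "avg_latency_s": sum(latencies) / len(latencies) if latencies else None,
        "max_latency_s": max(latencies) if latencies else None,
    }
```

A non-zero failure rate, or a maximum latency far above the average, points to an unstable path rather than a consistently blocked port.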
Failed to download replication software to failback client
This error message (FAILED_TO_DOWNLOAD_REPLICATION_SOFTWARE_TO_FAILBACK_CLIENT) may indicate that there are connectivity issues. Check your connectivity to the Amazon S3 endpoint and try again.
If the issue persists, you might have a proxy or a network security appliance filtering your traffic and blocking the software download.
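To see whether a proxy is configured through environment variables on a machine you use to test the download path, you can ask Python's standard-library proxy helpers which proxy they would apply to a given URL. This is a minimal sketch. It detects only proxies set through variables such as `HTTPS_PROXY` and `NO_PROXY`. A transparent proxy or network security appliance will not show up this way.

```python
import urllib.request
from typing import Optional
from urllib.parse import urlsplit


def proxy_for(url: str, proxies: Optional[dict] = None) -> Optional[str]:
    """Return the proxy that environment-variable settings would apply to url, or None.

    proxies: mapping such as {"https": "http://proxy:3128", "no": "localhost"}.
    When omitted, it is read from *_PROXY environment variables.
    """
    if proxies is None:
        proxies = urllib.request.getproxies_environment()
    parts = urlsplit(url)
    # Hosts listed in NO_PROXY / no_proxy bypass the proxy entirely.
    if urllib.request.proxy_bypass_environment(parts.hostname or "", proxies):
        return None
    return proxies.get(parts.scheme)
```

If this returns a proxy for an S3 URL, confirm that the proxy allows the download.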
Failed to configure replication software
This error message (FAILED_TO_CONFIGURE_REPLICATION_SOFTWARE) may appear for multiple reasons. Try again, and if the issue persists, contact AWS Support.
Failed to establish communication with recovery instance
This message (FAILED_TO_ESTABLISH_RECOVERY_INSTANCE_COMMUNICATION) could indicate communication issues. Ensure that the Failback Client is able to communicate with the recovery instance.
If you are using a public network (no VPN, AWS Direct Connect, or similar private connection), ensure that your recovery instance has a public IP address. By default, the AWS DRS launch template deactivates public IP addresses, so recovery instances are launched with private IP addresses only.
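A quick way to confirm the public-IP condition is to check whether any of the recovery instance's addresses is globally routable. The sketch below assumes that you have already collected the instance's IP addresses (for example, from the Amazon EC2 console). It uses the standard `ipaddress` module's `is_global` property.

```python
import ipaddress


def lacks_public_ip(addresses: list) -> bool:
    """Return True if none of the given addresses is globally routable."""
    return not any(ipaddress.ip_address(a).is_global for a in addresses)
```

For example, an instance with only `10.0.1.25` lacks a public address. A Failback Client that reaches it over the public internet would not be able to connect.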
Failed to connect AWS Replication Agent to replication software
This error message (FAILED_TO_PAIR_AGENT_WITH_REPLICATION_SOFTWARE) may indicate a pairing issue. AWS DRS needs to provide the replication server and agent with information to allow them to communicate. Make sure there is network connectivity between the agent, replication server, and the AWS DRS endpoint.
If the issue persists, contact AWS Support.
Failed to establish communication with replication software
This error message (FAILED_TO_ESTABLISH_AGENT_REPLICATOR_SOFTWARE_COMMUNICATION) may suggest that there are network connectivity issues. Make sure you have network connectivity between the agent, the replication server, and the AWS DRS endpoint.
If this message appears during failback, ensure that TCP port 1500 is opened inbound on the recovery instance.
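Several errors in this section come down to a missing port rule. One way to keep those requirements in one place is a small checklist that you compare against the rules actually configured. In the sketch below, the component labels and the `Flow` representation are hypothetical. The ports and directions come from this section: TCP 1500 from the source server to the replication server, TCP 443 from the replication server to the service endpoint, and inbound TCP 1500 on the recovery instance during failback.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    component: str  # Hypothetical label for where the rule must exist.
    direction: str  # "inbound" or "outbound".
    port: int       # TCP port.


# Requirements drawn from this troubleshooting section; labels are illustrative.
REQUIRED = [
    Flow("source server", "outbound", 1500),
    Flow("replication server", "inbound", 1500),
    Flow("replication server", "outbound", 443),
    Flow("recovery instance (failback)", "inbound", 1500),
]


def missing_flows(allowed: set, required: list = REQUIRED) -> list:
    """Return the required flows that do not appear in the allowed set."""
    return [f for f in required if f not in allowed]
```

You would fill `allowed` from your own security group and network ACL review. The check only compares lists, so it cannot detect problems elsewhere on the path.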
Failed to create firewall rules
This error message (Firewall rules creation failed) can have several causes.
-
Ensure that the IAM permission prerequisites are met.
-
Review the replication settings of the associated source server.
Failed to authenticate with service
This error message (FAILED_TO_AUTHENTICATE_WITH_SERVICE) may indicate a communication issue between the replication server and the AWS DRS endpoint on TCP port 443. Check the subnet you selected and ensure that outbound TCP port 443 is open from your replication server.
Failed to create staging disks
This error message (Failed to create staging disks) may indicate that your AWS account is configured to encrypt EBS disks but the IAM user does not have the required permissions to encrypt using the selected KMS key. Ensure that the IAM prerequisites are met.
Failed to pair the replication agent with replication server
This error message (Failed to pair replication agent with replication server) may have multiple causes. Make sure that you have connectivity between the replication agent, the replication server, and the AWS DRS endpoint. If the issue persists, contact AWS Support.
Failed to launch replication server
This error message (FAILED_TO_LAUNCH_REPLICATION_SERVER) indicates that AWS Elastic Disaster Recovery was unable to launch a replication server in the staging area.
Failed to boot replication server
This error message (FAILED_TO_BOOT_REPLICATION_SERVER) indicates that the replication server was launched but failed to boot successfully.
Verify that the staging area subnet has outbound connectivity on TCP port 443 to the AWS Elastic Disaster Recovery regional endpoint.
Check the staging area security group and network ACL settings.
If the issue persists, contact AWS Support.
Failed to attach staging disks
This error message (FAILED_TO_ATTACH_STAGING_DISKS) indicates that AWS Elastic Disaster Recovery was unable to attach the staging disks to the replication server.
Verify that the IAM permissions prerequisites are met, including permissions for Amazon EC2 volume operations.
Check your EBS volume limits in the staging area Region.
If the issue persists, contact AWS Support.
Failed to connect AWS Replication Agent to replication server
This error message (FAILED_TO_CONNECT_AGENT_TO_REPLICATION_SERVER) indicates that the agent on the source server was unable to establish a data replication connection with the replication server over TCP port 1500.
Failed to start data transfer
This error message (FAILED_TO_START_DATA_TRANSFER) indicates that the replication agent and replication server were paired but data transfer could not begin.
Check network connectivity and bandwidth between the source server and the replication server.
Check the replication agent logs for additional details.
If the issue persists, contact AWS Support.
Unknown data replication error
Unknown errors (unknown_error) can have many causes. To try to mitigate the issue, take the following steps:
-
Check connectivity.
-
Check throttling.
-
Check for performance issues on the replication server.
-
Check the network bandwidth between the agent and the replication server.
Replication lag issues
Potential solutions:
-
Make sure that the source server is up and running.
-
Make sure that AWS Elastic Disaster Recovery services are up and running.
-
Make sure that TCP port 1500 is not blocked outbound from the source server to the replication server.
-
If the MAC address of the source server has changed, you must reinstall the AWS Replication Agent.
-
If the source machine was rebooted recently or the AWS Elastic Disaster Recovery services were restarted, the disks are reread, and the lag grows until the reread is finished.
-
If the source machine had a spike of write operations, the lag grows until the AWS Elastic Disaster Recovery service flushes all of the written data to the replication server.
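The last two cases follow the same arithmetic. A backlog (reread disk data or a burst of writes) shrinks only by the margin between replication throughput and the rate at which the source keeps writing. The sketch below estimates the drain time under the simplifying assumption that both rates stay constant. The parameter names are illustrative.

```python
def drain_time_seconds(backlog_mb: float, throughput_mb_s: float, write_rate_mb_s: float) -> float:
    """Estimate seconds until a replication backlog is flushed, assuming constant rates."""
    if backlog_mb <= 0:
        return 0.0
    margin = throughput_mb_s - write_rate_mb_s
    if margin <= 0:
        # New writes arrive at least as fast as data is sent, so the lag keeps growing.
        return float("inf")
    return backlog_mb / margin
```

For example, a 3,600 MB backlog with 20 MB/s of replication throughput and 10 MB/s of ongoing writes drains in about 360 seconds. If the write rate reaches the throughput, the backlog never drains, which matches the NOT_CONVERGING case.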