Designing for fault tolerance and quick recovery - Amazon Aurora MySQL Database Administrator’s Handbook

Designing for fault tolerance and quick recovery

In large-scale database operations, you’re statistically more likely to experience issues such as connection interruptions or hardware failures. You must also take operational actions more frequently, such as scaling, adding or removing DB instances, and performing software upgrades.

The only scalable way of addressing this challenge is to assume that issues and changes will occur and design your applications accordingly.

Examples:

  • If Aurora MySQL detects that the primary instance has failed, it can promote a new primary instance and fail over to it, which typically happens within 30 seconds. Your application should be designed to recognize the change quickly and without manual intervention.

  • If you create additional Aurora Replicas in an Aurora DB cluster, your application should automatically recognize the new Aurora Replicas and send traffic to them.

  • If you remove instances from a DB cluster, your application should not try to connect to them.
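One way an application can react to these events without manual intervention is to open a fresh connection on every retry instead of reusing a cached one. The following is a minimal sketch, not a definitive implementation. The helper name `connect_with_retry`, its parameters, and the `connect` callable are hypothetical. It assumes `connect` opens a new connection through a cluster-level endpoint, so each attempt can reach whichever instance currently serves that endpoint.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    max_attempts: int = 6,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds or max_attempts is exhausted.

    `connect` is a hypothetical, caller-supplied function. It is assumed
    to open a brand-new connection each time it is called, so the endpoint
    is looked up again on every attempt rather than cached.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return connect()
        except OSError as exc:  # ConnectionError is a subclass of OSError
            last_error = exc
            if attempt == max_attempts - 1:
                break
            # Exponential backoff, capped at max_delay, plus random jitter
            # so that many clients do not all retry at the same moment.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    raise ConnectionError(
        f"gave up after {max_attempts} attempts"
    ) from last_error
```

In a real application, `connect` would wrap the connection call of your database driver. The backoff values here are illustrative assumptions and should be tuned against the failover times you observe in testing.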

Test your applications extensively and prepare a list of assumptions about how the application should react to database events. Then, experimentally validate the assumptions.

If you don’t follow best practices, database events (for example, failovers, scaling, and software upgrades) might result in longer-than-expected downtime. For example, you might notice that a failover took 30 seconds (per the DB cluster’s event notifications) but the application remained down for much longer.
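To validate such assumptions experimentally, you can measure downtime from the application's point of view and compare it with the duration reported in the cluster's event notifications. The following is a minimal sketch under stated assumptions: `probe_loop`, `longest_outage`, and the `probe` callable are hypothetical names, and `probe` is assumed to run a trivial query through the same connection path the application uses, raising an exception on failure.

```python
import time
from typing import Callable, Iterable, List, Tuple

Sample = Tuple[float, bool]  # (timestamp in seconds, probe succeeded)


def probe_loop(
    probe: Callable[[], None],
    duration: float,
    interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Sample]:
    """Run probe() every `interval` seconds for `duration` seconds.

    `probe` is a hypothetical caller-supplied check, for example a
    function that opens a connection and runs a trivial query.
    """
    samples: List[Sample] = []
    deadline = clock() + duration
    while clock() < deadline:
        try:
            probe()
            ok = True
        except Exception:
            ok = False
        samples.append((clock(), ok))
        sleep(interval)
    return samples


def longest_outage(samples: Iterable[Sample]) -> float:
    """Return the longest stretch, in seconds, from a first failed probe
    to the next successful one. An outage still in progress at the end
    of the samples is counted up to the last timestamp."""
    longest = 0.0
    outage_start = None
    last_ts = None
    for ts, ok in samples:
        last_ts = ts
        if not ok and outage_start is None:
            outage_start = ts
        elif ok and outage_start is not None:
            longest = max(longest, ts - outage_start)
            outage_start = None
    if outage_start is not None and last_ts is not None:
        longest = max(longest, last_ts - outage_start)
    return longest
```

Running `probe_loop` while you trigger a failover in a test environment, then passing the samples to `longest_outage`, gives an application-side figure to set against the failover time from the event notifications. The resolution of the measurement is limited by `interval`.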