Menu
Amazon Relational Database Service
User Guide (API Version 2014-10-31)

Managing an Amazon Aurora DB Cluster

The following sections discuss managing performance, scaling, fault tolerance, backup, and restoring for an Amazon Aurora DB cluster.

Managing Performance and Scaling for Aurora DB Cluster

Scaling Storage

Aurora storage automatically scales with the data in your cluster volume. As your data grows, your cluster volume storage will grow in 10 gigabyte (GB) increments up to 64 TB. The size of your cluster volume is checked on an hourly basis to determine your storage costs. For pricing information, see the Amazon RDS product page.

Scaling Aurora DB Instances

You can scale Aurora DB instances in two ways, instance scaling and read scaling.

Instance Scaling

You can scale your DB cluster as needed by modifying the DB instance class for each DB instance in the cluster. Aurora supports several DB instance classes optimized for Aurora. The following table describes the instance class specifications.

Instance ClassvCPUECUMemory (GB)

db.r3.large

2

6.515

db.r3.xlarge

4

1330.5

db.r3.2xlarge

8

2661

db.r3.4xlarge

16

52122

db.r3.8xlarge

32

104244

Read Scaling

You can achieve read scaling for your Aurora DB cluster by creating up to 15 Aurora Replicas in the DB cluster. Each Aurora Replica returns the same data from the cluster volume with minimal replica lag—usually considerably less than 100 milliseconds after the primary instance has written an update. As your read traffic increases, you can create additional Aurora Replicas and connect to them directly to distribute the read load for your DB cluster. Aurora Replicas don't have to be of the same DB instance class as the primary instance.

Maximum Connections to an Aurora DB Instance

The maximum number of connections allowed to an Aurora DB instance is determined by the max_connections parameter in the instance-level parameter group for the DB instance. By default, this value is set to the following equation (the log function represents log base 2): log(DBInstanceClassMemory/8187281408)*1000. Setting the max_connections parameter to this equation makes sure that the number of allowed connection scales well with the size of the instance. For example, suppose your DB instance class is db.r3.xlarge, which has 30.5 gigabytes (GB) of memory. Then the maximum connections allowed is 2000, as shown in the following equation:

log( (30.5 * 1073741824) / 8187281408 ) * 1000 = 2000

You can increase the maximum number of connections to your Aurora DB instance by scaling the instance up to a DB instance class with more memory, or by setting a larger value for the max_connections parameter, up to 16,000.

Fault Tolerance for an Aurora DB Cluster

An Aurora DB cluster is fault tolerant by design. The cluster volume spans multiple Availability Zones in a single region, and each Availability Zone contains a copy of the cluster volume data. This functionality means that your DB cluster can tolerate a failure of an Availability Zone without any loss of data and only a brief interruption of service.

If the primary instance in a DB cluster fails, Aurora automatically fails over to a new primary instance in one of two ways:

  • By promoting an existing Aurora Replica to the new primary instance

  • By creating a new primary instance

If the DB cluster has one or more Aurora Replicas, then an Aurora Replica is promoted to the primary instance during a failure event. A failure event results in a brief interruption, during which read and write operations fail with an exception. However, service is typically restored in less than 120 seconds, and often less than 60 seconds. To increase the availability of your DB cluster, we recommend that you create at least one or more Aurora Replicas in two or more different Availability Zones.

You can customize the order in which your Aurora Replicas are promoted to the primary instance after a failure by assigning each replica a priority. Priorities range from 0 for the highest priority to 15 for the lowest priority. If the primary instance fails, Amazon RDS promotes the Aurora Replica with the highest priority to the new primary instance. You can modify the priority of an Aurora Replica at any time. Modifying the priority doesn't trigger a failover.

More than one Aurora Replica can share the same priority, resulting in promotion tiers. If two or more Aurora Replicas share the same priority, then Amazon RDS promotes the replica that is largest in size. If two or more Aurora Replicas share the same priority and size, then Amazon RDS promotes an arbitrary replica in the same promotion tier.

If the DB cluster doesn't contain any Aurora Replicas, then the primary instance is recreated during a failure event. A failure event results in an interruption during which read and write operations fail with an exception. Service is restored when the new primary instance is created, which typically takes less than 10 minutes. Promoting an Aurora Replica to the primary instance is much faster than creating a new primary instance.

Note

Amazon Aurora also supports replication with an external MySQL database, or an RDS MySQL DB instance. For more information, see Replication Between Aurora and MySQL or Between Aurora and Another Aurora DB Cluster.

Backing Up and Restoring an Aurora DB Cluster

The following sections discuss Aurora backups and how to restore your Aurora DB cluster using the AWS Management Console.

Backups

Aurora backs up your cluster volume automatically and retains restore data for the length of the backup retention period. Aurora backups are continuous and incremental so you can quickly restore to any point within the backup retention period. No performance impact or interruption of database service occurs as backup data is being written. You can specify a backup retention period, from 1 to 35 days, when you create or modify a DB cluster.

If you want to retain a backup beyond the backup retention period, you can also take a snapshot of the data in your cluster volume. Storing snapshots incurs the standard storage charges for Amazon RDS. For more information about RDS storage pricing, see Amazon Relational Database Service Pricing.

Because Aurora retains incremental restore data for the entire backup retention period, you only need to create a snapshot for data that you want to retain beyond the backup retention period. You can create a new DB cluster from the snapshot.

Restoring Data

You can recover your data by creating a new Aurora DB cluster from the backup data that Aurora retains, or from a DB cluster snapshot that you have saved. A new copy of a DB cluster created from backup data can be quickly restored to any point in time during your backup retention period. The continuous and incremental nature of Aurora backups during the backup retention period means you don't need to take frequent snapshots of your data in order to improve restore times.

To determine the latest or earliest restorable time for a DB instance, look for the Latest Restorable Time or Earliest Restorable Time values on the RDS console. The latest restorable time for a DB cluster is the most recent point at which you can restore your DB cluster, typically within 5 minutes of the current time. The earliest restorable time specifies how far back within the backup retention period that you can restore your cluster volume.

You can determine when the restore of a DB cluster is complete by checking the Latest Restorable Time or Earliest Restorable Time. The Latest Restorable Time and Earliest Restorable Time values will return NULL until the restore operation is complete. You cannot request a backup or restore operation if the Latest Restorable Time or Earliest Restorable Time values return NULL.

To restore a DB cluster to a specified time using the AWS Management Console

  1. Open the Amazon RDS for Aurora console at https://console.aws.amazon.com/rds.

  2. In the left navigation pane, click Instances. Click to select the primary instance for the DB cluster that you want to restore.

  3. Click Instance Actions, and then click Restore To Point In Time.

    In the Restore DB Cluster window, click to select the Use Custom Restore Time option.

  4. Type the date and time that you want to restore to in the Use Custom Restore Time boxes.

  5. Type a name for the new, restored DB instance in the DB Instance Identifier box.

  6. Click the Launch DB Cluster button to launch the restored DB cluster.

Testing Amazon Aurora Using Fault Injection Queries

You can test the fault tolerance of your Amazon Aurora DB cluster by using fault injection queries. Fault injection queries are issued as SQL commands to an Amazon Aurora instance and they enable you to schedule a simulated occurrence of one of the following events:

  • A crash of the master instance or an Aurora Replica

  • A failure of an Aurora Replica

  • A disk failure

  • Disk congestion

Fault injection queries that specify a crash force a crash of the Amazon Aurora instance. The other fault injection queries result in simulations of failure events, but don't cause the event to occur. When you submit a fault injection query, you also specify an amount of time for the failure event simulation to occur for.

You can submit a fault injection query to one of your Aurora Replica instances by connecting to the endpoint for the Aurora Replica. For more information, see Aurora Endpoints.

Testing an Instance Crash

You can force a crash of an Amazon Aurora instance using the ALTER SYSTEM CRASH fault injection query.

Syntax

ALTER SYSTEM CRASH [ INSTANCE | DISPATCHER | NODE ];

Options

This fault injection query takes one of the following crash types:

  • INSTANCEA crash of the MySQL-compatible database for the Amazon Aurora instance is simulated.

  • DISPATCHERA crash of the dispatcher on the master instance for the Aurora DB cluster. The dispatcher writes updates to the cluster volume for an Amazon Aurora DB cluster is simulated.

  • NODEA crash of both the MySQL-compatible database and the dispatcher for the Amazon Aurora instance is simulated. For this fault injection simulation, the cache is also deleted.

The default crash type is INSTANCE.

Testing an Aurora Replica Failure

You can simulate the failure of an Aurora Replica using the ALTER SYSTEM SIMULATE READ REPLICA FAILURE fault injection query.

An Aurora Replica failure will block all requests to an Aurora Replica or all Aurora Replicas in the DB cluster for a specified time interval. When the time interval completes, the affected Aurora Replicas will be automatically synced up with master instance.

Syntax

ALTER SYSTEM SIMULATE percentage_of_failure PERCENT

READ REPLICA FAILURE [ TO ALL | TO "replica name" ]

FOR INTERVAL quantity [ YEAR | QUARTER | MONTH | WEEK| DAY | HOUR | MINUTE | SECOND ];

Options

This fault injection query takes the following parameters:

  • percentage_of_failureThe percentage of requests to block during the failure event. This value can be a double between 0 and 100. If you specify 0, then no requests are blocked. If you specify 100, then all requests are blocked.

  • Failure type—The type of failure to simulate. Specify TO ALL to simulate failures for all Aurora Replicas in the DB cluster. Specify TO and the name of the Aurora Replica to simulate a failure of a single Aurora Replica. The default failure type is TO ALL.

  • quantityThe amount of time for which to simulate the Aurora Replica failure. The interval is an amount followed by a time unit. The simulation will occur for that amount of the specified unit. For example, 20 MINUTE will result in the simulation running for 20 minutes.

    Note

    Take care when specifying the time interval for your Aurora Replica failure event. If you specify too long of a time interval, and your master instance writes a large amount of data during the failure event, then your Aurora DB cluster might assume that your Aurora Replica has crashed and replace it.

Testing a Disk Failure

You can simulate a disk failure for an Aurora DB cluster using the ALTER SYSTEM SIMULATE DISK FAILURE fault injection query.

During a disk failure simulation, the Aurora DB cluster randomly marks disk segments as faulting. Requests to those segments will be blocked for the duration of the simulation.

Syntax

ALTER SYSTEM SIMULATE percentage_of_failure PERCENT DISK FAILURE

[ IN DISK index | NODE index ]

FOR INTERVAL quantity [ YEAR | QUARTER | MONTH | WEEK| DAY | HOUR | MINUTE | SECOND ];

Options

This fault injection query takes the following parameters:

  • percentage_of_failureThe percentage of the disk to mark as faulting during the failure event. This value can be a double between 0 and 100. If you specify 0, then none of the disk is marked as faulting. If you specify 100, then the entire disk is marked as faulting.

  • DISK index or NODE indexA specific disk or node to simulate the failure event for. If you exceed the range of indexes for the disk or node, you will receive an error that tells you the maximum index value that you can specify.

  • quantityThe amount of time for which to simulate the disk failure. The interval is an amount followed by a time unit. The simulation will occur for that amount of the specified unit. For example, 20 MINUTE will result in the simulation running for 20 minutes.

Testing Disk Congestion

You can simulate a disk failure for an Aurora DB cluster using the ALTER SYSTEM SIMULATE DISK CONGESTION fault injection query.

During a disk congestion simulation, the Aurora DB cluster randomly marks disk segments as congested. Requests to those segments will be delayed between the specified minimum and maximum delay time for the duration of the simulation.

Syntax

ALTER SYSTEM SIMULATE percentage_of_failure PERCENT

DISK CONGESTION BETWEEN minimum AND maximum MILLISECONDS

[ IN DISK index | NODE index ]

FOR INTERVAL quantity [ YEAR | QUARTER | MONTH | WEEK| DAY | HOUR | MINUTE | SECOND ];

Options

This fault injection query takes the following parameters:

  • percentage_of_failureThe percentage of the disk to mark as congested during the failure event. This value can be a double between 0 and 100. If you specify 0, then none of the disk is marked as congested. If you specify 100, then the entire disk is marked as congested.

  • DISK index or NODE indexA specific disk or node to simulate the failure event for. If you exceed the range of indexes for the disk or node, you will receive an error that tells you the maximum index value that you can specify.

  • minimum and maximumThe minimum and maximum amount of congestion delay, in milliseconds. Disk segments marked as congested will be delayed for a random amount of time within the range of the minimum and maximum amount of milliseconds for the duration of the simulation.

  • quantityThe amount of time for which to simulate the disk congestion. The interval is an amount followed by a time unit. The simulation will occur for that amount of the specified time unit. For example, 20 MINUTE will result in the simulation running for 20 minutes.