AWS CloudHSM Classic
User Guide

This is the user guide for AWS CloudHSM Classic. For the latest version, see the AWS CloudHSM User Guide.

Best Practices for High Availability and Load Balancing

AWS recommends the following best practices for high availability (HA) and load balancing of your HSM appliances.

General Best Practices

  • When multiple AWS CloudHSM Classic clients share an HA group, configure each client to select a different primary HA member. This improves fault tolerance and distributes the workload of cryptographic operations more evenly.

For more information, see the SafeNet Luna SA documentation.

Best Practices for Loss and Recovery

High-Availability Recovery

High-availability (HA) recovery is the hands-off resumption of failed HA group members. Before this feature was introduced, HA provided redundancy and performance, but required that a failed or lost group member be reinstated manually. If the HA recovery feature is not switched on, HA still requires manual intervention to reinstate members. A member of an HA group may fail for the following reasons:

  • The HSM appliance loses power, but regains power in less than the two hours that the HSM appliance preserves its activation state.

  • The network connection is lost.

HA recovery works if the following are true:

  • HA autoRecovery is enabled.

  • The HA group has at least two nodes.

  • The HA node is reachable (connected) at startup.

  • The HA node recovery retry limit has not been reached. If the limit is reached or exceeded, manual recovery is the only way to restore the downed connections.

If all HA nodes fail (there are no links from the HSM client), recovery is not possible.

The HA recovery logic in the library makes its first attempt at recovering a failed member when your application makes a call to its HSM appliance (the HA group). In other words, an idle HSM client does not attempt a recovery.

However, a busy HSM client would notice a slight pause every minute as the library attempts to recover dropped HA group members. The attempts continue until the members are reinstated, or until the retry limit has been reached or exceeded and the library stops trying. Therefore, set the retry period according to your normal operational situation; for example, the types and durations of network interruptions you experience.

HA autoRecovery is not on by default. You must explicitly enable it by following the instructions in Enabling Automatic Recovery. For more information about HA and autoRecovery, see the SafeNet Luna SA documentation.

Recovering From the Loss of a Subset of High-Availability Members

If there is a loss of a subset of HA members, AWS recommends the following procedure to recover group members.

When you are notified by AWS that the connection has been recovered, execute the following command to reintroduce disconnected members to the HA group:

vtl haAdmin recover -group <ha_group_label>

AWS also recommends retrying the connection for a short period of time, so that any disconnections caused by transient network outages can be automatically recovered. For example, retry the connection 5 times, at an interval of one try every minute, as shown below.

vtl haAdmin autoRecovery -interval 60
vtl haAdmin autoRecovery -retry 5
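To judge whether a setting covers the network interruptions you typically see, it can help to estimate how long the library keeps trying. The following is a minimal sketch, assuming the retry window is roughly the interval multiplied by the retry count, and that a retry count of -1 means retrying until the connection succeeds (as in the settings later in this section). The function name ha_retry_window is hypothetical and is not part of the vtl tool.

```shell
#!/bin/sh
# Estimate how long HA autoRecovery keeps retrying a lost member.
# Assumption: window ~= interval (seconds) * retry count; -1 means no limit.
# ha_retry_window is a hypothetical helper, not a vtl subcommand.
ha_retry_window() {
    interval=$1
    retry=$2
    if [ "$retry" -eq -1 ]; then
        echo "unbounded"
    else
        echo $((interval * retry))
    fi
}

ha_retry_window 60 5     # prints 300 (about five minutes of retries)
ha_retry_window 180 -1   # prints unbounded
```

If your usual outages last longer than the estimated window, consider a higher retry count or a longer interval.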

If you don't want to recover the group members manually, but still want to minimize the overhead caused by automatic recovery, use the following steps:

To recover group members and minimize recovery overhead

  • Retry the connection once every 3 minutes, until the connection is successful.

    vtl haAdmin autoRecovery -interval 180
    vtl haAdmin autoRecovery -retry -1
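If you apply these settings from a provisioning script, a small wrapper can keep the two commands on separate lines and reject malformed values before anything runs. The following is a minimal sketch: ha_autorecovery_cmds is a hypothetical helper that only prints the commands and never calls vtl, and its validation rules are assumptions, not limits documented in this guide.

```shell
#!/bin/sh
# Print the two vtl commands for an autoRecovery policy, one per line.
# ha_autorecovery_cmds is a hypothetical helper; it prints, never executes.
# Assumed validation: interval is a positive integer; retry is -1 or >= 0.
ha_autorecovery_cmds() {
    interval=$1
    retry=$2
    case $interval in
        ''|*[!0-9]*) echo "invalid interval: $interval" >&2; return 1 ;;
    esac
    [ "$interval" -gt 0 ] || { echo "invalid interval: $interval" >&2; return 1; }
    case $retry in
        -1) ;;  # retry until the connection succeeds
        ''|*[!0-9]*) echo "invalid retry: $retry" >&2; return 1 ;;
    esac
    echo "vtl haAdmin autoRecovery -interval $interval"
    echo "vtl haAdmin autoRecovery -retry $retry"
}

# Show the commands for the low-overhead policy described above.
ha_autorecovery_cmds 180 -1
```

Review the printed lines before you run them on the HSM client.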

To recover group members with a special cryptographic application

  • For special cryptographic applications, consult SafeNet or AWS on a case-by-case basis.

Recovering From the Loss of All High-Availability Members

If all HA members are lost (that is, there is a complete loss of communication with every member of your HA group), you can use LunaSlotManager.reinitialize(), which does not require you to restart your applications. Alternatively, you can restart your applications and use manual recovery.

For more information about LunaSlotManager.reinitialize(), see LunaProvider: Recovering from the Loss of all HA Members Using LunaSlotManager.reinitialize() in the SafeNet Luna SA Technical Notes.

Important

  • LunaHAStatus.isOK() returns true only when all HA members are present. This method returns false when at least one HA member is missing, and throws an exception when all HA members are missing.

  • The HA-only option must be enabled to keep the HA slot number unchanged.