Best practices for monitoring AWS CloudHSM - AWS Prescriptive Guidance

Best practices for monitoring AWS CloudHSM

Monitoring best practices can improve the security, performance, and reliability of your AWS CloudHSM cluster. Follow the AWS CloudHSM monitoring best practices in the AWS CloudHSM documentation. In addition, the following guidelines can help you monitor your AWS CloudHSM resources effectively:

  • Establish baselines – Set baseline metrics and thresholds for each workload, such as the number of expected session keys, token keys, and open sessions.

  • Alert when workload outcomes are at risk – Configure Amazon CloudWatch alarms to initiate notifications when workload metrics exceed established thresholds. These alerts should notify you of operational or performance issues or potential security risks, such as malicious activity.

  • Review and revise metrics – Regularly review the monitoring data to assess whether your established metrics still reflect the health and performance of your AWS CloudHSM cluster. Adjust the metrics and thresholds as workload patterns and operational requirements evolve, so that alarms continue to fire on conditions that matter.
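
As a rough illustration of the first two guidelines, the following Python sketch derives a threshold from observed baseline samples and builds the arguments for a CloudWatch alarm. It is a minimal sketch under stated assumptions, not an official implementation: the namespace AWS/CloudHSM, the metric name, and the dimension names used in the usage note are assumptions to verify against the AWS CloudHSM metrics documentation, the mean-plus-three-standard-deviations rule is only one possible way to set a threshold, and the helper names (baseline_threshold, build_alarm, apply_alarm) are hypothetical.

```python
import statistics

# Assumption: CloudHSM publishes its metrics under this CloudWatch namespace.
NAMESPACE = "AWS/CloudHSM"


def baseline_threshold(samples, sigmas=3.0):
    """Return mean + sigmas * sample standard deviation of the baseline."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate a baseline")
    return statistics.mean(samples) + sigmas * statistics.stdev(samples)


def build_alarm(name, metric_name, dimensions, samples, topic_arn, sigmas=3.0):
    """Build keyword arguments for the CloudWatch PutMetricAlarm operation.

    dimensions is a dict such as {"ClusterId": "..."}; the exact dimension
    names must match how the metric is actually published.
    """
    return {
        "AlarmName": name,
        "Namespace": NAMESPACE,
        "MetricName": metric_name,
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "Statistic": "Maximum",
        "Period": 300,              # evaluate in 5-minute periods
        "EvaluationPeriods": 3,     # require 3 consecutive breaching periods
        "Threshold": baseline_threshold(samples, sigmas),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],  # for example, an Amazon SNS topic ARN
    }


def apply_alarm(alarm):
    """Create or update the alarm. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the helpers above work without it

    boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

For example, build_alarm("hsm-open-sessions", "HsmSessionCount", {"ClusterId": "cluster-example"}, [10, 12, 11, 13, 14], topic_arn) produces an alarm definition whose threshold sits three standard deviations above the observed mean; passing the result to apply_alarm would create the alarm. Re-running the calculation with fresh samples is one way to carry out the periodic review described in the third guideline.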

The monitoring and alerting recommendations for AWS CloudHSM in this guide follow the best practices in the following AWS Well-Architected Framework pillars:

  • The operational excellence pillar recommends designing your workload to include telemetry and monitoring. AWS CloudHSM provides the information you need, such as metrics, logs, events, and traces, to understand the state of your resources. For more information, see Monitoring AWS CloudHSM in the AWS CloudHSM documentation. When you operate AWS CloudHSM, you need to understand cluster health, detect operational events, and respond to both planned and unplanned events. AWS provides monitoring tools that help you determine when organizational and business outcomes might be at risk so that you can take appropriate action at the right time.

  • The performance efficiency pillar recommends that you monitor the performance of your resources, such as AWS CloudHSM clusters, by gathering, aggregating, and processing performance-related metrics in near real time. Doing so helps you identify performance degradation and remediate its causes. For example, you might detect hardware security modules (HSMs) in the cluster that are reporting server-side errors. You can raise alarms automatically when measurements fall outside the expected thresholds. We recommend that you use alarms not only for notifications, but also to initiate automated actions in response to the detected events. For example, you can automatically increase the number of HSMs in the cluster.

  • The reliability pillar defines monitoring and alerting as critical to meeting your availability requirements. Your monitoring solution must be able to detect failures effectively. When it detects problems or failures, its primary objective is to alert on those issues. Continuous observability and monitoring are essential for resilient architectures in the cloud. To improve your workloads, you must be able to measure them and understand their state and health. Design principles for automatic recovery from failure, horizontal scalability, and capacity provisioning depend on accurate monitoring and alerting services.

  • The security pillar discusses detection and prevention of unexpected or unwanted configuration changes and of unexpected behavior. By default, AWS CloudHSM collects your HSM audit logs and sends them to Amazon CloudWatch Logs on your behalf. You can use the audit logs to monitor the operations that are performed within each HSM in the cluster.
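
The automated response mentioned under the performance efficiency pillar, adding an HSM when an alarm fires, could be sketched as follows. This is a minimal sketch under stated assumptions rather than a recommended implementation: it assumes the code runs somewhere that is invoked by the alarm (for example, a function subscribed to the alarm's notification topic), the max_hsms cap is a hypothetical safety limit, and the helper names (pick_availability_zone, scale_out) are illustrative. It relies on the DescribeClusters and CreateHsm operations of the AWS CloudHSM API as exposed by the boto3 cloudhsmv2 client.

```python
from collections import Counter

# HSMs in these states no longer contribute capacity to the cluster.
GONE_STATES = {"DELETE_IN_PROGRESS", "DELETED"}


def live_hsms(cluster):
    """Return the HSMs in a DescribeClusters entry that are not being removed."""
    return [h for h in cluster.get("Hsms", []) if h.get("State") not in GONE_STATES]


def pick_availability_zone(cluster):
    """Return the Availability Zone from the cluster's subnet mapping that
    currently hosts the fewest HSMs (ties go to the alphabetically first)."""
    counts = Counter(h["AvailabilityZone"] for h in live_hsms(cluster))
    zones = sorted(cluster["SubnetMapping"])
    return min(zones, key=lambda az: counts[az])


def scale_out(cluster_id, max_hsms=4, client=None):
    """Add one HSM to the cluster unless it already has max_hsms.

    Returns the chosen Availability Zone, or None if no HSM was added.
    max_hsms is a hypothetical cap chosen for this sketch.
    """
    if client is None:
        import boto3  # imported lazily; requires AWS credentials at run time

        client = boto3.client("cloudhsmv2")
    response = client.describe_clusters(Filters={"clusterIds": [cluster_id]})
    cluster = response["Clusters"][0]
    if len(live_hsms(cluster)) >= max_hsms:
        return None
    zone = pick_availability_zone(cluster)
    client.create_hsm(ClusterId=cluster_id, AvailabilityZone=zone)
    return zone
```

Spreading new HSMs across Availability Zones is a design choice made here to support the availability goals of the reliability pillar; a production version would also need error handling and protection against repeated invocations while an HSM is still being created.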