Best Practice 11.2 – Define an approach to maintain availability

Maintain availability by having a resilient architecture that can sustain the failure of a single technical component or AWS service. Implement mechanisms such as redundant capacity, load balancing, and software clusters.

Suggestion 11.2.1 – Avoid failures due to exhausted resources or service deterioration

Investigate over-provisioning of resources, proactive monitoring of growth, and throttling usage by setting limits.
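As an illustration of proactive growth monitoring, the following minimal sketch projects how many days remain before a file system or volume fills up, based on daily usage samples. The function name `days_until_full`, the sample figures, and the simple linear growth model are assumptions made for this example; a production setup would more likely rely on your monitoring tooling and alarms.

```python
from typing import Optional, Sequence


def days_until_full(used_gib: Sequence[float], capacity_gib: float) -> Optional[float]:
    """Project the days left until capacity is reached.

    used_gib holds one usage sample per day, oldest first. Growth is assumed
    to be linear (average daily change between the first and last sample),
    which is an illustrative simplification. Returns None when usage is flat
    or shrinking.
    """
    if len(used_gib) < 2:
        raise ValueError("need at least two daily samples")
    growth_per_day = (used_gib[-1] - used_gib[0]) / (len(used_gib) - 1)
    if growth_per_day <= 0:
        return None
    remaining = max(capacity_gib - used_gib[-1], 0.0)
    return remaining / growth_per_day


if __name__ == "__main__":
    # Hypothetical samples: a 500 GiB file system growing by 10 GiB per day.
    samples = [400.0, 410.0, 420.0, 430.0]
    print(days_until_full(samples, capacity_gib=500.0))
```

A projection like this could feed an alert threshold, for example warning when fewer than a chosen number of days remain.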

The operational excellence pillar covers the different ways in which you can understand the state of your SAP application and ensure that the appropriate actions are taken. See [Operational Excellence]: 1 - Design SAP workload to allow understanding and reaction to its state.

The performance pillar provides guidance on right-sizing and scaling capacity. See [Performance]: 16 - Understand ongoing performance and optimization options.

Suggestion 11.2.2 – Have a strategy for scheduled maintenance

If your business has a requirement to minimize scheduled outages, you should develop a strategy for maintenance at all levels – SAP application, database, operating system, and AWS. Consider the following:

- Use of replication and cluster solutions to alternate the primary and secondary node.
- Excess capacity and mechanisms to scale up and down to facilitate rolling outages.
- Use of a live patching approach for the operating system, if possible:
  - SUSE Linux Enterprise Live Patching
  - Red Hat: Reducing downtime for SAP HANA (whitepaper)

AWS Documentation: AWS Systems Manager Patch Manager Patch Groups
SAP Note: 1913302 - HANA: Suspend DB connections for short maintenance tasks [Requires SAP Portal Access]
SAP Note: 2077934 - Rolling kernel switch in HA environments [Requires SAP Portal Access]
SAP Note: 953653 - Rolling Kernel Switch [Requires SAP Portal Access]
SAP Note: 2254173 - Linux: Rolling Kernel Switch in Pacemaker-based NetWeaver HA environments [Requires SAP Portal Access]

You should also evaluate the elastic capabilities of AWS services to reduce the overall downtime of scheduled maintenance by temporarily increasing performance. For example, you can scale up the Amazon EC2 instance running your database to provide more CPU and storage throughput for upgrade activities, or switch your EBS volume type from gp2 to io2 to improve storage throughput during a database reorganization.
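The volume type switch mentioned above could be prepared along the following lines. This is a minimal sketch, not a prescribed procedure: the helper `build_modify_volume_request`, the volume ID, and the IOPS value are hypothetical. The helper only assembles parameters for the EC2 ModifyVolume API, and the commented boto3 call assumes that boto3 is installed and AWS credentials are configured.

```python
from typing import Any, Dict, Optional


def build_modify_volume_request(
    volume_id: str,
    volume_type: str,
    iops: Optional[int] = None,
) -> Dict[str, Any]:
    """Assemble keyword arguments for the EC2 ModifyVolume API.

    Only the parameters that are explicitly provided are included, so any
    attribute left out is not part of the request.
    """
    request: Dict[str, Any] = {"VolumeId": volume_id, "VolumeType": volume_type}
    if iops is not None:
        request["Iops"] = iops
    return request


if __name__ == "__main__":
    # Hypothetical volume ID and IOPS value, for illustration only.
    request = build_modify_volume_request("vol-0123456789abcdef0", "io2", iops=10000)
    print(request)
    # With credentials configured, the request could be sent with boto3:
    #   import boto3
    #   boto3.client("ec2").modify_volume(**request)
```

Before relying on this for a maintenance window, check the Amazon EBS documentation for any waiting period between successive modifications of the same volume, because it affects when you can switch back to the original volume type.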

Suggestion 11.2.3 – Protect SAP single points of failure with software clusters or other mechanisms

You can use a high availability (HA) clustering solution for autonomous failover of SAP single points of failure (SAP Central Services and database) across Availability Zones.

There are multiple SAP-certified clustering solutions listed on the SAP website. SAP clustering solutions are supported by the cluster software vendors themselves, not by SAP. SAP only certifies the solution. Any custom-built solution is not certified and will need to be supported by the solution builder.

If you choose not to use a clustering solution for your single points of failure, consider scripting or runbooks to minimize the errors associated with restoring services.

Suggestion 11.2.4 – Consider redundant capacity or automatic scaling for components that support it

Evaluate static, dynamic, or scheduled capacity changes to match your usage. Examine the minimum capacity requirements and how they would be impacted by failures and maintenance. Overprovision where appropriate to allow time to recover from failure.

If you need to maintain 100% capacity in the event of an AZ failure, then you should consider deploying the application tier across three AZs, each with 50% of the total required capacity.
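The arithmetic behind this sizing can be sketched as follows. This is an illustrative calculation only: the function names are hypothetical, and it assumes that capacity is spread evenly across AZs and that the surviving AZs must cover the full requirement on their own.

```python
from fractions import Fraction


def per_az_capacity_share(az_count: int, failed_azs: int = 1) -> Fraction:
    """Share of the required capacity each AZ must hold so that the
    surviving AZs still provide 100% after failed_azs AZs are lost.

    Assumes an even spread of capacity across AZs.
    """
    surviving = az_count - failed_azs
    if surviving < 1:
        raise ValueError("at least one AZ must survive the failure")
    return Fraction(1, surviving)


def total_provisioned(az_count: int, failed_azs: int = 1) -> Fraction:
    """Total provisioned capacity as a multiple of the required capacity."""
    return az_count * per_az_capacity_share(az_count, failed_azs)


if __name__ == "__main__":
    for azs in (2, 3):
        share = float(per_az_capacity_share(azs))
        total = float(total_provisioned(azs))
        print(f"{azs} AZs: {share:.0%} per AZ, {total:.0%} provisioned in total")
```

Under these assumptions, three AZs need 50% of the required capacity each (150% provisioned in total), whereas two AZs need 100% each (200% in total).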

In addition to deploying the SAP Application Server Layer across multiple AZs, you could consider scaling solutions such as the one described in the following SAP on AWS Blog post that leverages the capabilities of Amazon EC2 Auto Scaling.

SAP on AWS Blog: Using AWS to enable SAP Application Auto Scaling
AWS Documentation: Amazon EC2 Instance Types for SAP
SAP Note: 1656099 - SAP Applications on AWS: Supported DB/OS and Amazon EC2 products [Requires SAP Portal Access]

Suggestion 11.2.5 – Ensure the availability of capacity for all identified failure scenarios

The following are examples of failure scenarios that could be used to guide your analysis. Granularity and coverage of the scenarios, classification, and impact will vary depending on your requirements and architecture.

Failure scenario examples	Comparative Risk of Occurrence
Planned / Controlled Maintenance	Planned
Resource exhausted or compromised (High CPU utilization / File system full / Out of memory / Storage issues)	Medium
Distributed stateless component failure (for example, web dispatchers)	Medium
Distributed stateful component failure (for example, application servers)	Medium
Single point of failure (Database / SAP Central Services)	Medium
AZ / Network failure	Low
Core service failure (DNS / Amazon EFS / API calls)	Low / Medium
Corruption / Accidental deletion / Malicious activities / Faulty code deployment	Low
Region failure	Very Low

Further guidance on capacity reservations is available in [Reliability]: Suggestion 10.2.5 - Investigate strategies for ensuring capacity and in the AWS whitepaper: Architecture Guidance for Availability and Reliability of SAP on AWS.

You can review what Reserved Instances you have available within your AWS account using the Reserved Instances section of the Amazon EC2 console. You can review what On-Demand Capacity Reservations you have available using the Capacity Reservations section of the Amazon EC2 console.
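The same review can be scripted. The sketch below summarizes the response of the EC2 DescribeCapacityReservations API by Availability Zone and instance type. The function name `available_capacity` and the sample data are hypothetical; the commented boto3 call assumes that boto3 is installed and AWS credentials are configured, and it ignores pagination for brevity.

```python
from collections import defaultdict
from typing import Any, Dict, Tuple


def available_capacity(response: Dict[str, Any]) -> Dict[Tuple[str, str], int]:
    """Sum the available instance count of active Capacity Reservations
    per (Availability Zone, instance type)."""
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    for reservation in response.get("CapacityReservations", []):
        if reservation.get("State") != "active":
            continue  # skip expired, cancelled, or pending reservations
        key = (reservation["AvailabilityZone"], reservation["InstanceType"])
        totals[key] += reservation["AvailableInstanceCount"]
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical sample shaped like a DescribeCapacityReservations response.
    sample = {
        "CapacityReservations": [
            {"AvailabilityZone": "eu-west-1a", "InstanceType": "r5.4xlarge",
             "AvailableInstanceCount": 2, "State": "active"},
            {"AvailabilityZone": "eu-west-1a", "InstanceType": "r5.4xlarge",
             "AvailableInstanceCount": 1, "State": "active"},
            {"AvailabilityZone": "eu-west-1b", "InstanceType": "r5.4xlarge",
             "AvailableInstanceCount": 4, "State": "expired"},
        ]
    }
    print(available_capacity(sample))
    # With credentials configured, a live response could be obtained with:
    #   import boto3
    #   response = boto3.client("ec2").describe_capacity_reservations()
```

Comparing such a summary with the capacity each failure scenario requires can highlight Availability Zones where reserved capacity falls short.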

Suggestion 11.2.6 – Use AWS services that have inherent availability where applicable

Several AWS services are designed for inherent availability and run across multiple Availability Zones. Some of the relevant services used in an SAP context include:

AWS Service: Amazon EFS
AWS Service: Elastic Load Balancing
AWS Service: Route 53
AWS Service: AWS Transit Gateway
AWS Service: Amazon S3
AWS Service: Amazon FSx

In addition, stateless components, such as bastion hosts or SAProuter, can use Auto Scaling groups to achieve high availability.

Suggestion 11.2.7 – Follow AWS best practices to ensure network connectivity

Evaluate one or more of the following AWS best practices to ensure the resilience of network connectivity to the AWS Region in use:

AWS Documentation: AWS Direct Connect Resiliency Toolkit
AWS Documentation: AWS VPN CloudHub
AWS Documentation: AWS Cloud WAN

If your cluster solution relies on an overlay IP, consider the following to enable access from outside of the VPC:

AWS Documentation: SAP on AWS High Availability with Overlay IP Address Routing

