Best Practice 11.2 – Define an approach to maintain availability
Maintain availability by having a resilient architecture that can sustain the failure of a single technical component or AWS service. Implement mechanisms, which could include redundant capacity, load balancing, and software clusters.
Suggestion 11.2.1 – Avoid failures due to exhausted resources or service deterioration
Investigate over-provisioning of resources, proactive monitoring of growth, and throttling usage by setting limits.
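As an illustration of proactive growth monitoring, the following minimal sketch projects how many days remain before a file system fills. It assumes you already collect daily usage samples; the function name `days_until_full` and the sample format are hypothetical and not part of any AWS or SAP tooling.

```python
def days_until_full(samples, capacity_gib):
    """Estimate days until a file system fills, from (day, used_gib) samples.

    Fits a least-squares line through the samples to get a growth rate, then
    projects forward from the most recent observation. Returns None when usage
    is flat or shrinking, because no exhaustion date can be projected.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(day for day, _ in samples) / n
    mean_y = sum(used for _, used in samples) / n
    sxx = sum((day - mean_x) ** 2 for day, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one day")
    sxy = sum((day - mean_x) * (used - mean_y) for day, used in samples)
    slope = sxy / sxx  # growth rate in GiB per day
    if slope <= 0:
        return None
    _, last_used = samples[-1]
    return max(0.0, (capacity_gib - last_used) / slope)
```

A linear fit is only a rough guide. Workloads with period-end peaks or archiving runs may need a different model, and the result is best used to trigger a review rather than an automated action.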
The operational excellence pillar covers the different ways in which you can understand the state of your SAP application and ensure that appropriate actions are taken. See [Operational Excellence]: 1 - Design SAP workload to allow understanding and reaction to its state.
The performance pillar provides guidance on right-sizing and scaling capacity. See [Performance]: 16 - Understand ongoing performance and optimization options.
Suggestion 11.2.2 – Have a strategy for scheduled maintenance
If your business has a requirement to minimize scheduled outages, you should develop a strategy for maintenance at all levels – SAP application, database, operating system, and AWS. Consider the following:
- Use of replication and cluster solutions to alternate the primary and secondary node.
- Excess capacity and mechanisms to scale up and down to facilitate rolling outages.
- Use of a live patching approach for the operating system, if possible.
- AWS Documentation: AWS Systems Manager Patch Manager Patch Groups
- SAP Note: 1913302 - HANA: Suspend DB connections for short maintenance tasks [Requires SAP Portal Access]
- SAP Note: 2077934 - Rolling kernel switch in HA environments [Requires SAP Portal Access]
- SAP Note: 953653 - Rolling Kernel Switch [Requires SAP Portal Access]
- SAP Note: 2254173 - Linux: Rolling Kernel Switch in Pacemaker-based NetWeaver HA environments [Requires SAP Portal Access]
You should also evaluate the elastic capabilities of AWS services to reduce the overall downtime of scheduled maintenance by temporarily increasing performance. For example, you could scale up the Amazon EC2 instance running your database to provide more CPU and storage throughput for upgrade activities, or switch your EBS volume type from gp2 to io2 to improve storage throughput during a database reorganization.
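The volume type switch mentioned above could be scripted along the following lines. This is a minimal sketch, not a prescribed procedure: it assumes a boto3 EC2 client is passed in and uses its `modify_volume` call, while the helper names, the volume ID, and the IOPS figure are hypothetical. Requiring an explicit IOPS value for provisioned IOPS types is a choice made in this sketch so that the target performance is deliberate.

```python
def build_volume_modification(volume_id, volume_type, iops=None):
    """Build keyword arguments for the EC2 ModifyVolume API call."""
    if volume_type in ("io1", "io2") and iops is None:
        # Sketch-level rule: make the provisioned IOPS target explicit.
        raise ValueError("specify iops for provisioned IOPS volume types")
    params = {"VolumeId": volume_id, "VolumeType": volume_type}
    if iops is not None:
        params["Iops"] = iops
    return params


def switch_volume_type(ec2_client, volume_id, volume_type, iops=None):
    """Request a volume type change through the supplied EC2 client."""
    params = build_volume_modification(volume_id, volume_type, iops)
    return ec2_client.modify_volume(**params)


# Example usage (requires boto3 and AWS credentials; the ID is a placeholder):
#   import boto3
#   ec2 = boto3.client("ec2")
#   switch_volume_type(ec2, "vol-0123456789abcdef0", "io2", iops=16000)
```

Amazon EBS limits how soon the same volume can be modified again, so check the current EBS documentation before planning to switch back to gp2 straight after the maintenance window.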
Suggestion 11.2.3 – Protect SAP single points of failure with software clusters or other mechanisms
You can use a high availability (HA) clustering solution for autonomous failover of SAP single points of failure (SAP Central Services and database) across Availability Zones.
There are multiple SAP-certified clustering solutions listed on the SAP website.
If you choose not to use a clustering solution for your single points of failure, consider scripting or runbooks to minimize the errors associated with restoring services.
Suggestion 11.2.4 – Consider redundant capacity or automatic scaling for components that support it
Evaluate static, dynamic, or scheduled capacity changes to match your usage. Examine the minimum capacity requirements and how they would be impacted by failures and maintenance. Overprovision where appropriate to allow time to recover from failure.
If you need to maintain 100% capacity in the event of an Availability Zone (AZ) failure, consider deploying the application tier across three AZs, each with 50% of the total required capacity, so that the two remaining AZs still provide 100% if one fails.
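The arithmetic behind this sizing can be sketched as follows. The function name `per_az_capacity` and the idea of counting capacity in application server units are assumptions made for illustration.

```python
import math


def per_az_capacity(required_units, az_count, az_failures=1):
    """Capacity to place in each AZ so that the surviving AZs still cover
    the full requirement after `az_failures` AZs are lost."""
    surviving = az_count - az_failures
    if surviving < 1:
        raise ValueError("at least one AZ must survive the failure scenario")
    # Round up so the surviving AZs never fall short of the requirement.
    return math.ceil(required_units / surviving)


# Three AZs: each AZ holds 50% of the requirement (150% deployed in total).
three_az = per_az_capacity(required_units=10, az_count=3)

# Two AZs: each AZ must hold 100% of the requirement (200% deployed in total).
two_az = per_az_capacity(required_units=10, az_count=2)
```

Spreading the tier across three AZs therefore lowers the total over-provisioning needed to tolerate a single AZ failure compared with two AZs.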
In addition to deploying the SAP application server layer across multiple AZs, you could consider scaling solutions such as the one described in the following SAP on AWS blog post, which uses the capabilities of Amazon EC2 Auto Scaling:
- SAP on AWS Blog: Using AWS to enable SAP Application Auto Scaling
- AWS Documentation: Amazon EC2 Instance Types for SAP
- SAP Note: 1656099 - SAP Applications on AWS: Supported DB/OS and Amazon EC2 products [Requires SAP Portal Access]
Suggestion 11.2.5 – Ensure the availability of capacity for all identified failure scenarios
The following are examples of failure scenarios that could be used to guide your analysis. Granularity and coverage of the scenarios, classification, and impact will vary depending on your requirements and architecture.
Failure scenario examples | Comparative Risk of Occurrence |
---|---|
Planned / Controlled Maintenance | Planned |
Resource exhausted or compromised (High CPU utilization / File system full / Out of memory / Storage issues) | Medium |
Distributed stateless component failure (for example, web dispatchers) | Medium |
Distributed stateful component failure (for example, application servers) | Medium |
Single point of failure (Database / SAP Central Services) | Medium |
AZ / Network failure | Low |
Core service failure (DNS / Amazon EFS / API calls) | Low / Medium |
Corruption / Accidental deletion / Malicious activities / Faulty code deployment | Low |
Region failure | Very Low |
Further guidance on capacity reservations is available in [Reliability]: Suggestion 10.2.5 - Investigate strategies for ensuring capacity and in the AWS whitepaper: Architecture Guidance for Availability and Reliability of SAP on AWS.
You can review what Reserved Instances you have available within your AWS account using the Reserved Instances section of the Amazon EC2 console. You can review what On-Demand Capacity Reservations you have available using the Capacity Reservations section of the Amazon EC2 console.
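The same review of On-Demand Capacity Reservations can be done programmatically. The sketch below summarizes a response shaped like the output of the EC2 `DescribeCapacityReservations` API (for example from boto3's `describe_capacity_reservations`); the function name is hypothetical, and it handles a single response page only.

```python
from collections import defaultdict


def available_reserved_capacity(response):
    """Sum unused reserved instances per (instance type, AZ).

    Only reservations in the 'active' state are counted.
    """
    totals = defaultdict(int)
    for reservation in response.get("CapacityReservations", []):
        if reservation.get("State") != "active":
            continue
        key = (reservation["InstanceType"], reservation["AvailabilityZone"])
        totals[key] += reservation["AvailableInstanceCount"]
    return dict(totals)


# Example usage (requires boto3 and AWS credentials):
#   import boto3
#   ec2 = boto3.client("ec2")
#   summary = available_reserved_capacity(ec2.describe_capacity_reservations())
```

Comparing this summary with the capacity each failure scenario requires in each AZ shows where a reservation gap exists.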
Suggestion 11.2.6 – Use AWS services that have inherent availability where applicable
Several AWS services have inherent availability as part of their design and run across multiple Availability Zones for high availability. Some of the relevant services used in an SAP context include:
- AWS Service: Amazon EFS
- AWS Service: Elastic Load Balancing
- AWS Service: Route 53
- AWS Service: AWS Transit Gateway
- AWS Service: Amazon S3
- AWS Service: Amazon FSx
In addition, stateless components, such as bastion hosts or SAProuter, can use Auto Scaling groups to achieve high availability.
Suggestion 11.2.7 – Follow AWS best practices to ensure network connectivity
Evaluate one or more of the following AWS best practices to ensure the resilience of network connectivity to the AWS Region in use:
- AWS Documentation: AWS Direct Connect Resiliency Toolkit
- AWS Documentation: AWS VPN CloudHub
- AWS Documentation: AWS Cloud WAN
If your cluster solution relies on an overlay IP, consider the following to enable access from outside of the VPC:
- AWS Documentation: SAP on AWS High Availability with Overlay IP Address Routing