Appendix 1: Testing on RHEL Setup

Test Case 1: Manual Failover

Procedure: Use the command pcs resource move <Db2 master resource name>.

     [root@dbprim00 profile] pcs resource move Db2_HADR_STJ-master
     Warning: Creating location constraint cli-ban-Db2_HADR_STJ-master-on-dbprim00 with a score of -INFINITY
     for resource Db2_HADR_STJ-master on node dbprim00.
     This will prevent Db2_HADR_STJ-master from running on dbprim00
     until the constraint is removed. This will be the case even if
     dbprim00 is the last node in the cluster.
     [root@dbprim00 profile]

Expected result: The Db2 primary role is moved from the primary node to the standby node.

     [root@dbprim00 profile] pcs status
     Cluster name: db2ha
     Stack: corosync
     Current DC: dbsec00 (version 1.1.18-11.el7_5.4-2b07d5c5a9) - partition with quorum
     Last updated: Sat Feb  8 08:54:04 2020
     Last change: Sat Feb  8 08:53:02 2020 by root via crm_resource on dbprim00

     2 nodes configured
     4 resources configured

     Online: [ dbprim00 dbsec00 ]

     Full list of resources:

     clusterfence   (stonith:fence_aws):    Started dbprim00
     Master/Slave Set: Db2_HADR_STJ-master [Db2_HADR_STJ]
          Masters: [ dbsec00 ]
          Stopped: [ dbprim00 ]
     db2-oip        (ocf::heartbeat:aws-vpc-move-ip):       Started dbsec00

     Daemon Status:
     corosync: active/enabled
     pacemaker: active/enabled
     pcsd: active/enabled
     [root@dbprim00 profile]

Followup actions: Remove the location constraint.

When you use a manual command to move the resource, a location constraint is created on the node (in this case, the primary node). This constraint prevents the Db2 resource from running on that node, even in standby mode.

To remove the location constraint:

  1. Use the pcs config show command to find the ID of the location constraint, and then remove the constraint with pcs constraint delete:

          [root@dbprim00 profile] pcs config show
          Location Constraints:
            Resource: Db2_HADR_STJ-master
              Disabled on: dbprim00 (score:-INFINITY) (role: Started) (id:cli-ban-Db2_HADR_STJ-master-on-dbprim00)

          [root@dbprim00 profile] pcs constraint delete cli-ban-Db2_HADR_STJ-master-on-dbprim00
  2. Log in as db2<sid> and start the Db2 instance as standby on the new standby node. Then, as root, clean up the resource error. (An optional check of the HADR roles follows these steps.)

         db2stj> db2start
         02/08/2020 09:11:29     0   0   SQL1063N  DB2START processing was successful.
         SQL1063N  DB2START processing was successful.
    
         db2stj> db2 start hadr on database STJ as standby
         DB20000I  The START HADR ON DATABASE command completed successfully.
    
         [root@dbprim00 ~] pcs resource cleanup
         Cleaned up all resources on all nodes
         [root@dbprim00 ~] pcs status
         Cluster name: db2ha
         Stack: corosync
         Current DC: dbsec00 (version 1.1.18-11.el7_5.4-2b07d5c5a9) - partition with quorum
         Last updated: Sat Feb  8 09:13:17 2020
         Last change: Sat Feb  8 09:12:26 2020 by hacluster via crmd on dbprim00
    
         2 nodes configured
         4 resources configured
    
         Online: [ dbprim00 dbsec00 ]
    
         Full list of resources:
    
         clusterfence   (stonith:fence_aws):    Started dbprim00
         Master/Slave Set: Db2_HADR_STJ-master [Db2_HADR_STJ]
              Masters: [ dbsec00 ]
              Slaves: [ dbprim00 ]
         db2-oip        (ocf::heartbeat:aws-vpc-move-ip):       Started dbsec00
    
         Daemon Status:
              corosync: active/enabled
              pacemaker: active/enabled
              pcsd: active/enabled
         [root@dbprim00 ~]
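
Optionally, you can confirm the HADR roles after the cleanup. A minimal check, assuming the database name STJ and the instance owner db2stj used in the preceding examples, is to run db2pd on either node and review the HADR_ROLE, HADR_STATE, and HADR_CONNECT_STATUS fields:

     # Run as the Db2 instance owner; after this test, dbsec00 should report
     # PRIMARY and dbprim00 should report STANDBY.
     db2stj> db2pd -db STJ -hadr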

Test Case 2: Shut Down the Primary EC2 Instance

Procedure: Use the AWS Management Console or the AWS CLI to stop the EC2 instance and simulate an EC2 failure.
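
For example, you could stop the primary node with the AWS CLI; the instance ID below is a placeholder for your own primary node's instance ID:

     # Stop the EC2 instance that runs the primary Db2 database (placeholder instance ID).
     aws ec2 stop-instances --instance-ids i-0123456789abcdef0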

Expected result: The Db2 primary node is moved to the standby server.

     [root@dbsec00 db2stj] pcs status
     Cluster name: db2ha
     Stack: corosync
     Current DC: dbsec00 (version 1.1.18-11.el7_5.4-2b07d5c5a9) - partition with quorum
     Last updated: Sat Feb  8 09:44:16 2020
     Last change: Sat Feb  8 09:31:39 2020 by hacluster via crmd on dbsec00

     2 nodes configured
     4 resources configured

     Online: [ dbsec00 ]
     OFFLINE: [ dbprim00 ]

     Full list of resources:

     clusterfence   (stonith:fence_aws):    Started dbsec00
     Master/Slave Set: Db2_HADR_STJ-master [Db2_HADR_STJ]
          Masters: [ dbsec00 ]
          Stopped: [ dbprim00 ]
     db2-oip        (ocf::heartbeat:aws-vpc-move-ip):       Started dbsec00

     Daemon Status:
          corosync: active/enabled
          pacemaker: active/enabled
          pcsd: active/enabled

Followup action: Start the EC2 instance, and then start Db2 as standby on the new standby node as you did in Test Case 1. The location constraint removal step is not required this time.
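
For example, you could start the stopped instance with the AWS CLI (again, the instance ID is a placeholder), and then run the db2start and db2 start hadr on database STJ as standby commands on that node, as shown in Test Case 1:

     # Start the EC2 instance that was stopped for this test (placeholder instance ID).
     aws ec2 start-instances --instance-ids i-0123456789abcdef0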

Test Case 3: Stop the Db2 Instance on the Primary Instance

Procedure: Log in to the Db2 primary instance as db2<sid> (db2stj) and run db2stop force.

     db2stj> db2stop force
     02/12/2020 12:40:03     0   0   SQL1064N  DB2STOP processing was successful.
     SQL1064N  DB2STOP processing was successful.

Expected result: The Db2 primary role fails over to the standby server. The standby database on the old primary node remains in a stopped state, and there is a failed monitoring action.

     [root@dbsec00 db2stj] pcs status
     Cluster name: db2ha
     Stack: corosync
     Current DC: dbsec00 (version 1.1.18-11.el7_5.4-2b07d5c5a9) - partition with quorum
     Last updated: Wed Feb 12 16:55:56 2020
     Last change: Wed Feb 12 13:58:11 2020 by hacluster via crmd on dbsec00

     2 nodes configured
     4 resources configured

     Online: [ dbprim00 dbsec00 ]

     Full list of resources:

     clusterfence   (stonith:fence_aws):    Started dbsec00
     Master/Slave Set: Db2_HADR_STJ-master [Db2_HADR_STJ]
         Masters: [ dbsec00 ]
         Stopped: [ dbprim00 ]
     db2-oip        (ocf::heartbeat:aws-vpc-move-ip):       Started dbsec00

     Failed Actions:
     * Db2_HADR_STJ_start_0 on dbprim00 'unknown error' (1): call=34, status=complete, exitreason='',
       last-rc-change='Wed Feb 12 16:55:32 2020', queued=1ms, exec=6749ms

     Daemon Status:
        corosync: active/enabled
        pacemaker: active/enabled
        pcsd: active/enabled
     [root@dbsec00 db2stj]

Followup action: Start Db2 as standby on the new standby node, as you did in Test Case 2, and then clear the failed monitoring error.
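
To clear the failed action, you can run pcs resource cleanup again, this time scoped to the Db2 resource; this is a minimal sketch that uses the resource name from this setup:

     # Run as root on either cluster node to clear the failed start action
     # recorded for the Db2 resource.
     pcs resource cleanup Db2_HADR_STJ-master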

Test Case 4: End the Db2 Process (db2sysc) on the Node that Runs the Primary Database

Procedure: Log in to the Db2 primary instance as root and then run ps -ef | grep db2sysc. Note the process ID (PID) and then end it.

     [root@dbprim00 ~] ps -ef|grep db2sysc
     root      5809 30644  0 18:54 pts/1    00:00:00 grep --color=auto
     db2sysc
     db2stj   26982 26980  0 17:12 pts/0    00:00:28 db2sysc 0
     [root@dbprim00 ~] kill -9 26982

Expected result: The Db2 primary role fails over to the standby server. The standby database on the old primary node is in a stopped state.

     [root@dbprim00 ~] pcs status
     Cluster name: db2ha
     Stack: corosync
     Current DC: dbsec00 (version 1.1.18-11.el7_5.4-2b07d5c5a9) - partition with quorum
     Last updated: Wed Feb 12 18:54:50 2020
     Last change: Wed Feb 12 18:53:12 2020 by hacluster via crmd on dbsec00

     2 nodes configured
     4 resources configured

     Online: [ dbprim00 dbsec00 ]

     Full list of resources:

     clusterfence   (stonith:fence_aws):    Started dbsec00
     Master/Slave Set: Db2_HADR_STJ-master [Db2_HADR_STJ]
         Masters: [ dbsec00 ]
         Stopped: [ dbprim00 ]
     db2-oip        (ocf::heartbeat:aws-vpc-move-ip):       Started dbsec00

     Failed Actions:
     * Db2_HADR_STJ_start_0 on dbprim00 'unknown error' (1): call=57, status=complete, exitreason='',
    last-rc-change='Wed Feb 12 18:54:37 2020', queued=0ms, exec=6777ms


     Daemon Status:
        corosync: active/enabled
        pacemaker: active/enabled
        pcsd: active/enabled

Followup action: Start Db2 as standby on the new standby node, as you did in Test Case 2, and then clear the failed monitoring error.

Test Case 5: End the Db2 Process (db2sysc) on the Node that Runs the Standby Database

Procedure: Log in to the Db2 standby instance as root and run ps -ef | grep db2sysc. Note the PID and then end it.

     [root@dbsec00 db2stj] ps -ef|grep db2sysc
     db2stj   24194 24192  1 11:55 pts/1    00:00:01 db2sysc 0
     root     26153  4461  0 11:57 pts/0    00:00:00 grep --color=auto
     db2sysc
     [root@dbsec00 db2stj] kill -9 24194

Expected result: The db2sysc process is restarted on the Db2 standby instance. A monitoring failure event is recorded in the cluster.

     [root@dbprim00 ~] pcs status
     Cluster name: db2ha
     Stack: corosync
     Current DC: dbsec00 (version 1.1.18-11.el7_5.4-2b07d5c5a9) - partition with quorum
     Last updated: Fri Feb 14 11:59:22 2020
     Last change: Fri Feb 14 11:55:54 2020 by hacluster via crmd on dbsec00

     2 nodes configured
     4 resources configured

     Online: [ dbprim00 dbsec00 ]

     Full list of resources:

     clusterfence   (stonith:fence_aws):    Started dbsec00
     Master/Slave Set: Db2_HADR_STJ-master [Db2_HADR_STJ]
         Masters: [ dbprim00 ]
         Slaves: [ dbsec00 ]
     db2-oip        (ocf::heartbeat:aws-vpc-move-ip):       Started dbprim00

     Failed Actions:
     * Db2_HADR_STJ_monitor_20000 on dbsec00 'not running' (7): call=345, status=complete, exitreason='',
    last-rc-change='Fri Feb 14 11:57:57 2020', queued=0ms, exec=0ms


     Daemon Status:
        corosync: active/enabled
        pacemaker: active/enabled
        pcsd: active/enabled


     [root@dbsec00 db2stj] ps -ef|grep db2sysc
     db2stj   26631 26629  1 11:57 ?        00:00:01 db2sysc 0
     root     27811  4461  0 11:58 pts/0    00:00:00 grep --color=auto db2sysc

Followup action: Clear the monitoring error.

Test Case 6: Simulate a Crash of the Node that Runs the Primary Db2

Procedure: Log in to the Db2 primary instance as root and run echo 'c' > /proc/sysrq-trigger.

     [root@dbprim00 ~] echo 'c' > /proc/sysrq-trigger

     Session stopped
         - Press <return> to exit tab
         - Press R to restart session
         - Press S to save terminal output to file

     Network error: Software caused connection abort

Expected result: The primary Db2 database fails over to the standby node. The standby database is in a stopped state on the previous primary node.

     [root@dbsec00 ~] pcs status
     Cluster name: db2ha
     Stack: corosync
     Current DC: dbsec00 (version 1.1.18-11.el7_5.4-2b07d5c5a9) - partition with quorum
     Last updated: Fri Feb 21 15:38:43 2020
     Last change: Fri Feb 21 15:33:17 2020 by hacluster via crmd on dbsec00

     2 nodes configured
     4 resources configured

     Online: [ dbprim00 dbsec00 ]

     Full list of resources:

      clusterfence   (stonith:fence_aws):    Started dbsec00
      Master/Slave Set: Db2_HADR_STJ-master [Db2_HADR_STJ]
         Masters: [ dbsec00 ]
         Stopped: [ dbprim00 ]
      db2-oip        (ocf::heartbeat:aws-vpc-move-ip):       Started dbsec00

     Failed Actions:
     * Db2_HADR_STJ_start_0 on dbprim00 'unknown error' (1): call=15, status=complete, exitreason='',
     last-rc-change='Fri Feb 21 15:38:31 2020', queued=0ms, exec=7666ms


     Daemon Status:
        corosync: active/enabled
        pacemaker: active/enabled
        pcsd: active/enabled

Followup action: Start the EC2 instance, and then start Db2 as standby on the new standby node as you did in Test Case 2. Clear the monitoring error.