Appendix 2: Testing on SLES Setup

Test Case 1: Manual Failover

Procedure: Run the command crm resource move <Db2 primary resource name> force to move the primary Db2 instance to the standby node.

     dbprim00:  crm resource move msl_db2_db2stj_STJ force
     INFO: Move constraint created for rsc_db2_db2stj_STJ

Expected result: The Db2 primary database is moved from the primary node (dbprim00) to the standby node (dbsec00).

     dbprim00:~  crm status
     Stack: corosync
     Current DC: dbsec00 (version 1.1.19+20181105.ccd6b5b10-3.16.1-1.1.19+20181105.ccd6b5b10) - partition with quorum
     Last updated: Sat Apr 25 19:03:20 2020
     Last change: Sat Apr 25 19:02:26 2020 by root via crm_resource on dbprim00

     2 nodes configured
     4 resources configured

     Online: [ dbprim00 dbsec00 ]

     Full list of resources:

     res_AWS_STONITH        (stonith:external/ec2): Started dbsec00
     Master/Slave Set: msl_db2_db2stj_STJ [rsc_db2_db2stj_STJ]
         Masters: [ dbsec00 ]
         Stopped: [ dbprim00 ]
     res_AWS_IP     (ocf::suse:aws-vpc-move-ip):    Started dbsec00

Follow-up actions: Log in to the new standby node as db2<sid> and start the Db2 instance as standby. Then, logged in as root, clean up the failed resource action (see the command after the listing below).

     db2stj> db2start
     04/25/2020 19:05:27     0   0   SQL1063N  DB2START processing was successful.
     SQL1063N  DB2START processing was successful.

     db2stj> db2 start hadr on database STJ as standby
     DB20000I  The START HADR ON DATABASE command completed successfully.
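
To clear the failed resource action as root, the cleanup command used in the later test cases can be scoped to the Db2 resource (a minimal sketch; running crm resource cleanup without a resource name also works):

     dbprim00:~  crm resource cleanup msl_db2_db2stj_STJ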

Remove location constraint: When you move a resource with a manual command, the cluster creates a location constraint on the node that the resource was moved away from (in this case, the original primary node). This constraint prevents the Db2 resource from running on that node, even in standby mode.
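
To verify the constraint before removing it, you can list the location constraints in the cluster configuration (a sketch; the constraint created by crm resource move typically carries a cli- prefix):

     dbprim00:~  crm configure show | grep -i location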

Use the following command to remove the location constraint.

       dbprim00:  crm resource clear msl_db2_db2stj_STJ

Once the constraint is removed, the standby instance starts automatically.

       dbprim00:  crm status
     Stack: corosync
     Current DC: dbsec00 (version 1.1.19+20181105.ccd6b5b10-3.16.1-1.1.19+20181105.ccd6b5b10) - partition with quorum
     Last updated: Sat Apr 25 19:05:29 2020
     Last change: Sat Apr 25 19:05:18 2020 by root via crm_resource on dbprim00

     2 nodes configured
     4 resources configured

     Online: [ dbprim00 dbsec00 ]

     Full list of resources:

     res_AWS_STONITH        (stonith:external/ec2): Started dbsec00
     Master/Slave Set: msl_db2_db2stj_STJ [rsc_db2_db2stj_STJ]
          Masters: [ dbsec00 ]
         Slaves: [ dbprim00 ]
     res_AWS_IP     (ocf::suse:aws-vpc-move-ip):    Started dbsec00

Test Case 2: Shut Down the Primary EC2 Instance

Procedure: Using the AWS Management Console or the AWS CLI, stop the EC2 instance that runs the primary database (dbprim00) to simulate an EC2 failure.
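
With the AWS CLI, the stop call looks like the following sketch (the instance ID is a placeholder for the instance that hosts dbprim00):

     # Placeholder instance ID; substitute the ID of the dbprim00 instance.
     aws ec2 stop-instances --instance-ids i-0123456789abcdef0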

Expected result: The Db2 primary database fails over to the standby node (dbsec00).

     dbsec00:~  crm status
     Stack: corosync
     Current DC: dbsec00 (version 1.1.19+20181105.ccd6b5b10-3.16.1-1.1.19+20181105.ccd6b5b10) - partition with quorum
     Last updated: Sat Apr 25 19:19:32 2020
     Last change: Sat Apr 25 19:18:16 2020 by root via crm_resource on dbprim00

     2 nodes configured
     4 resources configured

     Online: [ dbsec00 ]
     OFFLINE: [ dbprim00 ]

     Full list of resources:

     res_AWS_STONITH        (stonith:external/ec2): Started dbsec00
     Master/Slave Set: msl_db2_db2stj_STJ [rsc_db2_db2stj_STJ]
         Masters: [ dbsec00 ]
         Stopped: [ dbprim00 ]
     res_AWS_IP     (ocf::suse:aws-vpc-move-ip):    Started dbsec00

Follow-up action: Start the EC2 instance. The Db2 standby instance should then start on dbprim00.
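
With the AWS CLI, the instance can be started again as sketched below (placeholder instance ID); once the node rejoins the cluster, crm status should show dbprim00 online as a slave:

     # Placeholder instance ID; substitute the ID of the dbprim00 instance.
     aws ec2 start-instances --instance-ids i-0123456789abcdef0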

Test Case 3: Stop the Db2 Instance on the Primary Instance

Procedure: Log in to the Db2 primary instance (dbprim00) as db2<sid> (db2stj) and run db2stop force.

     db2stj> db2stop force
     02/12/2020 12:40:03     0   0   SQL1064N  DB2STOP processing was successful.
     SQL1064N  DB2STOP processing was successful.

Expected result: The cluster restarts the Db2 primary database on the primary instance (dbprim00). The standby remains on the standby instance (dbsec00). A failed resource action is recorded.

     dbsec00:~  crm status
     Stack: corosync
     Current DC: dbsec00 (version 1.1.19+20181105.ccd6b5b10-3.16.1-1.1.19+20181105.ccd6b5b10) - partition with quorum
     Last updated: Sat Apr 25 19:29:38 2020
     Last change: Sat Apr 25 19:23:04 2020 by root via crm_resource on dbprim00

     2 nodes configured
     4 resources configured

     Online: [ dbprim00 dbsec00 ]

     Full list of resources:

     res_AWS_STONITH        (stonith:external/ec2): Started dbprim00
     Master/Slave Set: msl_db2_db2stj_STJ [rsc_db2_db2stj_STJ]
         Masters: [ dbprim00 ]
         Slaves: [ dbsec00 ]
     res_AWS_IP     (ocf::suse:aws-vpc-move-ip):    Started dbprim00

    Failed Resource Actions:
    * rsc_db2_db2stj_STJ_demote_0 on dbprim00 'unknown error' (1): call=74, status=complete, exitreason='',
    last-rc-change='Sat Apr 25 19:27:21 2020', queued=0ms, exec=175ms

Follow-up action: Clear the failed resource action.

     dbsec00:~  crm resource cleanup
     Waiting for 1 reply from the CRMd. OK

Test Case 4: End the Db2 Process (db2sysc) on the Node that Runs the Primary Database

Procedure: Log in to the Db2 primary instance as root and run ps -ef | grep db2sysc. Note the PID of the db2sysc process and then end it.

     dbprim00:~  ps -ef|grep db2sysc
     db2stj   11690 11688  0 19:27 ?        00:00:02 db2sysc 0
     root     15814  4907  0 19:31 pts/0    00:00:00 grep --color=auto db2sysc
     [root@dbprim00 ~] kill -9 11690

Expected result: The Db2 primary database is restarted on the primary instance. The standby remains on the standby instance. A failed resource action is recorded.

     dbsec00:~  crm status
     Stack: corosync
     Current DC: dbsec00 (version 1.1.19+20181105.ccd6b5b10-3.16.1-1.1.19+20181105.ccd6b5b10) - partition with quorum
     Last updated: Sat Apr 25 19:29:38 2020
     Last change: Sat Apr 25 19:23:04 2020 by root via crm_resource on dbprim00

     2 nodes configured
     4 resources configured

     Online: [ dbprim00 dbsec00 ]

     Full list of resources:

     res_AWS_STONITH        (stonith:external/ec2): Started dbprim00
     Master/Slave Set: msl_db2_db2stj_STJ [rsc_db2_db2stj_STJ]
         Masters: [ dbprim00 ]
         Slaves: [ dbsec00 ]
     res_AWS_IP     (ocf::suse:aws-vpc-move-ip):    Started dbprim00

     Failed Resource Actions:
     * rsc_db2_db2stj_STJ_demote_0 on dbprim00 'unknown error' (1): call=74, status=complete, exitreason='',
    last-rc-change='Sat Apr 25 19:27:21 2020', queued=0ms, exec=175ms

Follow-up action: Clear the failed resource action.

     dbsec00:~  crm resource cleanup
     Waiting for 1 reply from the CRMd. OK

Test Case 5: End the Db2 Process (db2sysc) on the Node that Runs the Standby Database

Procedure: Log in to the standby DB instance (dbsec00) as root and run ps -ef | grep db2sysc. Note the PID of the db2sysc process and then end it.

     dbsec00:~  ps -ef| grep db2sysc
     db2stj   16245 16243  0 19:23 ?        00:00:04 db2sysc 0
     root     28729 28657  0 19:38 pts/0    00:00:00 grep --color=auto db2sysc
     dbsec00:~  kill -9 16245

Expected result: The db2sysc process is restarted on the standby DB instance. There is a monitoring failure event recorded in the cluster.

     dbsec00:~  crm status
     Stack: corosync
     Current DC: dbsec00 (version 1.1.19+20181105.ccd6b5b10-3.16.1-1.1.19+20181105.ccd6b5b10) - partition with quorum
     Last updated: Sat Apr 25 19:40:23 2020
     Last change: Sat Apr 25 19:23:04 2020 by root via crm_resource on dbprim00

     2 nodes configured
     4 resources configured

     Online: [ dbprim00 dbsec00 ]

     Full list of resources:

     res_AWS_STONITH        (stonith:external/ec2): Started dbprim00
     Master/Slave Set: msl_db2_db2stj_STJ [rsc_db2_db2stj_STJ]
         Masters: [ dbprim00 ]
         Slaves: [ dbsec00 ]
     res_AWS_IP     (ocf::suse:aws-vpc-move-ip):    Started dbprim00

     Failed Resource Actions:
     * rsc_db2_db2stj_STJ_monitor_30000 on dbsec00 'not running' (7): call=387, status=complete, exitreason='',
    last-rc-change='Sat Apr 25 19:39:24 2020', queued=0ms, exec=0ms

Follow-up action: Clear the monitoring error.

     dbsec00:~  crm resource cleanup
     Waiting for 1 reply from the CRMd. OK

Test Case 6: Simulating a Crash of the Node that Runs the Primary Db2

Procedure: Log in to the Db2 primary instance as root, then run echo 'c' > /proc/sysrq-trigger.

     dbprim00:~  echo 'c' > /proc/sysrq-trigger
     Session stopped
         - Press <return> to exit tab
         - Press R to restart session
         - Press S to save terminal output to file

     Network error: Software caused connection abort

Expected result: The primary Db2 database fails over to the standby node (dbsec00). The Db2 resource is in a stopped state on the previous primary node (dbprim00).

     [root@dbsec00 ~] crm status
     Cluster name: db2ha
     Stack: corosync
     Current DC: dbsec00 (version 1.1.18-11.el7_5.4-2b07d5c5a9) - partition with quorum
     Last updated: Fri Feb 21 15:38:43 2020
     Last change: Fri Feb 21 15:33:17 2020 by hacluster via crmd on dbsec00

     2 nodes configured
     4 resources configured

     Online: [ dbprim00 dbsec00 ]

     Full list of resources:

     clusterfence   (stonith:fence_aws):    Started dbsec00
     Master/Slave Set: Db2_HADR_STJ-master [Db2_HADR_STJ]
         Masters: [ dbsec00 ]
         Stopped: [ dbprim00 ]
     db2-oip        (ocf::heartbeat:aws-vpc-move-ip):       Started dbsec00

     Failed Actions:
     * Db2_HADR_STJ_start_0 on dbprim00 'unknown error' (1): call=15, status=complete, exitreason='',
    last-rc-change='Fri Feb 21 15:38:31 2020', queued=0ms, exec=7666ms


     Daemon Status:
        corosync: active/enabled
        pacemaker: active/enabled
        pcsd: active/enabled

Follow-up action: Start the EC2 instance and then start Db2 as standby on the recovered node (now the standby instance), as you did in Test Case 1. Then clear the failed resource action.
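
The recovery steps mirror the follow-up actions from Test Case 1 (a sketch; run the Db2 commands as db2stj on the recovered node and the cleanup as root):

     db2stj> db2start
     db2stj> db2 start hadr on database STJ as standby
     dbprim00:~  crm resource cleanup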