Dealing with RDQM Cluster Loss – Part 2 – Restoration
By Neil Casey
This two-part blog series walks you through how to deal with an RDQM cluster loss. This second part covers the restoration stage.
INTRODUCTION
Part one of this two-part series discussed options for RDQM disaster recovery and presented a method of restoring queue manager data and logs. The end result was a running stand-alone queue manager which could stand in for the RDQM cluster queue manager until the cluster could be restored.
This second part documents recovering the cluster node members from a backup, removing each node from the cluster, rebuilding the RDQM Cluster, and finally restoring the queue manager to running as an RDQM queue manager.
Some of the material here is based on the IBM Knowledge Center chapter “Migrating a queue manager to become an HA RDQM queue manager”.
RESTORE SERVERS AND REBUILD CLUSTER
This procedure assumes that backups exist for the cluster members, whether they are virtual machines or bare metal servers. It also assumes that you have existing site procedures to restore those backups onto appropriate replacement VMs or servers.
If your site does not have backups available then you will need to build new servers (or VMs). You will then have to build a new RDQM cluster from scratch and follow the IBM instructions to migrate the stand-alone queue manager to the RDQM cluster.
If you have backups of your failed RDQM cluster servers available, then the following procedure can be used to bring them online and rebuild the cluster.
The process begins with the queue manager running on the original survivor node from the RDQM cluster. This node still has settings like the rdqm.ini file in place, but the cluster has been unconfigured.
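Before starting, it is worth confirming that the stand-alone replacement queue manager is still running on the survivor node, for example (illustrative only; the output depends on your environment):

su - mqm -c "dspmq -m RDQMTEST"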
The procedure is a sequence of steps which should be performed carefully, checking each one for successful completion.
1. Restore 1 (and only 1) other cluster node
2. Delete the queue manager definition from the restored node
3. Delete the cluster definition from the restored node
4. Restore the second failed cluster node
5. Delete the queue manager definition from the second restored node
6. Delete the cluster definition from the second restored node
7. Create the cluster
8. Stop the replacement queue manager
9. Create a backup of the log and queue files of the replacement queue manager
10. Review the qm.ini file of the RDQM queue manager
11. Delete the replacement queue manager
12. Create the HA queue manager (on the survivor node where the backup is)
13. Stop the HA queue manager
14. Delete the qmgr and log directories from the DRBD mount
15. Restore the qmgr and log directories from the backups taken in step 9
16. Fix the log path in the qm.ini file
17. Start the queue manager
18. Fix the SSLKEYR location in the queue manager configuration
19. Set the preferred node of the queue manager (normally to node 1)
In the following instructions all commands are performed as the root user. When the ‘mqm’ account should perform the action, ‘su’ is used to switch accounts.
RESTORE 1 (AND ONLY 1) OTHER CLUSTER NODE
Use your site VM or bare metal server recovery procedure to restore one original RDQM cluster node from the failed site. Restore only one node at this stage: the restored nodes still hold out-of-date queue manager data, and if two nodes came back online they could form a quorum and start the queue manager with obsolete state.
DELETE THE QUEUE MANAGER DEFINITION FROM THE RESTORED NODE
This step deletes the queue manager information from the mqs.ini file and removes the LVM logical volume containing the obsolete queue manager data and logs. The logical volume name is the same as that recovered in part 1, but can be checked using lvdisplay.
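If you want to confirm the logical volume name on the restored node before removing it, you can list the volumes in the drbdpool volume group first (illustrative only; the volume name will reflect your own queue manager):

lvs drbdpool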
[root@rdqm1 ~]# su - mqm -c "rmvmqinf RDQMTEST"
IBM MQ configuration information removed.
[root@rdqm1 ~]# lvremove -f /dev/drbdpool/rdqmtest_00
Logical volume "rdqmtest_00" successfully removed
[root@rdqm1 ~]#
DELETE THE CLUSTER DEFINITION FROM THE RESTORED NODE
Remove the node from the cluster so that it can be part of the rebuilt cluster once all nodes have been restored and prepared.
[root@rdqm1 ~]# rdqmadm -u
Unconfiguring the replicated data subsystem.
The replicated data subsystem has been unconfigured on this node.
The replicated data subsystem should be unconfigured on all replicated data nodes.
You need to run the following command on nodes 'rdqm2.localdomain' and 'rdqm3.localdomain':
    rdqmadm -u
[root@rdqm1 ~]#
RESTORE THE SECOND FAILED CLUSTER NODE
This step restores the second failed RDQM cluster node using your site procedures.
DELETE THE QUEUE MANAGER DEFINITION FROM THE SECOND RESTORED NODE
[root@rdqm2 ~]# su - mqm -c "rmvmqinf RDQMTEST"
IBM MQ configuration information removed.
[root@rdqm2 ~]# lvremove -f /dev/drbdpool/rdqmtest_00
Logical volume "rdqmtest_00" successfully removed
[root@rdqm2 ~]#
DELETE THE CLUSTER DEFINITION FROM THE SECOND RESTORED NODE
[root@rdqm2 ~]# rdqmadm -u
Unconfiguring the replicated data subsystem.
The replicated data subsystem has been unconfigured on this node.
The replicated data subsystem should be unconfigured on all replicated data nodes.
You need to run the following command on nodes 'rdqm1.localdomain' and 'rdqm3.localdomain':
    rdqmadm -u
[root@rdqm2 ~]#
CREATE THE CLUSTER
[root@rdqm2 ~]# su - mqm -c "rdqmadm -c"
Configuring the replicated data subsystem.
The replicated data subsystem has been configured on this node.
The replicated data subsystem has been configured on 'rdqm1.localdomain'.
The replicated data subsystem has been configured on 'rdqm3.localdomain'.
The replicated data subsystem configuration is complete.
Command '/opt/mqm/bin/rdqmadm' run with sudo.
[root@rdqm2 ~]#
The cluster is now active again, so the following steps migrate the recovered stand-alone queue manager instance back onto the RDQM cluster. The remaining steps are performed on the node which survived the disaster (rdqm3 in this example).
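Before continuing, you may wish to confirm that all of the nodes have rejoined the HA group. One way to check (shown as an illustration; the exact output depends on your environment) is the rdqmstatus command:

su - mqm -c "rdqmstatus -n"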
STOP THE REPLACEMENT QUEUE MANAGER
[root@rdqm3 bin]# su - mqm -c "endmqm -w RDQMTEST"
Waiting for queue manager 'RDQMTEST' to end.
IBM MQ queue manager 'RDQMTEST' ended.
[root@rdqm3 bin]#
CREATE A BACKUP OF THE LOG AND QUEUE FILES OF THE REPLACEMENT QUEUE MANAGER
The backup files will be stored on the /var/mqm volume, so ensure that there is enough space available first.
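For example, you could compare the space used by the queue manager data and logs with the free space on /var/mqm (adjust the paths to your own queue manager name):

du -sh /var/mqm/qmgrs/RDQMTEST /var/mqm/log/RDQMTEST
df -h /var/mqm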
[root@rdqm3 bin]# mkdir -p /var/mqm/backups
[root@rdqm3 bin]# tar -C /var/mqm/log/ -cvf /var/mqm/backups/RDQMTEST.log.tar RDQMTEST
RDQMTEST/
RDQMTEST/amqhlctl.lfh
RDQMTEST/active/
RDQMTEST/active/S0000002.LOG
RDQMTEST/active/S0000001.LOG
RDQMTEST/active/S0000000.LOG
[root@rdqm3 bin]# tar -C /var/mqm/qmgrs/ -cvf /var/mqm/backups/RDQMTEST.qmgr.tar RDQMTEST
RDQMTEST/
RDQMTEST/qm.ini
RDQMTEST/qmstatus.ini
RDQMTEST/blockaddr.ini
RDQMTEST/startprm/
...
RDQMTEST/namelist/
RDQMTEST/namelist/SYSTEM!DEFAULT!NAMELIST
RDQMTEST/namelist/SYSTEM!QPUBSUB!QUEUE!NAMELIST
RDQMTEST/namelist/SYSTEM!QPUBSUB!SUBPOINT!NAMELIST
[root@rdqm3 bin]#
REVIEW THE QM.INI FILE OF THE RDQM QUEUE MANAGER
[root@rdqm3 bin]# cat /var/mqm/qmgrs//RDQMTEST/qm.ini
...
Log:
   LogPrimaryFiles=3
   LogSecondaryFiles=2
   LogFilePages=4096
   LogType=LINEAR
   LogBufferPages=0
   LogPath=/var/mqm/log/RDQMTEST/
   LogWriteIntegrity=TripleWrite
   LogManagement=Automatic
Service:
   Name=AuthorizationService
   EntryPoints=14
   SecurityPolicy=User
...
[root@rdqm3 bin]#
DELETE THE REPLACEMENT QUEUE MANAGER
[root@rdqm3 bin]# su - mqm -c "dltmqm RDQMTEST"
IBM MQ queue manager 'RDQMTEST' deleted.
[root@rdqm3 bin]#
CREATE THE HA QUEUE MANAGER (ON THE SURVIVOR NODE WHERE THE BACKUP IS)
Just like creating the stand-alone replacement queue manager, you need to use parameter values that match the original configuration of the queue manager; a quick way to read those values back out of the backup is sketched after the table. Ensure that the filesystem size (-fs parameter) is sufficient, ideally at least the value used for the original RDQM queue manager.
| crtmqm parameter | qm.ini value name |
|---|---|
| -lp | LogPrimaryFiles |
| -ls | LogSecondaryFiles |
| -lf | LogFilePages |
| -ll or -lc | LogType |
| -ll or -lla | LogManagement |
| -oa | SecurityPolicy |
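One way to confirm the original values is to read them straight from the qm.ini captured in the backup taken in step 9, for example (tar's -O option writes the file to stdout; adjust the archive name and path to match your own backup):

tar -xOf /var/mqm/backups/RDQMTEST.qmgr.tar RDQMTEST/qm.ini | grep -E 'LogPrimaryFiles|LogSecondaryFiles|LogFilePages|LogType|LogManagement|SecurityPolicy'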
[root@rdqm3 bin]# su - mqm -c "crtmqm -c \"RDQM TEST QMGR\" -oa user -t 60000 -u SYSTEM.DEAD.LETTER.QUEUE -lp 3 -ls 2 -lf 4096 -lla -sx -fs 1500M RDQMTEST"
Creating replicated data queue manager configuration.
Secondary queue manager created on 'rdqm1.localdomain'.
Secondary queue manager created on 'rdqm2.localdomain'.
IBM MQ queue manager created.
Directory '/var/mqm/vols/rdqmtest/qmgr/rdqmtest' created.
The queue manager is associated with installation 'Installation1'.
Creating or replacing default objects for queue manager 'RDQMTEST'.
Default objects statistics : 83 created. 0 replaced. 0 failed.
Completing setup.
Setup completed.
Enabling replicated data queue manager.
Replicated data queue manager enabled.
Command '/opt/mqm/bin/crtmqm' run with sudo.
[root@rdqm3 bin]#
STOP THE HA QUEUE MANAGER
[root@rdqm3 bin]# su - mqm -c "endmqm -i RDQMTEST"
Replicated data queue manager disabled.
IBM MQ queue manager 'RDQMTEST' ending.
IBM MQ queue manager 'RDQMTEST' ended.
[root@rdqm3 bin]#
DELETE THE QMGR AND LOG DIRECTORIES FROM THE DRBD MOUNT
[root@rdqm3 bin]# rm -rf /var/mqm/vols/rdqmtest/log/rdqmtest
[root@rdqm3 bin]# rm -rf /var/mqm/vols/rdqmtest/qmgr/rdqmtest
[root@rdqm3 bin]#
RESTORE THE QMGR AND LOG DIRECTORIES FROM THE BACKUPS TAKEN IN STEP 9.
[root@rdqm3 bin]# tar -C /var/mqm/vols/rdqmtest/log -xvpf /var/mqm/backups/RDQMTEST.log.tar
RDQMTEST/
RDQMTEST/amqhlctl.lfh
RDQMTEST/active/
RDQMTEST/active/S0000002.LOG
RDQMTEST/active/S0000001.LOG
RDQMTEST/active/S0000000.LOG
[root@rdqm3 bin]# mv /var/mqm/vols/rdqmtest/log/RDQMTEST /var/mqm/vols/rdqmtest/log/rdqmtest/
[root@rdqm3 bin]# tar -C /var/mqm/vols/rdqmtest/qmgr -xvpf /var/mqm/backups/RDQMTEST.qmgr.tar
RDQMTEST/
RDQMTEST/qm.ini
RDQMTEST/qmstatus.ini
RDQMTEST/blockaddr.ini
RDQMTEST/startprm/
...
RDQMTEST/namelist/
RDQMTEST/namelist/SYSTEM!DEFAULT!NAMELIST
RDQMTEST/namelist/SYSTEM!QPUBSUB!QUEUE!NAMELIST
RDQMTEST/namelist/SYSTEM!QPUBSUB!SUBPOINT!NAMELIST
[root@rdqm3 bin]# mv /var/mqm/vols/rdqmtest/qmgr/RDQMTEST /var/mqm/vols/rdqmtest/qmgr/rdqmtest/
[root@rdqm3 bin]#
FIX THE LOG PATH IN THE QM.INI FILE
[root@rdqm3 bin]# vi /var/mqm/vols/rdqmtest/qmgr/rdqmtest/qm.ini
[root@rdqm3 bin]#
The new log path is /var/mqm/vols/rdqmtest/log/rdqmtest/, so update the LogPath value in the Log: stanza accordingly.
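If you prefer a scripted change to hand-editing with vi, a sed substitution along the following lines should achieve the same result (the paths shown are the ones used in this example; check the result with grep before starting the queue manager):

sed -i 's|LogPath=/var/mqm/log/RDQMTEST/|LogPath=/var/mqm/vols/rdqmtest/log/rdqmtest/|' /var/mqm/vols/rdqmtest/qmgr/rdqmtest/qm.ini
grep LogPath /var/mqm/vols/rdqmtest/qmgr/rdqmtest/qm.ini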
START THE QUEUE MANAGER
[root@rdqm3 bin]# su - mqm -c "strmqm RDQMTEST"
IBM MQ queue manager 'RDQMTEST' starting.
The queue manager is associated with installation 'Installation1'.
6 log records accessed on queue manager 'RDQMTEST' during the log replay phase.
Log replay for queue manager 'RDQMTEST' complete.
Transaction manager state recovered for queue manager 'RDQMTEST'.
Replicated data queue manager enabled.
IBM MQ queue manager 'RDQMTEST' started using V9.2.0.1.
[root@rdqm3 bin]#
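At this point you can also confirm that the queue manager is running under RDQM control on this node, for example (illustrative only; output varies by environment):

su - mqm -c "rdqmstatus -m RDQMTEST"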
FIX THE SSLKEYR LOCATION IN THE QUEUE MANAGER CONFIGURATION
The new SSL key repository location is /var/mqm/vols/rdqmtest/qmgr/rdqmtest/ssl/key.
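Before altering the queue manager, it is worth checking that the key repository files were restored to that location (the path shown is from this example; yours may differ):

ls -l /var/mqm/vols/rdqmtest/qmgr/rdqmtest/ssl/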
[root@rdqm3 bin]# su - mqm -c "echo -e \"alter qmgr sslkeyr('/var/mqm/vols/rdqmtest/qmgr/rdqmtest/ssl/key')\nrefresh security(*) type(ssl)\nend\n\" | runmqsc RDQMTEST"
5724-H72 (C) Copyright IBM Corp. 1994, 2020.
Starting MQSC for queue manager RDQMTEST.
     1 : alter qmgr sslkeyr('/var/mqm/vols/rdqmtest/qmgr/rdqmtest/ssl/key')
AMQ8005I: IBM MQ queue manager changed.
     2 : refresh security(*) type(ssl)
AMQ8560I: IBM MQ security cache refreshed.
     3 : end
2 MQSC commands read.
No commands have a syntax error.
All valid MQSC commands were processed.
[root@rdqm3 bin]#
SET THE PREFERRED NODE OF THE QUEUE MANAGER (NORMALLY TO NODE 1).
This step is optional. You may wish to have no preferred location at all, or to leave the preferred location where it is. The instructions move the preferred location back to node 1.
[root@rdqm3 bin]# su - mqm -c "rdqmadm -p -m RDQMTEST -d"
The preferred replicated data node has been unset for queue manager 'RDQMTEST'.
[root@rdqm3 bin]# su - mqm -c "rdqmadm -p -m RDQMTEST -n rdqm1.localdomain"
The preferred replicated data node has been set to 'rdqm1.localdomain' for queue manager 'RDQMTEST'.
[root@rdqm3 bin]#
SUMMARY
In this short blog series, we have recovered a queue manager from a failed RDQM cluster (in part 1) and then restored full HA cluster operation (in part 2).
For sites where the recovery time objective (RTO) for disaster recovery permits it, this approach allows RDQM to be used for HA without also needing a second RDQM DR cluster.