Dealing with RDQM Cluster Loss – Part 2 – Restoration

By Neil Casey

This two-part blog series walks you through dealing with an RDQM cluster loss. This second part covers the restoration stage.

 

INTRODUCTION


Part one of this two-part series discussed options for RDQM disaster recovery and presented a method of restoring queue manager data and logs. The end result was a running stand-alone queue manager which could stand in for the RDQM cluster queue manager until the cluster could be restored.

This second part documents recovering the failed cluster nodes from backup, removing each node from the old cluster configuration, rebuilding the RDQM cluster, and finally returning the queue manager to running as an RDQM queue manager.

Some of the material here is based on the IBM Knowledge Center chapter “Migrating a queue manager to become an HA RDQM queue manager”.

RESTORE SERVERS AND REBUILD CLUSTER

This procedure assumes that backups exist for the cluster members, whether they are virtual machines or bare metal servers. It also assumes that you have existing site procedures to restore those backups onto appropriate replacement VMs or servers.

If your site does not have backups available, you will need to build new servers (or VMs), build a new RDQM cluster from scratch, and follow the IBM instructions to migrate the stand-alone queue manager to the RDQM cluster.

If you have backups of your failed RDQM cluster servers available, then the following procedure can be used to bring them online and rebuild the cluster.

The process begins with the queue manager running on the original survivor node from the RDQM cluster. This node still has settings like the rdqm.ini file in place, but the cluster has been unconfigured.

The procedure is a sequence of steps which should be performed carefully, checking each one for successful completion.

  1. Restore 1 (and only 1) other cluster node
  2. Delete the queue manager definition from the restored node
  3. Delete the cluster definition from the restored node
  4. Restore the second failed cluster node
  5. Delete the queue manager definition from the second restored node
  6. Delete the cluster definition from the second restored node
  7. Create the cluster
  8. Stop the replacement queue manager
  9. Create a backup of the log and queue files of the replacement queue manager
  10. Review the qm.ini file of the RDQM queue manager
  11. Delete the replacement queue manager
  12. Create the HA queue manager (on the survivor node where the backup is)
  13. Stop the HA queue manager
  14. Delete the qmgr and log directories from the DRBD mount
  15. Restore the qmgr and log directories from the backups taken in step 9.
  16. Fix the log path in the qm.ini file
  17. Start the queue manager
  18. Fix the SSLKEYR location in the queue manager configuration
  19. Set the preferred node of the queue manager (normally to node 1).

In the following instructions all commands are performed as the root user. When the ‘mqm’ account should perform the action, ‘su’ is used to switch accounts.

RESTORE 1 (AND ONLY 1) OTHER CLUSTER NODE

Use your site VM or bare metal server recovery procedure to restore one (and only one) of the original RDQM cluster nodes from the failed site. Only one node should be restored, because the obsolete queue manager data is still present on those nodes; if two nodes were restored they could form a quorum and start the queue manager with out-of-date state.

DELETE THE QUEUE MANAGER DEFINITION FROM THE RESTORED NODE

This step deletes the queue manager information from the mqs.ini file and removes the LVM logical volume containing the obsolete queue manager data and logs. The logical volume name is the same as that recovered in part 1, but can be checked using lvdisplay.
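
If you want to confirm the volume name before removing it, listing the logical volumes in the drbdpool volume group with lvdisplay shows the path to pass to lvremove (output not reproduced here; the volume name in this example is rdqmtest_00, but yours may differ):

[root@rdqm1 ~]# lvdisplay drbdpool | grep 'LV Path'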

[root@rdqm1 ~]# su - mqm -c "rmvmqinf RDQMTEST"
IBM MQ configuration information removed.
[root@rdqm1 ~]# lvremove -f /dev/drbdpool/rdqmtest_00
  Logical volume "rdqmtest_00" successfully removed
[root@rdqm1 ~]#

DELETE THE CLUSTER DEFINITION FROM THE RESTORED NODE

Remove the node from the cluster so that it can be part of the rebuilt cluster once all nodes have been restored and prepared.

[root@rdqm1 ~]# rdqmadm -u
Unconfiguring the replicated data subsystem.
The replicated data subsystem has been unconfigured on this node.
The replicated data subsystem should be unconfigured on all replicated data
nodes.
You need to run the following command on nodes 'rdqm2.localdomain' and
'rdqm3.localdomain':
rdqmadm -u
[root@rdqm1 ~]#

RESTORE THE SECOND FAILED CLUSTER NODE

This step restores the second failed RDQM cluster node using your site procedures.

DELETE THE QUEUE MANAGER DEFINITION FROM THE SECOND RESTORED NODE

[root@rdqm2 ~]# su - mqm -c "rmvmqinf RDQMTEST"
IBM MQ configuration information removed.
[root@rdqm2 ~]# lvremove -f /dev/drbdpool/rdqmtest_00
  Logical volume "rdqmtest_00" successfully removed
[root@rdqm2 ~]#

DELETE THE CLUSTER DEFINITION FROM THE SECOND RESTORED NODE

[root@rdqm2 ~]# rdqmadm -u
Unconfiguring the replicated data subsystem.
The replicated data subsystem has been unconfigured on this node.
The replicated data subsystem should be unconfigured on all replicated data
nodes.
You need to run the following command on nodes 'rdqm1.localdomain' and
'rdqm3.localdomain':
rdqmadm -u
[root@rdqm2 ~]#

CREATE THE CLUSTER

[root@rdqm2 ~]# su - mqm -c "rdqmadm -c"
Configuring the replicated data subsystem.
The replicated data subsystem has been configured on this node.
The replicated data subsystem has been configured on 'rdqm1.localdomain'.
The replicated data subsystem has been configured on 'rdqm3.localdomain'.
The replicated data subsystem configuration is complete.
Command '/opt/mqm/bin/rdqmadm' run with sudo.
[root@rdqm2 ~]#

The cluster is now active again, so the following steps migrate the recovered stand-alone queue manager instance back onto the RDQM cluster. The rest of the steps are performed on the node which survived the disaster (rdqm3 in the case of this example).
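
Before continuing, it is worth confirming on each node that the replicated data subsystem has formed correctly. A status summary can be displayed with rdqmstatus (run it on each node; output is not reproduced here, and at this point it should report no replicated data queue managers):

[root@rdqm3 ~]# su - mqm -c "rdqmstatus"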

STOP THE REPLACEMENT QUEUE MANAGER

[root@rdqm3 bin]# su - mqm -c "endmqm -w RDQMTEST"
Waiting for queue manager 'RDQMTEST' to end.
IBM MQ queue manager 'RDQMTEST' ended.
[root@rdqm3 bin]#

CREATE A BACKUP OF THE LOG AND QUEUE FILES OF THE REPLACEMENT QUEUE MANAGER

The backup files will be stored on the /var/mqm volume, so ensure that there is enough space available first.
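
A quick way to check is to compare the free space on /var/mqm with the size of the queue manager data and log directories (output not shown here; the figures depend on your queue manager):

[root@rdqm3 bin]# df -h /var/mqm
[root@rdqm3 bin]# du -sh /var/mqm/log/RDQMTEST /var/mqm/qmgrs/RDQMTEST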

[root@rdqm3 bin]# mkdir -p /var/mqm/backups
[root@rdqm3 bin]# tar -C /var/mqm/log/ -cvf /var/mqm/backups/RDQMTEST.log.tar RDQMTEST
RDQMTEST/
RDQMTEST/amqhlctl.lfh
RDQMTEST/active/
RDQMTEST/active/S0000002.LOG
RDQMTEST/active/S0000001.LOG
RDQMTEST/active/S0000000.LOG
[root@rdqm3 bin]# tar -C /var/mqm/qmgrs/ -cvf /var/mqm/backups/RDQMTEST.qmgr.tar RDQMTEST
RDQMTEST/
RDQMTEST/qm.ini
RDQMTEST/qmstatus.ini
RDQMTEST/blockaddr.ini
RDQMTEST/startprm/
...
RDQMTEST/namelist/
RDQMTEST/namelist/SYSTEM!DEFAULT!NAMELIST
RDQMTEST/namelist/SYSTEM!QPUBSUB!QUEUE!NAMELIST
RDQMTEST/namelist/SYSTEM!QPUBSUB!SUBPOINT!NAMELIST
[root@rdqm3 bin]#

REVIEW THE QM.INI FILE OF THE RDQM QUEUE MANAGER

[root@rdqm3 bin]# cat /var/mqm/qmgrs/RDQMTEST/qm.ini
...
Log:
   LogPrimaryFiles=3
   LogSecondaryFiles=2
   LogFilePages=4096
   LogType=LINEAR
   LogBufferPages=0
   LogPath=/var/mqm/log/RDQMTEST/
   LogWriteIntegrity=TripleWrite
   LogManagement=Automatic
Service:
   Name=AuthorizationService
   EntryPoints=14
   SecurityPolicy=User
...
[root@rdqm3 bin]#

DELETE THE REPLACEMENT QUEUE MANAGER

[root@rdqm3 bin]# su - mqm -c "dltmqm RDQMTEST"
IBM MQ queue manager 'RDQMTEST' deleted.
[root@rdqm3 bin]#

CREATE THE HA QUEUE MANAGER (ON THE SURVIVOR NODE WHERE THE BACKUP IS)

Just like creating the stand-alone replacement queue manager, you need to supply parameter values that match the original configuration of the queue manager. Ensure that the filesystem size (-fs parameter) is sufficient, ideally at least the value used for the original RDQM queue manager; a quick size check is shown after the parameter table below.

crtmqm parameter   qm.ini value name
-lp                LogPrimaryFiles
-ls                LogSecondaryFiles
-lf                LogFilePages
-ll or -lc         LogType
-ll or -lla        LogManagement
-oa                SecurityPolicy
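
As a rough sanity check on the -fs value (output not shown here), compare the size of the backups taken in step 9 with the filesystem size you are requesting:

[root@rdqm3 bin]# du -sh /var/mqm/backups/RDQMTEST.log.tar /var/mqm/backups/RDQMTEST.qmgr.tar
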
[root@rdqm3 bin]# su - mqm -c "crtmqm -c \"RDQM TEST QMGR\" -oa user -t 60000 -u SYSTEM.DEAD.LETTER.QUEUE -lp 3 -ls 2 -lf 4096 -lla -sx -fs 1500M RDQMTEST"
Creating replicated data queue manager configuration.
Secondary queue manager created on 'rdqm1.localdomain'.
Secondary queue manager created on 'rdqm2.localdomain'.
IBM MQ queue manager created.
Directory '/var/mqm/vols/rdqmtest/qmgr/rdqmtest' created.
The queue manager is associated with installation 'Installation1'.
Creating or replacing default objects for queue manager 'RDQMTEST'.
Default objects statistics : 83 created. 0 replaced. 0 failed.
Completing setup.
Setup completed.
Enabling replicated data queue manager.
Replicated data queue manager enabled.
Command '/opt/mqm/bin/crtmqm' run with sudo.
[root@rdqm3 bin]#

STOP THE HA QUEUE MANAGER

[root@rdqm3 bin]# su - mqm -c "endmqm -i RDQMTEST"
Replicated data queue manager disabled.
IBM MQ queue manager 'RDQMTEST' ending.
IBM MQ queue manager 'RDQMTEST' ended.
[root@rdqm3 bin]#

DELETE THE QMGR AND LOG DIRECTORIES FROM THE DRBD MOUNT

[root@rdqm3 bin]# rm -rf /var/mqm/vols/rdqmtest/log/rdqmtest
[root@rdqm3 bin]# rm -rf /var/mqm/vols/rdqmtest/qmgr/rdqmtest
[root@rdqm3 bin]#

RESTORE THE QMGR AND LOG DIRECTORIES FROM THE BACKUPS TAKEN IN STEP 9.

[root@rdqm3 bin]# tar -C /var/mqm/vols/rdqmtest/log -xvpf /var/mqm/backups/RDQMTEST.log.tar
RDQMTEST/
RDQMTEST/amqhlctl.lfh
RDQMTEST/active/
RDQMTEST/active/S0000002.LOG
RDQMTEST/active/S0000001.LOG
RDQMTEST/active/S0000000.LOG
[root@rdqm3 bin]# mv /var/mqm/vols/rdqmtest/log/RDQMTEST /var/mqm/vols/rdqmtest/log/rdqmtest/
[root@rdqm3 bin]# tar -C /var/mqm/vols/rdqmtest/qmgr -xvpf /var/mqm/backups/RDQMTEST.qmgr.tar
RDQMTEST/
RDQMTEST/qm.ini
RDQMTEST/qmstatus.ini
RDQMTEST/blockaddr.ini
RDQMTEST/startprm/
...
RDQMTEST/namelist/
RDQMTEST/namelist/SYSTEM!DEFAULT!NAMELIST
RDQMTEST/namelist/SYSTEM!QPUBSUB!QUEUE!NAMELIST
RDQMTEST/namelist/SYSTEM!QPUBSUB!SUBPOINT!NAMELIST
[root@rdqm3 bin]# mv /var/mqm/vols/rdqmtest/qmgr/RDQMTEST /var/mqm/vols/rdqmtest/qmgr/rdqmtest/
[root@rdqm3 bin]#

FIX THE LOG PATH IN THE QM.INI FILE

[root@rdqm3 bin]# vi /var/mqm/vols/rdqmtest/qmgr/rdqmtest/qm.ini
[root@rdqm3 bin]#

The LogPath entry in the Log stanza must be changed to the new location: /var/mqm/vols/rdqmtest/log/rdqmtest/
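
If you prefer a non-interactive edit, a sed substitution along these lines achieves the same result (this assumes LogPath appears on its own line, as in the qm.ini listing shown earlier):

[root@rdqm3 bin]# sed -i 's|^\( *LogPath=\).*|\1/var/mqm/vols/rdqmtest/log/rdqmtest/|' /var/mqm/vols/rdqmtest/qmgr/rdqmtest/qm.ini
[root@rdqm3 bin]# grep LogPath /var/mqm/vols/rdqmtest/qmgr/rdqmtest/qm.ini
   LogPath=/var/mqm/vols/rdqmtest/log/rdqmtest/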

START THE QUEUE MANAGER

[root@rdqm3 bin]# su - mqm -c "strmqm RDQMTEST"
IBM MQ queue manager 'RDQMTEST' starting.
The queue manager is associated with installation 'Installation1'.
6 log records accessed on queue manager 'RDQMTEST' during the log replay phase.
Log replay for queue manager 'RDQMTEST' complete.
Transaction manager state recovered for queue manager 'RDQMTEST'.
Replicated data queue manager enabled.
IBM MQ queue manager 'RDQMTEST' started using V9.2.0.1.
[root@rdqm3 bin]#

FIX THE SSLKEYR LOCATION IN THE QUEUE MANAGER CONFIGURATION

The new SSL key repository location is /var/mqm/vols/rdqmtest/qmgr/rdqmtest/ssl/key.

[root@rdqm3 bin]# su - mqm -c "echo -e \"alter qmgr sslkeyr('/var/mqm/vols/rdqmtest/qmgr/rdqmtest/ssl/key')\nrefresh security(*) type(ssl)\nend\n\" | runmqsc RDQMTEST"
5724-H72 (C) Copyright IBM Corp. 1994, 2020.
Starting MQSC for queue manager RDQMTEST.


     1 : alter qmgr sslkeyr('/var/mqm/vols/rdqmtest/qmgr/rdqmtest/ssl/key')
AMQ8005I: IBM MQ queue manager changed.
     2 : refresh security(*) type(ssl)
AMQ8560I: IBM MQ security cache refreshed.
     3 : end
2 MQSC commands read.
No commands have a syntax error.
All valid MQSC commands were processed.
[root@rdqm3 bin]#

SET THE PREFERRED NODE OF THE QUEUE MANAGER (NORMALLY TO NODE 1).

This step is optional. You may prefer to have no preferred location at all, or to leave the preferred location where it is. The commands below unset the current preferred location and then set it back to node 1.

[root@rdqm3 bin]# su - mqm -c "rdqmadm -p -m RDQMTEST -d"
The preferred replicated data node has been unset for queue manager 'RDQMTEST'.
[root@rdqm3 bin]# su - mqm -c "rdqmadm -p -m RDQMTEST -n rdqm1.localdomain"
The preferred replicated data node has been set to 'rdqm1.localdomain' for
queue manager 'RDQMTEST'.
[root@rdqm3 bin]#
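
As a final check (output not reproduced here), the status of the restored queue manager, including its current primary node and the state of the data replication, can be displayed with rdqmstatus:

[root@rdqm3 bin]# su - mqm -c "rdqmstatus -m RDQMTEST"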

SUMMARY

In this short blog series, we recovered a queue manager from a failed RDQM cluster (in part 1) and then restored full HA cluster operation (in part 2).

For sites where the recovery time objective (RTO) permits it, this approach allows RDQM to be used for HA without also needing a second RDQM DR cluster.

 
