Dealing with RDQM Cluster Loss - Part 1 - Recovery

By Neil Casey

This two-part blog series walks you through how to deal with an RDQM cluster loss. This first part covers the recovery stage.

Part 1 - Recovery

Introduction


IBM MQ v9.0.4 introduced a new capability when running on Red Hat Enterprise Linux. Replicated Data Queue Manager (RDQM) provides high availability by replicating the queue manager log and data files to independent file systems on separate cluster nodes. The underlying technology includes DRBD and Pacemaker. This paper discusses RDQM at MQ v9.2.

RDQM has received many enhancements through the 9.0 and 9.1 Continuous Delivery process, and now at v9.2 includes several capabilities:

  • Synchronous replication in a cluster providing high availability (HA) with three nodes
  • Synchronous or asynchronous replication providing a disaster recovery (DR) capability with two nodes
  • Combined HA and DR using two clusters each consisting of three nodes (a total of six nodes).

The HA options allow automatic failover to another node in the cluster providing only a single node has failed. If more than one node fails, then the cluster no longer has a quorum and the queue manager cannot run.

A quorum means that a majority of the nodes are still active and connected to each other. That is, two out of three are still alive.

The DR options allow an administrator to confirm that the normal active node or cluster has failed, and to manually start the queue manager on the DR node or cluster.

The combined HA/DR capability provides excellent availability and durability of the persistent messages under the control of the queue manager but requires a significant investment in infrastructure. This can be justified when there are many queue managers which will run spread across the six nodes. It could be expensive when an organisation just requires one or two queue managers but needs the HA and DR capability.

One option for reducing the node count for providing HA and DR is to run the HA cluster with three nodes spread across two nearby data centres.

The data centres must be located close together because replication performance degrades rapidly as network latency between the sites increases. Reasonable performance can be achieved with latency up to 1 or 2 ms. IBM supports configurations with latency up to 5 ms but performance will suffer.

Two of the RDQM cluster nodes are placed in one data centre (DC-A), while the third node is at the second data centre (DC-B).

With this layout, the impact of losing a data centre is not symmetrical. If DC-B is lost, the queue manager continues to run because quorum is maintained between the two nodes at DC-A. The only impact is that DR capability is lost until DC-B is recovered. However, if DC-A is lost, the cluster loses quorum and no queue manager in the cluster can run.
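
The asymmetry follows directly from the majority rule described earlier. The sketch below is only an illustration of that arithmetic for the two-plus-one layout; the function name has_quorum and the variable names are hypothetical, and it does not query Pacemaker or any real cluster state.

```shell
#!/bin/sh
# has_quorum ALIVE TOTAL
# Succeeds (exit 0) when a strict majority of TOTAL nodes is still alive.
has_quorum() {
    [ $(( $1 * 2 )) -gt "$2" ]
}

# Layout from the text: two nodes in DC-A, one node in DC-B.
DC_A_NODES=2
DC_B_NODES=1
TOTAL=$(( DC_A_NODES + DC_B_NODES ))

# Losing DC-B leaves the two DC-A nodes; losing DC-A leaves the one DC-B node.
for lost in DC-B DC-A; do
    if [ "$lost" = "DC-B" ]; then alive=$DC_A_NODES; else alive=$DC_B_NODES; fi
    if has_quorum "$alive" "$TOTAL"; then
        echo "Loss of $lost: $alive of $TOTAL nodes alive, quorum kept"
    else
        echo "Loss of $lost: $alive of $TOTAL nodes alive, quorum lost"
    fi
done
```

Run as written, the loop reports that quorum is kept when DC-B is lost and lost when DC-A is lost, which is the case this paper goes on to recover from.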

The data has been synchronously replicated to the cluster node in DC-B though, so this paper describes a way to recover that data and recreate the lost queue manager. We only discuss a single queue manager, but the steps can be reused to recover further queue managers within the RDQM cluster. Part 2 will show how to restore the queue manager to running in an RDQM cluster.

Recover a queue manager when quorum is lost


When only one of the three RDQM cluster nodes survives a site disaster, the clustering software will not mount the replicated data volume, and the RDQM queue manager will not run.

This paper describes how to mount the volume and recover the data, and then create a replacement queue manager to stand in for the HA RDQM queue manager until the original site is recovered and the RDQM cluster can be recreated.

Our queue manager is called RDQMTEST. It was created using the following command:

$ crtmqm -c "Test RDQM recovery" -sx -fs 1500M -lla -lp 3 -ls 2 -lf 4096 RDQMTEST

It is important to know how the queue manager was created before we try to create a replacement which will reuse the recovered files. If the original build instructions for the queue manager are not available, the critical logging information can be obtained from the Log: and Service: stanzas of the qm.ini file of the RDQM queue manager.

The recovery process takes several steps which must be performed carefully and in order.

  1. On the surviving node, mount the DRBD filesystem in read only mode
  2. Create a backup of the log and qmgr directories of the queue manager
  3. Review the qm.ini file of the RDQM queue manager
  4. Unmount the DRBD filesystem
  5. Delete the queue manager definition
  6. Delete the drbd logical volume for the queue manager
  7. Delete the cluster definition from the surviving node
  8. Create replacement queue manager on local storage
  9. Delete the log and queue files for the local queue manager
  10. Restore the log and queue files from the HA queue manager backups (from step 2)
  11. Fix the log path in the replaced qm.ini file
  12. Start the replacement queue manager to roll forward the logs and complete recovery
  13. Fix the SSLKEYR location in the queue manager configuration.

In the instructions, each command entered by the user follows a shell prompt (the original post shows these commands in bold text). Lines without a prompt show output from the system.

Ellipses (…) are used to reduce the volume of output and indicate that some lines of text have been removed.

On the surviving node, mount the DRBD filesystem in read only mode.

Run the lvdisplay command to view the device details of the drbd volume:

[root@rdqm3 ~]# lvdisplay
  --- Logical volume ---
  LV Path                /dev/drbdpool/rdqmtest_00
  LV Name                rdqmtest_00
  VG Name                drbdpool
  LV UUID                tgJJJT-yLrJ-GgCq-mNBE-hPLg-hf2y-HbrHpG
  LV Write Access        read/write
  LV Creation host, time rdqm3.localdomain, 2020-12-02 18:21:55 -0800
  LV Status              available
  # open                 2
  LV Size                1.46 GiB
  Current LE             375
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:0
   
[root@rdqm3 ~]#

Take note of the LV Path.

Create a directory to act as a mount point for the drbd volume.

[root@rdqm3 ~]# mkdir /mnt/mqtmp
[root@rdqm3 ~]#

Mount the filesystem (read only) to the temporary mount point.

[root@rdqm3 ~]# mount -o ro /dev/drbdpool/rdqmtest_00 /mnt/mqtmp
[root@rdqm3 ~]#

Check that the mounted volume contains the expected directories.

[root@rdqm3 ~]# ls -l /mnt/mqtmp
total 24
drwxrwsr-x. 3 mqm  mqm   4096 Dec  2 18:38 log
drwx------. 2 root root 16384 Dec  2 18:21 lost+found
drwxrwsr-x. 3 mqm  mqm   4096 Dec  2 18:38 qmgr
[root@rdqm3 ~]#
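
If the recovery is being scripted rather than typed, the visual check of the listing can be replaced by a small test. This is a minimal sketch, assuming the volume layout is exactly the one shown above (a log and a qmgr directory at the top level); check_rdqm_mount is a hypothetical helper name, not an MQ command.

```shell
# check_rdqm_mount MOUNTPOINT
# Succeeds when MOUNTPOINT contains the log and qmgr directories seen in
# the listing above; otherwise reports what is missing and fails.
check_rdqm_mount() (
    mnt=$1
    rc=0
    for d in log qmgr; do
        if [ ! -d "$mnt/$d" ]; then
            echo "missing directory: $mnt/$d" >&2
            rc=1
        fi
    done
    exit "$rc"
)

# Usage with the mount point from this post:
#   check_rdqm_mount /mnt/mqtmp && echo "volume looks like an RDQM volume"
```

The function body runs in a subshell, so its variables and its exit do not affect the calling shell.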

Create a backup of the log and qmgr directories of the queue manager

Although we could just copy the data and log files directly out of the mounted volume to a local disk, we will instead create tar files and unmount the volume before creating the replacement queue manager.

We do need to check that there is enough space on our local system to hold the queue manager data and log files. Mount a new volume at /var/mqm if there is not sufficient space available.

[root@rdqm3 ~]# df -h /var/mqm
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p3   18G   11G  7.5G  58% /
[root@rdqm3 ~]# df -h /mnt/mqtmp
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/drbdpool-rdqmtest_00  1.5G   54M  1.3G   4% /mnt/mqtmp
[root@rdqm3 ~]#

On my system, the queue manager is only using 54MB, and the target (/var/mqm) has 7.5GB free. There is plenty of space for two copies of the data and logs.
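
The same comparison can be done numerically. The following is a rough sketch, assuming POSIX-format df output and filesystem names without spaces; kb_used, kb_avail and space_ok are hypothetical helper names. Two copies are assumed because the tar archives and the restored files will both live under /var/mqm.

```shell
# kb_used PATH / kb_avail PATH
# Print the "Used" / "Available" column (in KiB) from POSIX-format df
# output for the filesystem that holds PATH.
kb_used()  { df -Pk "$1" | awk 'NR == 2 { print $3 }'; }
kb_avail() { df -Pk "$1" | awk 'NR == 2 { print $4 }'; }

# space_ok USED_KB AVAIL_KB COPIES
# Succeeds when COPIES copies of USED_KB fit into AVAIL_KB.
space_ok() {
    [ $(( $1 * $3 )) -le "$2" ]
}

# Usage with the paths from this post:
#   space_ok "$(kb_used /mnt/mqtmp)" "$(kb_avail /var/mqm)" 2 \
#       && echo "enough space for two copies"
```

The check is deliberately conservative: it uses the space consumed on the whole DRBD volume, which is an upper bound on the size of the log and qmgr directories.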

Create a directory on /var/mqm to hold the backups

[root@rdqm3 ~]# mkdir /var/mqm/backups
[root@rdqm3 ~]#

Back up the queue manager log and data files from the drbd volume to tar archive files on /var/mqm/backups

[root@rdqm3 ~]# tar -C /mnt/mqtmp/log/ -cvf /var/mqm/backups/rdqmtest.log.tar rdqmtest
rdqmtest/
rdqmtest/amqhlctl.lfh
rdqmtest/active/
rdqmtest/active/S0000002.LOG
rdqmtest/active/S0000001.LOG
rdqmtest/active/S0000000.LOG
[root@rdqm3 ~]# tar -C /mnt/mqtmp/qmgr/ -cvf /var/mqm/backups/rdqmtest.qmgr.tar rdqmtest
rdqmtest/
rdqmtest/qm.ini
rdqmtest/qmstatus.ini
...
rdqmtest/namelist/SYSTEM!DEFAULT!NAMELIST
rdqmtest/namelist/SYSTEM!QPUBSUB!QUEUE!NAMELIST
rdqmtest/namelist/SYSTEM!QPUBSUB!SUBPOINT!NAMELIST
[root@rdqm3 ~]#
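
Because the logical volume is deleted in a later step, it may be worth confirming that each archive is complete before moving on. The sketch below wraps the same tar invocation and compares the number of archive entries with the number of files and directories in the source tree. backup_dir is a hypothetical helper name, and the count check assumes no file names contain newlines and no sockets are present (tar skips sockets).

```shell
# backup_dir PARENT NAME ARCHIVE
# Archives PARENT/NAME into ARCHIVE with paths stored relative to PARENT,
# then checks that the archive lists as many entries as find sees on disk.
backup_dir() (
    parent=$1 name=$2 archive=$3
    tar -C "$parent" -cf "$archive" "$name" || exit 1
    in_tar=$(( $(tar -tf "$archive" | wc -l) ))
    on_disk=$(( $(cd "$parent" && find "$name" | wc -l) ))
    if [ "$in_tar" -ne "$on_disk" ]; then
        echo "entry count mismatch: archive=$in_tar disk=$on_disk" >&2
        exit 1
    fi
    echo "$archive: $in_tar entries archived"
)

# Usage with the paths from this post:
#   backup_dir /mnt/mqtmp/log  rdqmtest /var/mqm/backups/rdqmtest.log.tar
#   backup_dir /mnt/mqtmp/qmgr rdqmtest /var/mqm/backups/rdqmtest.qmgr.tar
```

An entry count is a weak check compared with a full content comparison, but it catches a truncated or interrupted archive cheaply.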

Review the qm.ini file of the RDQM queue manager

[mqm@rdqm3 rdqmtest]$ cat /mnt/mqtmp/qmgr/rdqmtest/qm.ini
...
Log:
   LogPrimaryFiles=3
   LogSecondaryFiles=2
   LogFilePages=4096
   LogType=LINEAR
   LogBufferPages=0
   LogPath=/var/mqm/vols/rdqmtest/log/rdqmtest/
   LogWriteIntegrity=TripleWrite
   LogManagement=Automatic
Service:
   Name=AuthorizationService
   EntryPoints=14
   SecurityPolicy=User
...
[mqm@rdqm3 rdqmtest]$
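
When several queue managers need to be recovered, reading the values by eye becomes error prone. The following is a minimal sketch of pulling a single value out of a named stanza with awk. It assumes the layout shown above (an unindented stanza header ending in a colon, followed by indented key=value lines); qmini_get is a hypothetical helper name.

```shell
# qmini_get FILE STANZA KEY
# Prints the value of KEY inside the named stanza (for example "Log").
# A stanza starts at an unindented "Name:" line and ends at the next one.
# Only the first match is printed.
qmini_get() {
    awk -v stanza="$2:" -v key="$3" '
        /^[^ \t#].*:[ \t]*$/ { in_stanza = ($1 == stanza); next }
        in_stanza {
            line = $0
            sub(/^[ \t]+/, "", line)          # drop the indentation
            eq = index(line, "=")
            if (eq > 0 && substr(line, 1, eq - 1) == key) {
                print substr(line, eq + 1)
                exit
            }
        }
    ' "$1"
}

# Usage with the file from this post:
#   qmini_get /mnt/mqtmp/qmgr/rdqmtest/qm.ini Log LogFilePages
#   qmini_get /mnt/mqtmp/qmgr/rdqmtest/qm.ini Service SecurityPolicy
```

A qm.ini file can contain more than one Service: stanza, so for keys in that stanza it is worth checking the file by hand as well.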

Unmount the DRBD filesystem

[root@rdqm3 ~]# umount /mnt/mqtmp
[root@rdqm3 ~]# rmdir /mnt/mqtmp

Delete the queue manager definition

[root@rdqm3 ~]# su - mqm -c "rmvmqinf RDQMTEST"
IBM MQ configuration information removed.
[root@rdqm3 ~]#

Delete the drbd logical volume for the queue manager

[root@rdqm3 ~]# lvremove -f /dev/drbdpool/rdqmtest_00
  Logical volume "rdqmtest_00" successfully removed
[root@rdqm3 ~]#

Delete the cluster definition from the surviving node

[root@rdqm3 ~]# rdqmadm -u
Unconfiguring the replicated data subsystem.
The replicated data subsystem has been unconfigured on this node.
The replicated data subsystem should be unconfigured on all replicated data
nodes.
You need to run the following command on nodes 'rdqm1.localdomain' and
'rdqm2.localdomain':
rdqmadm -u
[root@rdqm3 ~]#

Create replacement queue manager on local storage

The command to create the replacement queue manager must use the same parameters that were in effect on the RDQM queue manager, apart from the RDQM-related values. Extract the information for the parameters from the qm.ini file displayed previously:

crtmqm parameter    qm.ini value name
-lp                 LogPrimaryFiles
-ls                 LogSecondaryFiles
-lf                 LogFilePages
-ll or -lc          LogType
-ll or -lla         LogManagement
-oa                 SecurityPolicy

[root@rdqm3 ~]# su - mqm -c "crtmqm -c \"RDQM Recovery Test QMGR\" -oa user -t 60000 -u SYSTEM.DEAD.LETTER.QUEUE -lp 3 -ls 2 -lf 4096 -lla RDQMTEST"
IBM MQ queue manager created.
Directory '/var/mqm/qmgrs/RDQMTEST' created.
The queue manager is associated with installation 'Installation1'.
Creating or replacing default objects for queue manager 'RDQMTEST'.
Default objects statistics : 83 created. 0 replaced. 0 failed.
Completing setup.
Setup completed.
[root@rdqm3 ~]#
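
The mapping from qm.ini values to crtmqm options can be sketched as a small function. This covers only the combinations listed in the table in this section; it assumes that LogManagement=Manual is the value that corresponds to plain -ll, and that any SecurityPolicy other than User maps to the group-based default. crtmqm_flags is a hypothetical helper name, and anything the sketch does not recognise is rejected rather than guessed.

```shell
# crtmqm_flags PRIMARY SECONDARY PAGES LOGTYPE LOGMGMT SECPOLICY
# Prints the crtmqm logging and authorization options that match the
# given qm.ini values.
crtmqm_flags() (
    case "$4:$5" in
        CIRCULAR:*)        logflag=-lc  ;;
        LINEAR:Automatic)  logflag=-lla ;;
        LINEAR:Manual)     logflag=-ll  ;;
        *) echo "combination not covered by this sketch: $4/$5" >&2
           exit 1 ;;
    esac
    case "$6" in
        User|user) oa=user  ;;
        *)         oa=group ;;   # assumption: everything else is group-based
    esac
    echo "-oa $oa -lp $1 -ls $2 -lf $3 $logflag"
)

# Usage with the values shown in the qm.ini above:
#   crtmqm_flags 3 2 4096 LINEAR Automatic User
```

For the RDQMTEST values this yields the same logging and authorization options as the crtmqm command above; the -c, -t and -u options are not derived from qm.ini and still have to be supplied by hand.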

Delete the log and queue files for the local queue manager

[root@rdqm3 ~]# rm -rf /var/mqm/qmgrs/RDQMTEST/
[root@rdqm3 ~]# rm -rf /var/mqm/log/RDQMTEST/
[root@rdqm3 ~]#

Restore the log and queue files from the HA queue manager backups (from step 2)

[root@rdqm3 ~]# tar -C /var/mqm/log/ -xpvf /var/mqm/backups/rdqmtest.log.tar 
rdqmtest/
rdqmtest/amqhlctl.lfh
rdqmtest/active/
rdqmtest/active/S0000002.LOG
rdqmtest/active/S0000001.LOG
rdqmtest/active/S0000000.LOG
[root@rdqm3 ~]# mv /var/mqm/log/rdqmtest/ /var/mqm/log/RDQMTEST/
[root@rdqm3 ~]# tar -C /var/mqm/qmgrs/ -xpvf /var/mqm/backups/rdqmtest.qmgr.tar 
rdqmtest/
rdqmtest/qm.ini
rdqmtest/qmstatus.ini
rdqmtest/blockaddr.ini
...
rdqmtest/namelist/
rdqmtest/namelist/SYSTEM!DEFAULT!NAMELIST
rdqmtest/namelist/SYSTEM!QPUBSUB!QUEUE!NAMELIST
rdqmtest/namelist/SYSTEM!QPUBSUB!SUBPOINT!NAMELIST
[root@rdqm3 ~]# mv /var/mqm/qmgrs/rdqmtest/ /var/mqm/qmgrs/RDQMTEST/
[root@rdqm3 ~]#
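
Note that the mv commands above only rename the directory if the target does not already exist. If the earlier rm step was skipped, mv would instead move rdqmtest inside the existing RDQMTEST directory. The sketch below combines the extract and rename and refuses to continue in that situation; restore_as is a hypothetical helper name.

```shell
# restore_as ARCHIVE DEST_PARENT OLD_NAME NEW_NAME
# Extracts ARCHIVE under DEST_PARENT (preserving permissions) and renames
# the top-level directory, refusing to touch an existing NEW_NAME.
restore_as() (
    archive=$1 dest=$2 old=$3 new=$4
    if [ -e "$dest/$new" ]; then
        echo "$dest/$new already exists; remove it first" >&2
        exit 1
    fi
    tar -C "$dest" -xpf "$archive" || exit 1
    mv "$dest/$old" "$dest/$new"
)

# Usage with the paths from this post:
#   restore_as /var/mqm/backups/rdqmtest.log.tar  /var/mqm/log   rdqmtest RDQMTEST
#   restore_as /var/mqm/backups/rdqmtest.qmgr.tar /var/mqm/qmgrs rdqmtest RDQMTEST
```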

Fix the log path in the replaced qm.ini file

[root@rdqm3 ~]# vi /var/mqm/qmgrs/RDQMTEST/qm.ini

New log path is /var/mqm/log/RDQMTEST/
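
For a scripted recovery, the same one-line change can be made with sed instead of an interactive editor. This is a sketch only, assuming the file contains a single LogPath= line as in the qm.ini shown earlier; set_log_path is a hypothetical helper name.

```shell
# set_log_path QM_INI NEW_PATH
# Rewrites the LogPath= line in place, keeping its indentation, and leaves
# the previous version as QM_INI.bak. "|" is used as the sed delimiter, so
# NEW_PATH must not contain "|", "&" or backslashes.
set_log_path() {
    sed -i.bak "s|^\([[:space:]]*LogPath=\).*|\1$2|" "$1"
}

# Usage with the paths from this post:
#   set_log_path /var/mqm/qmgrs/RDQMTEST/qm.ini /var/mqm/log/RDQMTEST/
```

Keeping the .bak copy makes it easy to diff the two files and confirm that only the LogPath line changed.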

Start the replacement queue manager to roll forward the logs and complete recovery

[root@rdqm3 ~]# su - mqm -c "strmqm RDQMTEST"
IBM MQ queue manager 'RDQMTEST' starting.
The queue manager is associated with installation 'Installation1'.
11 log records accessed on queue manager 'RDQMTEST' during the log replay phase.
Log replay for queue manager 'RDQMTEST' complete.
Transaction manager state recovered for queue manager 'RDQMTEST'.
IBM MQ queue manager 'RDQMTEST' started using V9.2.0.1.
[root@rdqm3 ~]#

Fix the SSLKEYR location in the queue manager configuration.

The new SSL key repository location is /var/mqm/qmgrs/RDQMTEST/ssl/key.

[root@rdqm3 ~]# su - mqm -c "echo -e \"alter qmgr sslkeyr('/var/mqm/qmgrs/RDQMTEST/ssl/key')\nrefresh security(*) type(ssl)\nend\n\" | runmqsc RDQMTEST"
5724-H72 (C) Copyright IBM Corp. 1994, 2020.
Starting MQSC for queue manager RDQMTEST.


     1 : alter qmgr sslkeyr('/var/mqm/qmgrs/RDQMTEST/ssl/key')
AMQ8005I: IBM MQ queue manager changed.
     2 : refresh security(*) type(ssl)
AMQ8560I: IBM MQ security cache refreshed.
     3 : end
2 MQSC commands read.
No commands have a syntax error.
All valid MQSC commands were processed.
[root@rdqm3 ~]#
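
The nested quoting in the command above is easy to get wrong. One alternative, sketched here, is to generate the MQSC text with a small function and pipe it into runmqsc. The function only prints the same three commands used above; sslkeyr_mqsc is a hypothetical helper name.

```shell
# sslkeyr_mqsc KEY_STEM
# Prints the MQSC commands used above for the given key repository stem.
sslkeyr_mqsc() {
    printf "alter qmgr sslkeyr('%s')\n" "$1"
    printf '%s\n' 'refresh security(*) type(ssl)' 'end'
}

# Usage, run as the mqm user:
#   sslkeyr_mqsc /var/mqm/qmgrs/RDQMTEST/ssl/key | runmqsc RDQMTEST
```

Printing the commands first also gives a chance to review them before they are applied to the queue manager.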

Results

The queue manager is available again. It is running as a single-instance queue manager on the surviving cluster node. The queue manager ID and persistent messages have been preserved. However, the queue manager no longer provides high availability. Restoring high availability by rebuilding the cluster and migrating the stand-alone queue manager into the RDQM cluster will be covered in Part 2.


Neil Casey is a Senior Consultant and IBM Champion for IBM MQ. With a solid background in enterprise computing and integration, Neil designs and delivers uncomplicated, powerful and robust solutions. With a focus on security and performance, Neil focuses on finding technically elegant solutions to the most challenging integration problems.