Hello all,

I'm having a situation here that might be a bug, or maybe it's some mistake on my part.

* Scenario: 2-node cluster on Dell PE-2950 servers, Dell MD-3000 storage (SAS direct-attach), IPMI LAN as the fencing device on each node, 2 NICs per server (public and heartbeat networks), and a quorum disk (qdisk) on the shared storage.

* Problem: if I shut down one node and keep it shut down, and then reboot the other node, CMAN does come up after 5 minutes or so, but rgmanager does not start. I remember having this same problem with RHCS 4.4, and it was solved by upgrading to 4.5. With RHCS 4.4, CMAN didn't come up at all; with my setup on RHCS 5.1, CMAN comes up after giving up waiting for the other node, yet rgmanager doesn't, so services are not started. This is bad in an unattended situation.

Here are the steps and details I've collected from the machine (sorry for such a long message):

* Shut down node1.

* Rebooted node2:
  - after boot, it sat for around 5 minutes on the "start fencing" message
  - reported a startup FAIL for the "cman" service after this period of time

* Boot completed.

* Logged in; clustat reported the cluster as inquorate and the quorum disk as "Offline":

  [root@mrp02 ~]# clustat
  msg_open: No such file or directory
  Member Status: Inquorate

    Member Name                    ID   Status
    ------ ----                    ---- ------
    node1                             1 Offline
    node2                             2 Online, Local
    /dev/sdc1                         0 Offline

* After a few seconds, clustat reported quorate and the quorum disk as "Online":

  [root@mrp02 ~]# clustat
  msg_open: No such file or directory
  Member Status: Quorate

    Member Name                    ID   Status
    ------ ----                    ---- ------
    node1                             1 Offline
    node2                             2 Online, Local
    /dev/sdc1                         0 Online, Quorum Disk

* Logs in /var/log/messages showed that after qdiskd assumed the "master role", cman reported regaining quorum:

  Feb  7 20:06:59 mrp02 qdiskd[5854]: <info> Assuming master role
  Feb  7 20:07:00 mrp02 ccsd[5694]: Cluster is not quorate. Refusing connection.
  Feb  7 20:07:00 mrp02 ccsd[5694]: Error while processing connect: Connection refused
  Feb  7 20:07:00 mrp02 ccsd[5694]: Cluster is not quorate. Refusing connection.
  Feb  7 20:07:00 mrp02 ccsd[5694]: Error while processing connect: Connection refused
  Feb  7 20:07:00 mrp02 openais[5714]: [CMAN ] quorum regained, resuming activity
  Feb  7 20:07:01 mrp02 clurgmgrd[7523]: <notice> Quorum formed, starting
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  -> Note that rgmanager started after quorum was regained, but seemed not to work any more later on (please see below).

  Feb  7 20:07:01 mrp02 kernel: dlm: no local IP address has been set
  Feb  7 20:07:01 mrp02 kernel: dlm: cannot start dlm lowcomms -107

* Noticed that clustat printed an error message:
  -> msg_open: No such file or directory

* Checked rgmanager to see if it was related:

  [root@mrp02 ~]# chkconfig --list rgmanager
  rgmanager       0:off   1:off   2:on    3:on    4:on    5:on    6:off
  [root@mrp02 ~]# service rgmanager status
  clurgmgrd dead but pid file exists

* Since rgmanager did not come back by itself, I restarted it manually:

  [root@mrp02 init.d]# service rgmanager restart
  Starting Cluster Service Manager: dlm: Using TCP for communications    [  OK  ]

* This time clustat did not show the "msg_open" error any more:

  [root@mrp02 init.d]# clustat
  Member Status: Quorate

    Member Name                    ID   Status
    ------ ----                    ---- ------
    node1                             1 Offline
    node2                             2 Online, Local
    /dev/sdc1                         0 Online, Quorum Disk

* It seems to me that when cman regains quorum after having started in a "no quorum" state, rgmanager is not "woken up".

* This setup had no services configured, so I repeated the test after configuring a simple start/stop/status service, using the "crond" init script as an example (see the sketch just below); same results.
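  The test service was nothing more than a script resource wrapping the crond init script, roughly like this (the service and resource names here are only illustrative, not the exact values I used):

  <rm>
          <failoverdomains/>
          <resources/>
          <service autostart="1" name="test_crond">
                  <script file="/etc/init.d/crond" name="crond-script"/>
          </service>
  </rm>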
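  Since the whole point is unattended operation, as a stopgap I'm considering scripting from cron the same check and restart I did by hand. A rough sketch only; it assumes clustat prints "Quorate" once quorum is held and that "service rgmanager status" returns non-zero while clurgmgrd is dead, which matches what I saw above:

  #!/bin/sh
  # Stopgap sketch: if the node is quorate but clurgmgrd is not running,
  # restart rgmanager, mirroring the manual recovery described above.
  if clustat 2>/dev/null | grep -q "Quorate"; then
          if ! service rgmanager status >/dev/null 2>&1; then
                  logger -t rgmanager-check "quorate but clurgmgrd not running; restarting rgmanager"
                  service rgmanager restart
          fi
  fi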
* Copy of /etc/cluster/cluster.conf below.
  -> Notice: I'm using qdiskd with an "always ok" heuristic (/bin/true), since the customer does not have an always-on IP tiebreaker device to use with a "ping" command as the heuristic.

<?xml version="1.0"?>
<cluster config_version="4" name="clu_mrp">
        <quorumd interval="1" label="clu_mrp" min_score="1" tko="30" votes="1">
                <heuristic interval="2" program="/bin/true" score="1"/>
        </quorumd>
        <fence_daemon post_fail_delay="40" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="node1" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device lanplus="1" name="node1-ipmi"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="node2" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device lanplus="1" name="node2-ipmi"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman deadnode_timer="38"/>
        <fencedevices>
                <fencedevice agent="fence_ipmilan" auth="none" ipaddr="node1-ipmi" login="root" name="node1-ipmi" passwd="xxx"/>
                <fencedevice agent="fence_ipmilan" auth="none" ipaddr="node2-ipmi" login="root" name="node2-ipmi" passwd="xxx"/>
        </fencedevices>
        <rm>
                <failoverdomains/>
                <resources/>
        </rm>
</cluster>

Could someone tell me whether this is expected behaviour? Shouldn't rgmanager start up automatically in this case?

Thank you all,

Celso.

--
*Celso Kopp Webber*
celso@xxxxxxxxxxxxxxxx
*Webbertek - Opensource Knowledge*
(41) 8813-1919 - celular
(41) 4063-8448, ramal 102 - fixo

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster