Hello,

I forgot to add some versioning information for the cluster packages, here it is:

* Main cluster packages:
  cman-2.0.73-1.el5_1.1.x86_64.rpm
  openais-0.80.3-7.el5.x86_64.rpm
  perl-Net-Telnet-3.03-5.noarch.rpm

* Admin tools packages:
  Cluster_Administration-en-US-5.1.0-7.noarch.rpm
  cluster-cim-0.10.0-5.el5_1.1.x86_64.rpm
  cluster-snmp-0.10.0-5.el5_1.1.x86_64.rpm
  luci-0.10.0-6.el5.x86_64.rpm
  modcluster-0.10.0-5.el5_1.1.x86_64.rpm
  rgmanager-2.0.31-1.el5.x86_64.rpm
  ricci-0.10.0-6.el5.x86_64.rpm
  system-config-cluster-1.0.50-1.3.noarch.rpm
  tog-pegasus-2.6.1-2.el5_1.1.*.rpm
  oddjob-*.rpm

Thank you,

Celso.

On Fri, 8 Feb 2008 11:18:20 -0200, Celso K. Webber wrote:
> Hello all,
>
> I'm having a situation here that might be a bug, or maybe it's a
> mistake on my part.
>
> * Scenario: 2-node cluster on Dell PE-2950 servers, Dell MD-3000
>   storage (SAS direct-attach), IPMI LAN fencing devices, 2 NICs on each
>   server (public and heartbeat networks), qdisk on the shared storage.
>
> * Problem: if I shut down one node and keep it shut down, and then
>   reboot the other node, CMAN comes up after 5 minutes or so, but
>   rgmanager does not start.
>
> I remember having this same problem with RHCS 4.4, and it was solved by
> upgrading to 4.5. With RHCS 4.4, however, CMAN did not come up at all;
> with my setup on RHCS 5.1, CMAN comes up after giving up waiting for
> the other node, but rgmanager does not, so services do not get started.
> This is bad in an unattended situation.
>
> Here are some steps and details I've collected from the machine (sorry
> for such a long message):
>
> * Shut down node1.
>
> * Rebooted node2:
>   - after boot, it spent around 5 minutes at the "start fencing" message
>   - it reported a startup FAIL for the "cman" service after this period
>     of time
>
> * Boot completed.
>
> * Logged in:
>   - clustat reported the cluster inquorate and the quorum disk "Offline":
>
>   [root@mrp02 ~]# clustat
>   msg_open: No such file or directory
>   Member Status: Inquorate
>
>   Member Name                   ID   Status
>   ------ ----                   ---- ------
>   node1                            1 Offline
>   node2                            2 Online, Local
>   /dev/sdc1                        0 Offline
>
> * After a few seconds, clustat reported the cluster quorate and the
>   quorum disk "Online":
>
>   [root@mrp02 ~]# clustat
>   msg_open: No such file or directory
>   Member Status: Quorate
>
>   Member Name                   ID   Status
>   ------ ----                   ---- ------
>   node1                            1 Offline
>   node2                            2 Online, Local
>   /dev/sdc1                        0 Online, Quorum Disk
>
> * Logs in /var/log/messages showed that after qdiskd assumed the
>   "master role", CMAN reported regaining quorum:
>
>   Feb  7 20:06:59 mrp02 qdiskd[5854]: <info> Assuming master role
>   Feb  7 20:07:00 mrp02 ccsd[5694]: Cluster is not quorate.  Refusing connection.
>   Feb  7 20:07:00 mrp02 ccsd[5694]: Error while processing connect: Connection refused
>   Feb  7 20:07:00 mrp02 ccsd[5694]: Cluster is not quorate.  Refusing connection.
>   Feb  7 20:07:00 mrp02 ccsd[5694]: Error while processing connect: Connection refused
>   Feb  7 20:07:00 mrp02 openais[5714]: [CMAN ] quorum regained, resuming activity
>   Feb  7 20:07:01 mrp02 clurgmgrd[7523]: <notice> Quorum formed, starting
>   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   -> Note that rgmanager started after quorum was regained, but it
>      seemed not to work later on (please see below).
>   Feb  7 20:07:01 mrp02 kernel: dlm: no local IP address has been set
>   Feb  7 20:07:01 mrp02 kernel: dlm: cannot start dlm lowcomms -107
>
> * Noticed that "clustat" showed an error message:
>   -> msg_open: No such file or directory
>
> * Checked rgmanager to see if it was related:
>
>   [root@mrp02 ~]# chkconfig --list rgmanager
>   rgmanager       0:off   1:off   2:on    3:on    4:on    5:on    6:off
>   [root@mrp02 ~]# service rgmanager status
>   clurgmgrd dead but pid file exists
>
> * Since rgmanager did not come back by itself, I restarted it manually:
>
>   [root@mrp02 init.d]# service rgmanager restart
>   Starting Cluster Service Manager: dlm: Using TCP for communications
>                                                              [  OK  ]
>
> * This time clustat no longer showed the "msg_open" error:
>
>   [root@mrp02 init.d]# clustat
>   Member Status: Quorate
>
>   Member Name                   ID   Status
>   ------ ----                   ---- ------
>   node1                            1 Offline
>   node2                            2 Online, Local
>   /dev/sdc1                        0 Online, Quorum Disk
>
> * It seems to me that when cman regains quorum after starting in an
>   initial "no quorum" state, rgmanager is not "woken up".
>
> * This setup had no services configured, so I repeated the test after
>   configuring a simple start/stop/status service based on the "crond"
>   init script (see the sketch at the end of this message), with the
>   same results.
>
> * Copy of /etc/cluster/cluster.conf:
>   -> Notice: I'm using qdiskd with an "always ok" heuristic, since the
>      customer does not have an always-on IP tiebreaker device to use
>      with a "ping" command as a heuristic (a ping-based example is
>      also sketched at the end of this message).
>
> <?xml version="1.0"?>
> <cluster config_version="4" name="clu_mrp">
>     <quorumd interval="1" label="clu_mrp" min_score="1" tko="30" votes="1">
>         <heuristic interval="2" program="/bin/true" score="1"/>
>     </quorumd>
>     <fence_daemon post_fail_delay="40" post_join_delay="3"/>
>     <clusternodes>
>         <clusternode name="node1" nodeid="1" votes="1">
>             <fence>
>                 <method name="1">
>                     <device lanplus="1" name="node1-ipmi"/>
>                 </method>
>             </fence>
>         </clusternode>
>         <clusternode name="node2" nodeid="2" votes="1">
>             <fence>
>                 <method name="1">
>                     <device lanplus="1" name="node2-ipmi"/>
>                 </method>
>             </fence>
>         </clusternode>
>     </clusternodes>
>     <cman deadnode_timer="38"/>
>     <fencedevices>
>         <fencedevice agent="fence_ipmilan" auth="none" ipaddr="node1-ipmi"
>             login="root" name="node1-ipmi" passwd="xxx"/>
>         <fencedevice agent="fence_ipmilan" auth="none" ipaddr="node2-ipmi"
>             login="root" name="node2-ipmi" passwd="xxx"/>
>     </fencedevices>
>     <rm>
>         <failoverdomains/>
>         <resources/>
>     </rm>
> </cluster>
>
> Could someone tell me whether this is expected behaviour? Shouldn't
> rgmanager start up automatically in this case?
>
> Thank you all,
>
> Celso.

--
*Celso Kopp Webber*

celso@xxxxxxxxxxxxxxxx <mailto:celso@xxxxxxxxxxxxxxxx>

*Webbertek - Opensource Knowledge*
(41) 8813-1919 - mobile
(41) 4063-8448, extension 102 - landline

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
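For reference, the "ping" heuristic mentioned in the notice above would take the place of the /bin/true heuristic inside the quorumd block. This is only a sketch under the assumption that an always-on tiebreaker address exists; 192.168.1.254 is a placeholder, not an address from the original setup:

<quorumd interval="1" label="clu_mrp" min_score="1" tko="30" votes="1">
    <!-- ping an always-on address once, with a 1-second deadline, every 2 seconds -->
    <heuristic interval="2" program="ping -c1 -w1 192.168.1.254" score="1"/>
</quorumd>

Keeping the ping deadline (-w1) shorter than the heuristic interval prevents checks from piling up behind an unreachable address.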
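Similarly, the simple start/stop/status test service based on the crond init script would be expressed as a script resource plus a service inside the <rm> section, roughly as follows; the resource and service names are illustrative, not taken from the original cluster.conf:

<rm>
    <failoverdomains/>
    <resources>
        <!-- LSB init script that rgmanager drives with start/stop/status -->
        <script file="/etc/init.d/crond" name="crond-script"/>
    </resources>
    <!-- autostart="1" so rgmanager should bring the service up on its own once it runs with quorum -->
    <service autostart="1" name="test-crond">
        <script ref="crond-script"/>
    </service>
</rm>

Any LSB init script can stand in for crond here; the point of the test is only to see whether rgmanager starts the service automatically once quorum is regained.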