On Thu, 2007-10-04 at 13:28 -0300, Celso K. Webber wrote:
> ** What is happening:
> If I boot node1 with node2 powered off, it stops for 5 minutes during the
> start of ccsd, and after that it regains quorum, qdiskd starts successfully,
> but fenced keeps trying to start for 2 minutes and then it gives up with
> a "failed" message.

Hmmm...

> Oct 4 11:56:45 hercules01 lock_gulmd: no <gulm> section detected
> in /etc/cluster/cluster.conf succeeded chkconfig --del lock_gulmd
> Oct 4 11:56:45 hercules01 qdiskd: Starting the Quorum Disk Daemon: succeeded
> Oct 4 11:57:02 hercules01 kernel: CMAN: quorum regained, resuming activity
> Oct 4 11:57:02 hercules01 ccsd[9144]: Cluster is quorate. Allowing
> connections.
> Oct 4 11:58:45 hercules01 fenced: startup failed
> ^^^^^^^^
> exactly 2 minutes after the qdiskd message above, I've noticed that
> fenced is started in the init scripts with "fence_tool -t 120 join -w"
> Oct 4 11:59:38 hercules01 rgmanager: clurgmgrd startup failed
> ^^^^^^^^
> after other service boot up ok, rgmanager fails to boot, probably
> because fenced failed to start

> Oct 4 11:56:45 hercules01 qdiskd[9292]: <info> Quorum Daemon Initializing
> Oct 4 11:56:55 hercules01 qdiskd[9292]: <info> Initial score 1/1
> Oct 4 11:56:55 hercules01 qdiskd[9292]: <info> Initialization complete
> Oct 4 11:56:55 hercules01 qdiskd[9292]: <notice> Score sufficient for
> master operation (1/1; required=1); upgrading
> Oct 4 11:57:01 hercules01 qdiskd[9292]: <info> Assuming master role

[ at this point, the cluster is quorate ]

> ** Installed cluster package versions (same on both nodes):
> cman-kernel-smp-2.6.9-45.15.x86_64.rpm

From cvs logs for cnxman.c (I know, too much information... but I can't
find a bugzilla on it):

RHEL45: 1.42.2.28.0.2
cman-kernel_2_6_9_48: 1.42.2.27
...
cman-kernel_2_6_9_45: 1.42.2.25

revision 1.42.2.27
date: 2007/01/19 10:23:14; author: pcaulfield; state: Exp; lines: +2 -0
Tell SM when the quorum device comes or goes.
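
One way to read the tag list above is as a small check. This is a minimal sketch, assuming a plain same-branch comparison of dotted CVS revision numbers is enough. It tests whether a tagged revision of cnxman.c is at or past 1.42.2.27, the revision that carries the "Tell SM" change. The helper names parse_rev and includes_fix are made up for illustration, and the RHEL45 entry is left out because 1.42.2.28.0.2 appears to be a branch tag rather than a file revision.

```python
def parse_rev(rev):
    """Split a dotted CVS revision such as '1.42.2.27' into a tuple of ints."""
    return tuple(int(part) for part in rev.split("."))


def includes_fix(rev, fix_rev="1.42.2.27"):
    """Return True if rev is on the same branch as fix_rev and not older.

    Assumption: both revisions sit on one branch (same length, same prefix);
    anything else is reported as False rather than guessed at.
    """
    r, f = parse_rev(rev), parse_rev(fix_rev)
    return len(r) == len(f) and r[:-1] == f[:-1] and r[-1] >= f[-1]


# Tags and revisions exactly as listed in the cvs log above.
tags = {
    "cman-kernel_2_6_9_48": "1.42.2.27",
    "cman-kernel_2_6_9_45": "1.42.2.25",
}

for tag, rev in tags.items():
    status = "has the fix" if includes_fix(rev) else "is missing the fix"
    print(f"{tag} ({rev}) {status}")
```

With the tags as listed, this reports the 2_6_9_45 build as predating the change and the 2_6_9_48 build as including it.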

There's a bug in the one you have which is fixed in 4.5. Basically, the
SM component of CMAN in the kernel wasn't getting notified when qdisk
votes were causing a quorum transition. This caused problems with fenced
and the DLM (and thus, rgmanager - since rgmanager uses the DLM).

It's fixed in cman-kernel from 4.5 and later. The patch is here:

http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/cman-kernel/src/Attic/cnxman.c.diff?r1=1.42.2.26&r2=1.42.2.27&cvsroot=cluster&hideattic=0&only_with_tag=RHEL4

-- Lon