Does anyone have an idea why a "sleep 30" is needed for fenced to be
able to join the fence group properly? Even though this workaround
appears to work, it would be nice to have a more solid solution, since
for now I will have to remember to re-apply the patch to the init
script every time it is updated.

Regards,
Jonas

-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx
[mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Borgström Jonas
Sent: 24 September 2007 19:29
To: David Teigland
Cc: linux clustering
Subject: RE: Possible cman init script race condition

From: David Teigland [mailto:teigland@xxxxxxxxxx]
Sent: 24 September 2007 18:10
To: Borgström Jonas
Cc: linux clustering
Subject: Re: Possible cman init script race condition

*snip*

> > 1190645596 node "prod-db1" not a cman member, cn 1
> > 1190645597 node "prod-db1" not a cman member, cn 1
> > 1190645598 node "prod-db1" not a cman member, cn 1
> > 1190645599 node "prod-db1" not a cman member, cn 1
> > 1190645600 reduce victim prod-db1
> > 1190645600 delay of 16s leaves 0 victims
> > 1190645600 finish default 1
> > 1190645600 stop default
> > 1190645600 start default 2 members 1 2
> > 1190645600 do_recovery stop 1 start 2 finish 1
>
> I think something has gone wrong here, either in groupd or fenced,
> that's preventing this start from finishing (we don't get a 'finish
> default 2' which we expect). A 'group_tool -v' here should show the
> state of the fence group still in transition. Could you run that,
> plus a 'group_tool dump' at this point, in addition to the 'dump
> fence' you have. And please run those commands on both nodes.

Hi David, thanks for your fast response. Here's the output you
requested:

[root@prod-db1 ~]# group_tool -v
type   level name    id       state           node id        local_done
fence  0     default 00010001 JOIN_START_WAIT 2    200020001  1
[1 2]

[root@prod-db2 ~]# group_tool -v
type   level name    id       state           node id        local_done
fence  0     default 00010002 JOIN_START_WAIT 1    100020001  1
[1 2]

I attached the "group_tool dump" output as files, since they are quite
long.

> > 1190645954 client 3: dump    <--- Before killing prod-db1
> > 1190645985 stop default
> > 1190645985 start default 3 members 2
> > 1190645985 do_recovery stop 2 start 3 finish 1
> > 1190645985 finish default 3
> > 1190646008 client 3: dump    <--- After killing prod-db1
>
> Node 1 isn't fenced here because it never completed joining the fence
> group above.
>
> > The scary part is that, as far as I can tell, fenced is the only
> > cman daemon affected by this. So your cluster appears to work fine,
> > but when a node needs to be fenced the operation isn't carried out,
> > and that can cause gfs filesystem corruption.
>
> You shouldn't be able to mount gfs on the node where joining the
> fence group is stuck.

My current setup is very stripped down, so I haven't configured gfs.
But on my original setup, where I first noticed this issue, I had no
problem mounting gfs filesystems, and after a simulated network failure
I could still write to the filesystem from both nodes since neither
node was fenced, which quickly corrupted the filesystem.

Regards,
Jonas
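P.S. For the archives, the workaround boils down to something like the
following in the start path of /etc/init.d/cman, between starting cman
and joining the fence domain (a rough sketch only; the function names
and the exact insertion point are assumptions and will depend on the
version of the init script you have):

    start() {
        # ...
        start_cman || return 1

        # Workaround: give cman/openais time to settle before fenced
        # tries to join the fence group; without this the join can get
        # stuck in JOIN_START_WAIT as shown above.
        sleep 30

        start_fence || return 1
        # ...
    }

A slightly less blind variant might be to wait explicitly for cman
membership and quorum instead of sleeping a fixed 30 seconds, along
these lines (again an untested sketch, not a proper fix for the
underlying race):

    # Wait up to 60 seconds for this node to become a cman member and
    # for the cluster to become quorate, then join the fence domain.
    cman_tool -t 60 -q wait || exit 1
    fence_tool join

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster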