On Mon, Sep 24, 2007 at 05:33:30PM +0200, Borgström Jonas wrote:
> Hi,
>
> I think there might be some race condition in the cman init script
> causing fenced to stop working correctly.
> I'm able to reliably reproduce the problem using a minimal
> cluster.conf with two nodes and fence_manual fencing.
>
> Steps to reproduce:
> 1. Install cluster.conf on two nodes, enable the "cman" service and
>    reboot both nodes.
> 2. The cluster boots successfully and clustat lists both nodes as online.
> 3. Power-cycle node prod-db1.
> 4. On prod-db2 openais detects the missing node, but fenced decides to do
>    nothing about it and logs nothing to /var/log/messages (though the
>    fenced process is still running).
>
> Output from "group_tool dump fence" after the test:
>
> [root@prod-db2 ~]# group_tool dump fence
> 1190645583 our_nodeid 2 our_name prod-db2
> 1190645583 listen 4 member 5 groupd 7
> 1190645584 client 3: join default
> 1190645584 delay post_join 120s post_fail 0s
> 1190645584 added 2 nodes from ccs
> 1190645584 setid default 65538
> 1190645584 start default 1 members 2
> 1190645584 do_recovery stop 0 start 1 finish 0
> 1190645584 node "prod-db1" not a cman member, cn 1
> 1190645584 add first victim prod-db1
> 1190645585 node "prod-db1" not a cman member, cn 1
> 1190645586 node "prod-db1" not a cman member, cn 1
> 1190645587 node "prod-db1" not a cman member, cn 1
> 1190645588 node "prod-db1" not a cman member, cn 1
> 1190645589 node "prod-db1" not a cman member, cn 1
> 1190645590 node "prod-db1" not a cman member, cn 1
> 1190645591 node "prod-db1" not a cman member, cn 1
> 1190645592 node "prod-db1" not a cman member, cn 1
> 1190645593 node "prod-db1" not a cman member, cn 1
> 1190645594 node "prod-db1" not a cman member, cn 1
> 1190645595 node "prod-db1" not a cman member, cn 1
> 1190645596 node "prod-db1" not a cman member, cn 1
> 1190645597 node "prod-db1" not a cman member, cn 1
> 1190645598 node "prod-db1" not a cman member, cn 1
> 1190645599 node "prod-db1" not a cman member, cn 1
> 1190645600 reduce victim prod-db1
> 1190645600 delay of 16s leaves 0 victims
> 1190645600 finish default 1
> 1190645600 stop default
> 1190645600 start default 2 members 1 2
> 1190645600 do_recovery stop 1 start 2 finish 1

I think something has gone wrong here, either in groupd or fenced, that's
preventing this start from finishing (we don't get a "finish default 2",
which we expect). A "group_tool -v" here should show the state of the
fence group still in transition. Could you run that, plus a "group_tool
dump" at this point, in addition to the "dump fence" you have. And please
run those commands on both nodes.

> 1190645954 client 3: dump  <--- Before killing prod-db1
> 1190645985 stop default
> 1190645985 start default 3 members 2
> 1190645985 do_recovery stop 2 start 3 finish 1
> 1190645985 finish default 3
> 1190646008 client 3: dump  <--- After killing prod-db1

Node 1 isn't fenced here because it never completed joining the fence
group above.

> The scary part is that, as far as I can tell, fenced is the only cman
> daemon being affected by this. So your cluster appears to work fine,
> but when a node needs to be fenced the operation isn't carried out,
> and that can cause gfs filesystem corruption.

You shouldn't be able to mount gfs on the node where joining the fence
group is stuck.

Thanks for the informative report.

Dave

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
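As a convenience, the diagnostics requested above (group_tool -v, group_tool dump, and group_tool dump fence, on both nodes) could be collected with a small script like this. It's only a sketch: it assumes cman/groupd are running, group_tool is in PATH, and you run it as root on each node; the output filenames are arbitrary.

```shell
#!/bin/sh
# Collect the fence-group state requested above; run this on BOTH nodes.
host=$(hostname)

# Show current group membership and whether any group is still in transition.
group_tool -v         > "/tmp/group_tool-v.$host.txt"

# Full groupd debug dump.
group_tool dump       > "/tmp/group_tool-dump.$host.txt"

# fenced-specific debug dump (same command used earlier in the report).
group_tool dump fence > "/tmp/group_tool-dump-fence.$host.txt"

echo "Diagnostics written to /tmp/group_tool-*.$host.txt"
```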