-----Original Message-----
From: David Teigland [mailto:teigland@xxxxxxxxxx]
Sent: 28 September 2007 19:03
To: Borgström Jonas
Cc: linux clustering
Subject: Re: Possible cman init script race condition

> On Fri, Sep 28, 2007 at 11:45:47AM -0500, David Teigland wrote:
> > On Fri, Sep 28, 2007 at 09:58:18AM -0500, David Teigland wrote:
> > > On Fri, Sep 28, 2007 at 04:48:18PM +0200, Borgström Jonas wrote:
> > > > I must have misunderstood you or something, but didn't I already include
> > > > that info in the message I sent a few days ago?
> > > >
> > > > http://permalink.gmane.org/gmane.linux.redhat.cluster/9999
> > > >
> > > > (The archive inlines the "group_tool dump" output making it a bit hard
> > > > to read, but hopefully your email client shows them as attachments).
> > >
> > > I missed that, I'll take a look, thanks.
> >
> > You've hit a known bug that's been fixed:
> > https://bugzilla.redhat.com/show_bug.cgi?id=251966
> >
> > We may have to move up the release of that fix since people are seeing the
> > problem. Be careful when reading that bz because there's a lot of
> > incorrect diagnosis that was recorded before we figured out what the real
> > bug was. Here's the problem, it's very complex:
> >
> > 1. when the nodes start up, they each form a 1-node openais cluster
> > independent of the other
> >
> > [This shouldn't really happen, but in reality we can't prevent it
> > 100% of the time. We try to make it rare, and then deal with it
> > sensibly on the rare occasion when it does happen. You've hit
> > the "rare" occasion -- if you're actually seeing this regularly
> > then we probably need to fix or adjust something at the openais
> > level to make it less common.]
>
> I'd try to use some sleeps here, before running fence_tool join on either
> node, as a work-around. We're trying to get both nodes merged together
> before they do anything else.

Strangely enough, adding a "sleep 30" line directly below the
echo "Starting cluster: " line seems to make this problem go away every
time. Note that this is before any daemon is started. It works, but I'm
not sure why.

>
> Also, how often are you seeing the nodes not merge together right away?
> If it's frequent, then we need to fix that.

This happens every time on this hardware (2 Dell 1955 blades). I never
got fenced to work correctly until I figured out that I needed to add a
"sleep 30" to the cman init script. So I'm obviously very interested in
seeing this fixed in a 5.0 errata or in 5.1 at the very latest. I can't
really wait until 5.2 is out...

And as I mentioned before, the really scary part is that I am able to
mount gfs filesystems during this kind of cluster split. And if one node
is shot, the other node replays the gfs journal and makes the filesystem
writable again without first fencing the shot/missing node.

Here is some "group_tool -v" output with a mounted filesystem:

[root@prod-db2 pgsql]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010002 JOIN_START_WAIT 1 100020001 1
[1 2]
dlm              1     clvmd    00020001 JOIN_START_WAIT 1 100020001 1
[1 2]
dlm              1     pg_fs    00060001 JOIN_START_WAIT 1 100020001 1
[1 2]
gfs              2     pg_fs    00050001 JOIN_START_WAIT 1 100020001 1
[1 2]

Regards,
Jonas

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
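
For anyone wanting to check for the same condition, here is a rough sketch
of a filter over "group_tool -v" output that lists groups still sitting in
a transitional state such as JOIN_START_WAIT. The function name
stuck_groups is made up for illustration. It assumes the column layout
shown in the output above (type, level, name, id, state) and that a settled
group reports the state "none"; neither assumption has been checked against
other versions of the tool.

```shell
#!/bin/sh
# Hypothetical helper, not part of cman: reads `group_tool -v` output on
# stdin and prints "type name state" for every group whose state column
# is not "none" (assumed here to be the settled state).
stuck_groups() {
    awk '
        $1 == "type" { next }    # skip the column header row
        /^\[/        { next }    # skip member lists such as "[1 2]"
        NF >= 5 && $5 != "none" {
            printf "%s %s %s\n", $1, $3, $5
        }
    '
}
```

It would be used as "group_tool -v | stuck_groups". An empty result would
suggest that no group is waiting on a join, but it says nothing about
whether the nodes have actually merged into one cluster.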