On Fri, Sep 28, 2007 at 09:58:18AM -0500, David Teigland wrote:
> On Fri, Sep 28, 2007 at 04:48:18PM +0200, Borgström Jonas wrote:
> > I must have misunderstood you or something, but didn't I already include
> > that info in the message I sent a few days ago?
> >
> > http://permalink.gmane.org/gmane.linux.redhat.cluster/9999
> >
> > (The archive inlines the "group_tool dump" output making it a bit hard
> > to read, but hopefully your email client shows them as attachments).
>
> I missed that, I'll take a look, thanks.

You've hit a known bug that's been fixed:
https://bugzilla.redhat.com/show_bug.cgi?id=251966

We may have to move up the release of that fix since people are seeing the
problem. Be careful when reading that bz because there's a lot of incorrect
diagnosis that was recorded before we figured out what the real bug was.

Here's the problem, it's very complex:

1. when the nodes start up, they each form a 1-node openais cluster
   independent of the other

   [This shouldn't really happen, but in reality we can't prevent it 100%
   of the time. We try to make it rare, and then deal with it sensibly on
   the rare occasion when it does happen. You've hit the "rare" occasion --
   if you're actually seeing this regularly then we probably need to fix or
   adjust something at the openais level to make it less common.]

2. fence_tool join is run on each node which creates group state in both
   clusters

3. The two clusters then merge together. We could handle this merging *if*
   there had been no group activity yet (in this case from fenced). But, in
   this case, divergent group state exists in the two clusters that we
   can't combine. Cman (above openais) should recognize this [*] and
   continue to treat the nodes separately, even though openais has merged
   them together.

[*] In RHEL5.0, cman/groupd are *not* smart enough to recognize this. The
fix in bz 251966 makes cman/groupd recognize this condition by introducing
a "dirty flag".
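
To make the dirty-flag idea concrete, here is a toy Python model of steps
1-3. It is only a sketch under assumptions, not the code from the bz: the
names Partition, join_group, can_combine and on_merge are invented for
illustration and are not cman, groupd or openais APIs. The rule that a merge
is refused only when *both* sides already hold group state is a simplified
reading of the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Toy model of one cluster partition (hypothetical, not a cman type)."""
    nodes: set
    groups: set = field(default_factory=set)

    @property
    def dirty(self) -> bool:
        # Assumption: "dirty" simply means some group state exists here.
        return bool(self.groups)

    def join_group(self, name: str) -> None:
        # Stands in for "fence_tool join": it creates group state.
        self.groups.add(name)


def can_combine(a: Partition, b: Partition) -> bool:
    # Divergent state exists only when both sides have group state.
    return not (a.dirty and b.dirty)


def on_merge(a: Partition, b: Partition) -> list:
    """Return the partitions as seen above openais after a merge."""
    if can_combine(a, b):
        return [Partition(a.nodes | b.nodes, a.groups | b.groups)]
    # Both dirty: keep treating the nodes separately.
    return [a, b]


if __name__ == "__main__":
    # Step 1: each node forms its own 1-node cluster.
    n1, n2 = Partition({"node1"}), Partition({"node2"})
    # Step 2: the fence domain is joined on each node.
    n1.join_group("fence:default")
    n2.join_group("fence:default")
    # Step 3: the clusters merge, but both sides are dirty.
    print(len(on_merge(n1, n2)), "partition(s) after merge")
```

In this model a clean node merging with a dirty partition is still allowed,
which corresponds to an ordinary join of an existing cluster.
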
What you observe is groupd trying to merge the divergent state, getting
confused and stuck. After the bug is fixed, what you should observe is that
the two nodes stay separate (in cman) and try to fence each other. One will
win the fencing race and reboot the other. When the rebooted node returns,
it should properly join the existing cluster.

Dave

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster