Re: [Linux-cluster] cman bad generation number

Daniel McNeil <daniel@xxxxxxxx> · Tue, 04 Jan 2005 14:46:17 -0800

On Tue, 2005-01-04 at 03:29, Patrick Caulfield wrote:
> On Wed, Dec 22, 2004 at 09:33:39AM -0800, Daniel McNeil wrote:
> > > > 
> > > > How does one know what the current "generation" number is?
> > > 
> > > You don't, cman does. it's the current "generation" of the cluster which is
> > > incremented for each state transition. Are you taking nodes up and down during
> > > these tests??
> > 
> > The nodes are staying up.  I am mounting and umounting a lot.
> > Any reason to not add generation to /proc/cluster/status?  (it would help
> > debugging at least).
> 
> No reason at all not to, apart from I really don't think it will tell anyone
> anything useful. The cause of the problem is that the CMAN heartbeat messages
> are being lost on the network flooded by lock traffic. generation mismatches are
> just a symptom of that.
>  

One thing I do not understand: I am leaving the nodes in the cluster
and only mounting and umounting, so the generation number should not
be changing.

I think you are saying that the lock traffic is so high that heartbeats
are being lost, so the node being kicked out is seeing the new heartbeats
from the other nodes and doesn't know they are not receiving its
heartbeat messages.  This node must be seeing the other nodes'
heartbeat messages, or it would have started a membership transition
without the other nodes.  Do I have this right?
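
To make the scenario above concrete, here is a toy model of it. This is
only a sketch of my reading of the problem, not cman code: the Node class,
its methods, and the message text are all hypothetical, and I am assuming
each heartbeat carries the sender's generation number.

```python
# Toy model only, not cman code. All names here are hypothetical.
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    generation: int = 1

    def heartbeat(self):
        # Assumption: a heartbeat carries the sender's cluster generation.
        return (self.name, self.generation)

    def receive(self, hb):
        sender, gen = hb
        # Return a description of a mismatch, or None if generations agree.
        if gen != self.generation:
            return (f"{self.name}: bad generation from {sender} "
                    f"(got {gen}, have {self.generation})")
        return None


a, b, c = Node("a"), Node("b"), Node("c")

# c's outgoing heartbeats are lost, so a and b drop c and move to a
# new generation without c taking part.
for survivor in (a, b):
    survivor.generation += 1

# From its own point of view c never left, and it still hears a and b.
messages = [c.receive(n.heartbeat()) for n in (a, b)]
```

In this model a and b still agree with each other, while c reports a
mismatch for every heartbeat it receives, even though no node was ever
taken down.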

Shouldn't the heartbeat messages have higher priority
than the lock traffic messages?
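
As a rough illustration of what I mean by priority, here is a minimal
sketch of a send queue with two message classes. It assumes heartbeats
and lock messages pass through a single outgoing queue, which may not
match how cman and the lock manager actually send; SendQueue, HEARTBEAT,
and LOCK are made-up names.

```python
import heapq
import itertools

HEARTBEAT, LOCK = 0, 1  # hypothetical classes; lower value is sent first


class SendQueue:
    """Outgoing queue that always hands back heartbeats before lock messages."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order per class

    def put(self, priority, payload):
        heapq.heappush(self._heap, (priority, next(self._seq), payload))

    def get(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


q = SendQueue()
for i in range(3):
    q.put(LOCK, f"lock-{i}")
q.put(HEARTBEAT, "hb-1")  # queued behind three lock messages
q.put(LOCK, "lock-3")

order = [q.get() for _ in range(len(q))]
```

Here the heartbeat leaves first even though it was queued fourth, and the
lock messages keep their original order. This only covers the sending
side; it does nothing about packets dropped on the wire.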

Shouldn't there be a way of throttling back the lock traffic and seeing
if the heartbeat connection can be re-established before starting a
membership transition?
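
Something along these lines is what I have in mind. It is a sketch under
assumptions, not a description of what cman does today: PeerMonitor, the
state names, and both limits are hypothetical, and the actual throttling
of lock traffic is left out.

```python
class PeerMonitor:
    """Hypothetical per-peer state machine: OK -> THROTTLED -> DEAD."""

    def __init__(self, miss_limit=3, grace_limit=3):
        self.miss_limit = miss_limit    # missed beats before throttling locks
        self.grace_limit = grace_limit  # further missed beats before giving up
        self.missed = 0
        self.state = "OK"

    def tick(self, heartbeat_seen):
        """Advance one heartbeat interval and return the new state."""
        if heartbeat_seen:
            # Any heartbeat resets the count and lifts the throttle.
            self.missed = 0
            self.state = "OK"
        else:
            self.missed += 1
            if self.missed >= self.miss_limit + self.grace_limit:
                self.state = "DEAD"       # only now start a transition
            elif self.missed >= self.miss_limit:
                self.state = "THROTTLED"  # back off lock traffic, keep listening
        return self.state


def run(pattern, **kw):
    monitor = PeerMonitor(**kw)
    return [monitor.tick(seen) for seen in pattern]


# Heartbeat comes back once lock traffic is throttled: no transition.
recovered = run([False, False, False, True])
# Heartbeat never comes back: transition after the grace period.
lost = run([False] * 6)
```

The point of the extra THROTTLED state is that a peer which is only being
drowned out by lock traffic gets a chance to be heard again, while a peer
that is really gone is still removed after a bounded delay.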

Daniel