On Wed, 2004-12-22 at 01:08, Patrick Caulfield wrote:
> On Tue, Dec 21, 2004 at 10:34:41AM -0800, Daniel McNeil wrote:
> > Another test run that managed 52 hours before hitting a cman bug:
> >
> > cl032:
> > Dec 18 19:56:05 cl032 kernel: CMAN: bad generation number 10 in HELLO message, expected 9
> > Dec 18 19:56:06 cl032 kernel: CMAN: killed by STARTTRANS or NOMINATE
> > Dec 18 19:56:06 cl032 kernel: CMAN: we are leaving the cluster.
> > Dec 18 19:56:07 cl032 kernel: dlm: closing connection to node 2
> > Dec 18 19:56:07 cl032 kernel: dlm: closing connection to node 3
> > Dec 18 19:56:07 cl032 kernel: SM: 00000001 sm_stop: SG still joined
> > Dec 18 19:56:07 cl032 kernel: SM: 0100081e sm_stop: SG still joined
> > Dec 18 19:56:07 cl032 kernel: SM: 0200081f sm_stop: SG still joined
> >
> > cl031:
> > Dec 18 19:56:02 cl031 kernel: CMAN: node cl032a is not responding - removing from the cluster
> > Dec 18 19:56:06 cl031 kernel: CMAN: Being told to leave the cluster by node 1
> > Dec 18 19:56:06 cl031 kernel: CMAN: we are leaving the cluster.
> > Dec 18 19:56:07 cl031 kernel: dlm: closing connection to node 2
> > Dec 18 19:56:07 cl031 kernel: dlm: closing connection to node 3
> > Dec 18 19:56:07 cl031 kernel: SM: 00000001 sm_stop: SG still joined
> > Dec 18 19:56:07 cl031 kernel: SM: 0100081e sm_stop: SG still joined
> >
> > cl030:
> > Dec 18 19:56:05 cl030 kernel: CMAN: bad generation number 10 in HELLO message, expected 9
> > Dec 18 19:56:06 cl030 kernel: CMAN: Node cl031a is leaving the cluster, Shutdown
> > Dec 18 19:56:06 cl030 kernel: CMAN: quorum lost, blocking activity
> >
> > Looks like cl032 had the most problems. It hit a bunch of asserts:
> > $ grep BUG cl032.messages
> > Dec 18 19:56:48 cl032 kernel: kernel BUG at /Views/redhat-cluster/cluster/gfs-kernel/src/dlm/lock.c:400!
> > Dec 18 19:56:48 cl032 kernel: kernel BUG at /Views/redhat-cluster/cluster/gfs-kernel/src/dlm/lock.c:342!
> > Dec 18 20:01:06 cl032 kernel: kernel BUG at /Views/redhat-cluster/cluster/gfs-kernel/src/dlm/lock.c:342!
> > Dec 18 20:01:07 cl032 kernel: kernel BUG at /Views/redhat-cluster/cluster/gfs-kernel/src/dlm/lock.c:342!
> >
> > Questions:
> > Any ideas on what is going on here?
> >
> > How does one know what the current "generation" number is?
>
> You don't, cman does. It's the current "generation" of the cluster, which is
> incremented for each state transition. Are you taking nodes up and down during
> these tests?

The nodes are staying up. I am mounting and unmounting a lot.

Any reason not to add the generation number to /proc/cluster/status?
It would help debugging, at least. (There is a rough sketch of what I
mean below.)

> It does seem that cman is susceptible to heavy network traffic, despite my best
> efforts to increase its priority. I'm going to check in a change that will allow
> you to change the retry count, but it's a bit of a hack really.
>
> > When CMAN gets an error, it is not shutting down all the cluster
> > software correctly. GFS is still mounted and anything accessing
> > it is hung. For debugging it is ok for the machine to stay up
> > so we can figure out what is going on, but for a real operational
> > cluster this is very bad. In normal operation, if the cluster
> > hits a bug like this, shouldn't it just reboot, so hopefully
> > all the other nodes can recover?
>
> If you have power-switch fencing and the remainder of the cluster is quorate, then
> surely the failed node should be power-cycled?

I currently have it set up for manual fencing and I have yet to see that
work correctly. This was a 3-node cluster.
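On the /proc/cluster/status idea, the sort of thing I have in mind is
below. This is an untested sketch and I am guessing at cman's internals:
"cluster_generation" is a made-up name for wherever cman keeps the
current generation, and the function just stands in for whatever
2.6-style read_proc routine already produces /proc/cluster/status. The
actual change would only be the one extra sprintf():

#include <linux/kernel.h>
#include <linux/proc_fs.h>

extern int cluster_generation;  /* hypothetical: cman's internal counter */

static int cluster_status_read(char *page, char **start, off_t off,
                               int count, int *eof, void *data)
{
        int len = 0;

        /* ... the fields status already prints would go here ... */

        /* the addition: expose the current generation for debugging */
        len += sprintf(page + len, "Generation: %d\n", cluster_generation);

        *eof = 1;
        return len;
}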
To recap the failure: cl032 got the bad generation number and cman was
"killed by STARTTRANS or NOMINATE". cl030 also got a bad generation
number (but stayed up), and cl031 left the cluster because it says cl030
told it to. So that leaves me with one node up, without quorum.

I did not see any fencing messages. Should the surviving node (cl030)
have attempted fencing, or does it only do that if it has quorum?

I do not seem to be able to keep cman up for much past 2 days if I have
my tests running. (It stays up with no load, of course.) My tests are
not that complicated currently, either: just tar, du and rm in separate
directories from 1, 2 and then 3 nodes simultaneously. Who knows what
will happen if I add tests that cause lots of dlm lock conflicts.

How long does cman stay up in your testing?

Daniel
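P.S. For what it is worth, here is my mental model of the generation
check that produces the "bad generation number" message. This is a toy
userspace illustration only, not cman's actual code; it just assumes,
per your description, that each node carries a view of the cluster
generation, bumps it on every state transition, and rejects a HELLO
whose generation does not match:

#include <stdio.h>

struct node {
        const char *name;
        int generation;     /* this node's view of the cluster generation */
};

/* Any state transition (join, leave, new membership) bumps the count. */
static void state_transition(struct node *n)
{
        n->generation++;
}

/* On HELLO receipt the generations must agree, or the node complains. */
static void recv_hello(struct node *n, int hello_gen)
{
        if (hello_gen != n->generation)
                printf("%s: bad generation number %d in HELLO message, "
                       "expected %d\n", n->name, hello_gen, n->generation);
}

int main(void)
{
        struct node cl030 = { "cl030", 9 };
        struct node cl032 = { "cl032", 9 };

        /* cl032 sees a transition that cl030 somehow missed ... */
        state_transition(&cl032);

        /* ... so cl030 rejects cl032's next HELLO, as in the logs. */
        recv_hello(&cl030, cl032.generation);
        return 0;
}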