Another test run that managed 52 hours before hitting a cman bug:

cl032:
Dec 18 19:56:05 cl032 kernel: CMAN: bad generation number 10 in HELLO message, expected 9
Dec 18 19:56:06 cl032 kernel: CMAN: killed by STARTTRANS or NOMINATE
Dec 18 19:56:06 cl032 kernel: CMAN: we are leaving the cluster.
Dec 18 19:56:07 cl032 kernel: dlm: closing connection to node 2
Dec 18 19:56:07 cl032 kernel: dlm: closing connection to node 3
Dec 18 19:56:07 cl032 kernel: SM: 00000001 sm_stop: SG still joined
Dec 18 19:56:07 cl032 kernel: SM: 0100081e sm_stop: SG still joined
Dec 18 19:56:07 cl032 kernel: SM: 0200081f sm_stop: SG still joined

cl031:
Dec 18 19:56:02 cl031 kernel: CMAN: node cl032a is not responding - removing from the cluster
Dec 18 19:56:06 cl031 kernel: CMAN: Being told to leave the cluster by node 1
Dec 18 19:56:06 cl031 kernel: CMAN: we are leaving the cluster.
Dec 18 19:56:07 cl031 kernel: dlm: closing connection to node 2
Dec 18 19:56:07 cl031 kernel: dlm: closing connection to node 3
Dec 18 19:56:07 cl031 kernel: SM: 00000001 sm_stop: SG still joined
Dec 18 19:56:07 cl031 kernel: SM: 0100081e sm_stop: SG still joined

cl030:
Dec 18 19:56:05 cl030 kernel: CMAN: bad generation number 10 in HELLO message, expected 9
Dec 18 19:56:06 cl030 kernel: CMAN: Node cl031a is leaving the cluster, Shutdown
Dec 18 19:56:06 cl030 kernel: CMAN: quorum lost, blocking activity

It looks like cl032 had the most problems. It hit a series of asserts:

$ grep BUG cl032.messages
Dec 18 19:56:48 cl032 kernel: kernel BUG at /Views/redhat-cluster/cluster/gfs-kernel/src/dlm/lock.c:400!
Dec 18 19:56:48 cl032 kernel: kernel BUG at /Views/redhat-cluster/cluster/gfs-kernel/src/dlm/lock.c:342!
Dec 18 20:01:06 cl032 kernel: kernel BUG at /Views/redhat-cluster/cluster/gfs-kernel/src/dlm/lock.c:342!
Dec 18 20:01:07 cl032 kernel: kernel BUG at /Views/redhat-cluster/cluster/gfs-kernel/src/dlm/lock.c:342!

Questions:
- Any ideas on what is going on here?
- How does one know what the current "generation" number is?
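For what it's worth, the generation numbers each node complained about can be scraped out of the saved syslogs. A minimal sketch (the filenames and the exact wording of the CMAN message are assumptions on my part, taken from the excerpts above):

```shell
# Hedged sketch: condense a "bad generation number" syslog line.
# In practice one would feed it the saved logs, e.g.:
#   grep -h 'bad generation number' cl03*.messages | sed ...
line='Dec 18 19:56:05 cl032 kernel: CMAN: bad generation number 10 in HELLO message, expected 9'
echo "$line" |
  sed 's/.*bad generation number \([0-9]*\).*expected \([0-9]*\).*/saw generation \1, expected \2/'
# prints: saw generation 10, expected 9
```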
When CMAN gets an error, it does not shut down all the cluster software correctly: GFS is still mounted, and anything accessing it hangs. For debugging it is fine for the machine to stay up so we can figure out what is going on, but for a real operational cluster this is very bad.

More questions:
- In normal operation, if the cluster hits a bug like this, shouldn't the node just reboot so that the other nodes can hopefully recover?
- Is there more debugging that can be turned on, so we can figure out what is going on?

The full info is available here: http://developer.osdl.org/daniel/GFS/cman.18dec2004/

Thanks,
Daniel
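P.S. As a stopgap for the reboot question above (this is a generic kernel knob, not anything cman-specific, and it is only my assumption that it helps here): the kernel can be told to turn an oops/BUG into a panic and then reboot, e.g. via /etc/sysctl.conf:

```
# Assumed workaround, not cman's own recovery mechanism:
kernel.panic_on_oops = 1   # treat a kernel BUG/oops as a panic
kernel.panic = 30          # reboot 30 seconds after a panic
```

That would at least knock a wedged node out of the cluster instead of leaving GFS mounted and hung, though it obviously does nothing for the underlying bug.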