Another test run that managed 52 hours before hitting a cman bug:

cl032:
Dec 18 19:56:05 cl032 kernel: CMAN: bad generation number 10 in HELLO message, expected 9
Dec 18 19:56:06 cl032 kernel: CMAN: killed by STARTTRANS or NOMINATE
Dec 18 19:56:06 cl032 kernel: CMAN: we are leaving the cluster.
Dec 18 19:56:07 cl032 kernel: dlm: closing connection to node 2
Dec 18 19:56:07 cl032 kernel: dlm: closing connection to node 3
Dec 18 19:56:07 cl032 kernel: SM: 00000001 sm_stop: SG still joined
Dec 18 19:56:07 cl032 kernel: SM: 0100081e sm_stop: SG still joined
Dec 18 19:56:07 cl032 kernel: SM: 0200081f sm_stop: SG still joined

cl031:
Dec 18 19:56:02 cl031 kernel: CMAN: node cl032a is not responding - removing from the cluster
Dec 18 19:56:06 cl031 kernel: CMAN: Being told to leave the cluster by node 1
Dec 18 19:56:06 cl031 kernel: CMAN: we are leaving the cluster.
Dec 18 19:56:07 cl031 kernel: dlm: closing connection to node 2
Dec 18 19:56:07 cl031 kernel: dlm: closing connection to node 3
Dec 18 19:56:07 cl031 kernel: SM: 00000001 sm_stop: SG still joined
Dec 18 19:56:07 cl031 kernel: SM: 0100081e sm_stop: SG still joined

cl030:
Dec 18 19:56:05 cl030 kernel: CMAN: bad generation number 10 in HELLO message, expected 9
Dec 18 19:56:06 cl030 kernel: CMAN: Node cl031a is leaving the cluster, Shutdown
Dec 18 19:56:06 cl030 kernel: CMAN: quorum lost, blocking activity

It looks like cl032 had the most problems. It hit a series of asserts:

$ grep BUG cl032.messages
Dec 18 19:56:48 cl032 kernel: kernel BUG at /Views/redhat-cluster/cluster/gfs-kernel/src/dlm/lock.c:400!
Dec 18 19:56:48 cl032 kernel: kernel BUG at /Views/redhat-cluster/cluster/gfs-kernel/src/dlm/lock.c:342!
Dec 18 20:01:06 cl032 kernel: kernel BUG at /Views/redhat-cluster/cluster/gfs-kernel/src/dlm/lock.c:342!
Dec 18 20:01:07 cl032 kernel: kernel BUG at /Views/redhat-cluster/cluster/gfs-kernel/src/dlm/lock.c:342!

Questions:
- Any ideas on what is going on here?
- How does one know what the current "generation" number is?
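For what it's worth, the generation numbers each node complained about can be scraped out of the saved syslogs. A minimal sketch (the filenames and the exact wording of the CMAN message are assumptions on my part, taken from the excerpts above):

```shell
# Hedged sketch: condense a "bad generation number" syslog line.
# In practice one would feed it the saved logs, e.g.:
#   grep -h 'bad generation number' cl03*.messages | sed ...
line='Dec 18 19:56:05 cl032 kernel: CMAN: bad generation number 10 in HELLO message, expected 9'
echo "$line" |
  sed 's/.*bad generation number \([0-9]*\).*expected \([0-9]*\).*/saw generation \1, expected \2/'
# prints: saw generation 10, expected 9
```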
When CMAN gets an error, it does not shut down all the cluster software correctly: GFS is still mounted, and anything accessing it hangs. For debugging it is fine for the machine to stay up so we can figure out what is going on, but for a real operational cluster this is very bad.

More questions:
- In normal operation, if the cluster hits a bug like this, shouldn't the node just reboot so that the other nodes can hopefully recover?
- Is there more debugging that can be turned on, so we can figure out what is going on?

The full info is available here: http://developer.osdl.org/daniel/GFS/cman.18dec2004/

Thanks,
Daniel
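P.S. As a stopgap for the reboot question above (this is a generic kernel knob, not anything cman-specific, and it is only my assumption that it helps here): the kernel can be told to turn an oops/BUG into a panic and then reboot, e.g. via /etc/sysctl.conf:

```
# Assumed workaround, not cman's own recovery mechanism:
kernel.panic_on_oops = 1   # treat a kernel BUG/oops as a panic
kernel.panic = 30          # reboot 30 seconds after a panic
```

That would at least knock a wedged node out of the cluster instead of leaving GFS mounted and hung, though it obviously does nothing for the underlying bug.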