On Wed, 2004-12-22 at 01:08, Patrick Caulfield wrote:
> On Tue, Dec 21, 2004 at 10:34:41AM -0800, Daniel McNeil wrote:
> > Another test run that managed 52 hours before hitting a cman bug:
> >
> > cl032:
> > Dec 18 19:56:05 cl032 kernel: CMAN: bad generation number 10 in HELLO message, expected 9
> > Dec 18 19:56:06 cl032 kernel: CMAN: killed by STARTTRANS or NOMINATE
> > Dec 18 19:56:06 cl032 kernel: CMAN: we are leaving the cluster.
> > Dec 18 19:56:07 cl032 kernel: dlm: closing connection to node 2
> > Dec 18 19:56:07 cl032 kernel: dlm: closing connection to node 3
> > Dec 18 19:56:07 cl032 kernel: SM: 00000001 sm_stop: SG still joined
> > Dec 18 19:56:07 cl032 kernel: SM: 0100081e sm_stop: SG still joined
> > Dec 18 19:56:07 cl032 kernel: SM: 0200081f sm_stop: SG still joined
> >
> > cl031:
> > Dec 18 19:56:02 cl031 kernel: CMAN: node cl032a is not responding - removing from the cluster
> > Dec 18 19:56:06 cl031 kernel: CMAN: Being told to leave the cluster by node 1
> > Dec 18 19:56:06 cl031 kernel: CMAN: we are leaving the cluster.
> > Dec 18 19:56:07 cl031 kernel: dlm: closing connection to node 2
> > Dec 18 19:56:07 cl031 kernel: dlm: closing connection to node 3
> > Dec 18 19:56:07 cl031 kernel: SM: 00000001 sm_stop: SG still joined
> > Dec 18 19:56:07 cl031 kernel: SM: 0100081e sm_stop: SG still joined
> >
> > cl030:
> > Dec 18 19:56:05 cl030 kernel: CMAN: bad generation number 10 in HELLO message, expected 9
> > Dec 18 19:56:06 cl030 kernel: CMAN: Node cl031a is leaving the cluster, Shutdown
> > Dec 18 19:56:06 cl030 kernel: CMAN: quorum lost, blocking activity
> >
> > Looks like cl032 had the most problems. It hit a bunch of asserts:
> > $ grep BUG cl032.messages
> > Dec 18 19:56:48 cl032 kernel: kernel BUG at /Views/redhat-cluster/cluster/gfs-kernel/src/dlm/lock.c:400!
> > Dec 18 19:56:48 cl032 kernel: kernel BUG at /Views/redhat-cluster/cluster/gfs-kernel/src/dlm/lock.c:342!
> > Dec 18 20:01:06 cl032 kernel: kernel BUG at /Views/redhat-cluster/cluster/gfs-kernel/src/dlm/lock.c:342!
> > Dec 18 20:01:07 cl032 kernel: kernel BUG at /Views/redhat-cluster/cluster/gfs-kernel/src/dlm/lock.c:342!
> >
> > Questions:
> > Any ideas on what is going on here?
> >
> > How does one know what the current "generation" number is?
>
> You don't, cman does. It's the current "generation" of the cluster, which is
> incremented for each state transition. Are you taking nodes up and down during
> these tests?

The nodes are staying up. I am mounting and unmounting a lot.

Any reason not to add the generation number to /proc/cluster/status?
It would help debugging, at least. (There is a rough sketch of what I
mean below.)

> It does seem that cman is susceptible to heavy network traffic, despite my best
> efforts to increase its priority. I'm going to check in a change that will allow
> you to change the retry count, but it's a bit of a hack really.
>
> > When CMAN gets an error, it is not shutting down all the cluster
> > software correctly. GFS is still mounted and anything accessing
> > it is hung. For debugging it is ok for the machine to stay up
> > so we can figure out what is going on, but for a real operational
> > cluster this is very bad. In normal operation, if the cluster
> > hits a bug like this, shouldn't it just reboot, so hopefully
> > all the other nodes can recover?
>
> If you have power-switch fencing and the remainder of the cluster is quorate, then
> surely the failed node should be power-cycled?

I currently have it set up for manual fencing and I have yet to see that
work correctly. This was a 3-node cluster.
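On the /proc/cluster/status idea, the sort of thing I have in mind is
below. This is an untested sketch and I am guessing at cman's internals:
"cluster_generation" is a made-up name for wherever cman keeps the
current generation, and the function just stands in for whatever
2.6-style read_proc routine already produces /proc/cluster/status. The
actual change would only be the one extra sprintf():

#include <linux/kernel.h>
#include <linux/proc_fs.h>

extern int cluster_generation;  /* hypothetical: cman's internal counter */

static int cluster_status_read(char *page, char **start, off_t off,
                               int count, int *eof, void *data)
{
        int len = 0;

        /* ... the fields status already prints would go here ... */

        /* the addition: expose the current generation for debugging */
        len += sprintf(page + len, "Generation: %d\n", cluster_generation);

        *eof = 1;
        return len;
}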
To recap the failure: cl032 got the bad generation number and cman was
"killed by STARTTRANS or NOMINATE". cl030 also got a bad generation
number (but stayed up), and cl031 left the cluster because it says cl030
told it to. So that leaves me with one node up, without quorum.

I did not see any fencing messages. Should the surviving node (cl030)
have attempted fencing, or does it only do that if it has quorum?

I do not seem to be able to keep cman up for much past 2 days if I have
my tests running. (It stays up with no load, of course.) My tests are
not that complicated currently, either: just tar, du and rm in separate
directories from 1, 2 and then 3 nodes simultaneously. Who knows what
will happen if I add tests that cause lots of dlm lock conflicts.

How long does cman stay up in your testing?

Daniel
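P.S. For what it is worth, here is my mental model of the generation
check that produces the "bad generation number" message. This is a toy
userspace illustration only, not cman's actual code; it just assumes,
per your description, that each node carries a view of the cluster
generation, bumps it on every state transition, and rejects a HELLO
whose generation does not match:

#include <stdio.h>

struct node {
        const char *name;
        int generation;     /* this node's view of the cluster generation */
};

/* Any state transition (join, leave, new membership) bumps the count. */
static void state_transition(struct node *n)
{
        n->generation++;
}

/* On HELLO receipt the generations must agree, or the node complains. */
static void recv_hello(struct node *n, int hello_gen)
{
        if (hello_gen != n->generation)
                printf("%s: bad generation number %d in HELLO message, "
                       "expected %d\n", n->name, hello_gen, n->generation);
}

int main(void)
{
        struct node cl030 = { "cl030", 9 };
        struct node cl032 = { "cl032", 9 };

        /* cl032 sees a transition that cl030 somehow missed ... */
        state_transition(&cl032);

        /* ... so cl030 rejects cl032's next HELLO, as in the logs. */
        recv_hello(&cl030, cl032.generation);
        return 0;
}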