On Fri, Feb 11, 2005 at 04:47:38PM -0800, Daniel McNeil wrote:
> I was running my test on a 3 node cluster and it died
> after 11 hours. cl030 lost quorum with the other 2 nodes
> kicked out of the cluster. cl031 also hit a bunch of asserts
> like
> lock_dlm: Assertion failed on line 352 of file
> /Views/redhat-cluster/cluster/gfs-kernel/src/dlm/lock.c
> lock_dlm: assertion: "!error"
> lock_dlm: time = 291694516
> stripefs: error=-22 num=2,19
> I assume is caused by the cluster shutting down.
>
>
> /var/log/messages showed:
>
> cl030:
> Feb 11 02:44:33 cl030 kernel: CMAN: removing node cl032a from the cluster : No response to messages
> Feb 11 02:44:33 cl030 kernel: CMAN: removing node cl031a from the cluster : No response to messages
> Feb 11 02:44:33 cl030 kernel: CMAN: quorum lost, blocking activity
> Feb 11 14:40:33 cl030 sshd(pam_unix)[27323]: session opened for user root by (uid=0)

You should only get nodes dying from "No response to messages" during a
state transition of some sort (e.g. a node leaving or joining, or possibly
a GFS mount/dismount), in which case the DLM has to do recovery. I
recently checked in a couple of changes that will stop the DLM recovery
from taking over the machine when there are several thousand locks to
recover; that might help.

During a normal "steady" state, a node should not die from "No response
to messages", because the only messages being sent are HELLO heartbeat
messages, and they are not acked.

--
patrick