I was running my test on a 3 node cluster and it died after 11 hours.
cl030 lost quorum, with the other 2 nodes kicked out of the cluster.
cl031 also hit a bunch of asserts like

lock_dlm: Assertion failed on line 352 of file /Views/redhat-cluster/cluster/gfs-kernel/src/dlm/lock.c
lock_dlm: assertion: "!error"
lock_dlm: time = 291694516
stripefs: error=-22 num=2,19

which I assume are caused by the cluster shutting down.

/var/log/messages showed:

cl030:
Feb 11 02:44:33 cl030 kernel: CMAN: removing node cl032a from the cluster : No response to messages
Feb 11 02:44:33 cl030 kernel: CMAN: removing node cl031a from the cluster : No response to messages
Feb 11 02:44:33 cl030 kernel: CMAN: quorum lost, blocking activity
Feb 11 14:40:33 cl030 sshd(pam_unix)[27323]: session opened for user root by (uid=0)

cl031:
Feb 11 02:44:33 cl031 kernel: CMAN: node cl032a has been removed from the cluster : No response to messages
Feb 11 02:44:33 cl031 kernel: CMAN: node cl031a has been removed from the cluster : No response to messages
Feb 11 02:44:33 cl031 kernel: CMAN: killed by NODEDOWN message
Feb 11 02:44:33 cl031 kernel: CMAN: we are leaving the cluster.
Feb 11 02:44:34 cl031 kernel: lowcomms_get_buffer: accepting is 0
Feb 11 02:44:34 cl031 kernel: dlm: stripefs: remote_stage error -105 2019c
Feb 11 02:44:34 cl031 ccsd[3823]: [cluster_mgr.c:387] Cluster manager shutdown. Attemping to reconnect...
Feb 11 02:44:34 cl031 kernel: SM: 00000001 sm_stop: SG still joined
Feb 11 02:44:34 cl031 kernel: SM: 0100041e sm_stop: SG still joined
Feb 11 02:44:34 cl031 kernel: SM: 0200041f sm_stop: SG still joined
Feb 11 02:44:37 cl031 ccsd[3823]: [cluster_mgr.c:346] Unable to connect to cluster infrastructure after 30 seconds.
Feb 11 02:45:07 cl031 ccsd[3823]: [cluster_mgr.c:346] Unable to connect to cluster infrastructure after 60 seconds.

cl032:
Feb 11 02:44:33 cl032 kernel: CMAN: node cl032a has been removed from the cluster : No response to messages
Feb 11 02:44:33 cl032 kernel: CMAN: killed by NODEDOWN message
Feb 11 02:44:33 cl032 kernel: CMAN: we are leaving the cluster.
Feb 11 02:44:34 cl032 kernel: lowcomms_get_buffer: accepting is 0
Feb 11 02:44:34 cl032 kernel: dlm: stripefs: remote_stage error -105 102bd
Feb 11 02:44:34 cl032 kernel: lowcomms_get_buffer: accepting is 0
Feb 11 02:44:34 cl032 ccsd[22909]: [cluster_mgr.c:387] Cluster manager shutdown. Attemping to reconnect...
Feb 11 02:44:34 cl032 kernel: SM: 00000001 sm_stop: SG still joined
Feb 11 02:44:34 cl032 kernel: SM: 0100041e sm_stop: SG still joined
Feb 11 02:44:34 cl032 kernel: SM: 0200041f sm_stop: SG still joined
Feb 11 02:44:53 cl032 ccsd[22909]: [cluster_mgr.c:346] Unable to connect to cluster infrastructure after 90 seconds.

More info available here:

http://developer.osdl.org/daniel/GFS/test.10feb2005/

I usually get closer to 50 hours before problems.

Any ideas?

Daniel