[Linux-cluster] cluster lost quorum after 11 hours

Daniel McNeil <daniel@xxxxxxxx> · Fri, 11 Feb 2005 16:47:38 -0800

I was running my test on a 3-node cluster and it died
after 11 hours.  cl030 lost quorum after the other 2 nodes
were kicked out of the cluster.  cl031 also hit a bunch of asserts
like:
    lock_dlm:  Assertion failed on line 352 of file  
    /Views/redhat-cluster/cluster/gfs-kernel/src/dlm/lock.c
    lock_dlm:  assertion:  "!error"
    lock_dlm:  time = 291694516
    stripefs: error=-22 num=2,19
I assume these are caused by the cluster shutting down.
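
For reading the numeric codes: the negative values in these messages look like
negated Linux errno numbers, which is the usual kernel convention. That is an
assumption here, not something the logs confirm. A minimal sketch for decoding
them follows. The table is hardcoded with the Linux values so the result does
not depend on the machine running the script, and decode_kernel_error is a
hypothetical helper name, not part of any cluster tool.

```python
# Linux errno numbers (asm-generic values), hardcoded on purpose:
# Python's errno module reflects the host OS, which may not be Linux.
LINUX_ERRNO = {
    22: ("EINVAL", "Invalid argument"),
    105: ("ENOBUFS", "No buffer space available"),
}


def decode_kernel_error(code):
    """Map a kernel-style negative return code to its errno name.

    Assumes the code is a negated Linux errno, e.g. error=-22.
    """
    name, text = LINUX_ERRNO.get(abs(code), ("UNKNOWN", "not in table"))
    return f"{code}: {name} ({text})"


if __name__ == "__main__":
    # -22 comes from the lock_dlm assert; -105 from the dlm remote_stage
    # messages in /var/log/messages.
    for code in (-22, -105):
        print(decode_kernel_error(code))
```

If that reading is right, error=-22 is EINVAL and the remote_stage -105 is
ENOBUFS, which would fit with lowcomms refusing buffers once the node has left
the cluster.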

/var/log/messages showed:

cl030:
Feb 11 02:44:33 cl030 kernel: CMAN: removing node cl032a from the cluster : No response to messages
Feb 11 02:44:33 cl030 kernel: CMAN: removing node cl031a from the cluster : No response to messages
Feb 11 02:44:33 cl030 kernel: CMAN: quorum lost, blocking activity
Feb 11 14:40:33 cl030 sshd(pam_unix)[27323]: session opened for user root by (uid=0)

cl031:
Feb 11 02:44:33 cl031 kernel: CMAN: node cl032a has been removed from the cluster : No response to messages
Feb 11 02:44:33 cl031 kernel: CMAN: node cl031a has been removed from the cluster : No response to messages
Feb 11 02:44:33 cl031 kernel: CMAN: killed by NODEDOWN message
Feb 11 02:44:33 cl031 kernel: CMAN: we are leaving the cluster.
Feb 11 02:44:34 cl031 kernel: lowcomms_get_buffer: accepting is 0
Feb 11 02:44:34 cl031 kernel: dlm: stripefs: remote_stage error -105 2019c
Feb 11 02:44:34 cl031 ccsd[3823]: [cluster_mgr.c:387] Cluster manager shutdown.  Attemping to reconnect...
Feb 11 02:44:34 cl031 kernel: SM: 00000001 sm_stop: SG still joined
Feb 11 02:44:34 cl031 kernel: SM: 0100041e sm_stop: SG still joined
Feb 11 02:44:34 cl031 kernel: SM: 0200041f sm_stop: SG still joined
Feb 11 02:44:37 cl031 ccsd[3823]: [cluster_mgr.c:346] Unable to connect to cluster infrastructure after 30 seconds.
Feb 11 02:45:07 cl031 ccsd[3823]: [cluster_mgr.c:346] Unable to connect to cluster infrastructure after 60 seconds.

cl032:
Feb 11 02:44:33 cl032 kernel: CMAN: node cl032a has been removed from the cluster : No response to messages
Feb 11 02:44:33 cl032 kernel: CMAN: killed by NODEDOWN message
Feb 11 02:44:33 cl032 kernel: CMAN: we are leaving the cluster.
Feb 11 02:44:34 cl032 kernel: lowcomms_get_buffer: accepting is 0
Feb 11 02:44:34 cl032 kernel: dlm: stripefs: remote_stage error -105 102bd
Feb 11 02:44:34 cl032 kernel: lowcomms_get_buffer: accepting is 0
Feb 11 02:44:34 cl032 ccsd[22909]: [cluster_mgr.c:387] Cluster manager shutdown.  Attemping to reconnect...
Feb 11 02:44:34 cl032 kernel: SM: 00000001 sm_stop: SG still joined
Feb 11 02:44:34 cl032 kernel: SM: 0100041e sm_stop: SG still joined
Feb 11 02:44:34 cl032 kernel: SM: 0200041f sm_stop: SG still joined
Feb 11 02:44:53 cl032 ccsd[22909]: [cluster_mgr.c:346] Unable to connect to cluster infrastructure after 90 seconds.
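
To compare what each node saw, it can help to merge the CMAN lines from all
three logs into one timeline. The following is a minimal sketch, assuming the
standard syslog layout shown above ("Mon DD HH:MM:SS host tag: message").
Syslog does not record the year, so 2005 is supplied as an assumption, and the
names parse_line and cman_timeline are hypothetical.

```python
import re
from datetime import datetime

# Matches e.g. "Feb 11 02:44:33 cl030 kernel: CMAN: quorum lost, blocking activity"
LINE_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<tag>[^:]+): (?P<msg>.*)$"
)


def parse_line(line, year=2005):
    """Return (timestamp, host, tag, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    # The year is not in the log line; it is an assumption passed in by the caller.
    ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
    return ts, m["host"], m["tag"], m["msg"]


def cman_timeline(lines, year=2005):
    """Keep only kernel CMAN messages and sort them by time, then by host."""
    events = []
    for line in lines:
        parsed = parse_line(line, year)
        if parsed is None:
            continue
        ts, host, tag, msg = parsed
        if tag == "kernel" and msg.startswith("CMAN:"):
            events.append(parsed)
    return sorted(events, key=lambda e: (e[0], e[1]))


if __name__ == "__main__":
    sample = [
        "Feb 11 02:44:33 cl031 kernel: CMAN: killed by NODEDOWN message",
        "Feb 11 02:44:33 cl030 kernel: CMAN: quorum lost, blocking activity",
        "Feb 11 02:44:34 cl031 kernel: lowcomms_get_buffer: accepting is 0",
    ]
    for ts, host, tag, msg in cman_timeline(sample):
        print(ts.isoformat(), host, msg)
```

Since syslog timestamps only have one-second resolution, events logged in the
same second on different nodes cannot be ordered relative to each other this
way; the sort by host is only there to make the output stable.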

More info available here:
http://developer.osdl.org/daniel/GFS/test.10feb2005/

I usually get closer to 50 hours before hitting problems. Any ideas?

Daniel