My 3-node cluster ran tests for about 53 hours before hitting a problem.
Node cl031 hit the first error: "CMAN: killed by STARTTRANS or NOMINATE".
There is also a DLM assertion failure on cl031, but it comes after a
large amount of debug output.

The full logs are here:
http://developer.osdl.org/daniel/GFS/test.12jan2005/

Any ideas on what is going on?

Here is the simplified output (from the README file):

test started Wed Jan 12 17:18
hung after Fri Jan 14 22:00

cl031 got an error in just under 53 hours.
==========================================
Jan 14 22:00:38 cl031 kernel: CMAN: node cl031a has been removed from the cluster : No response to messages
Jan 14 22:00:38 cl031 kernel: CMAN: killed by STARTTRANS or NOMINATE
Jan 14 22:00:38 cl031 kernel: CMAN: we are leaving the cluster.
Jan 14 22:00:38 cl031 kernel: name " 2 54aef1" flags 2 nodeid 0 ref 1
Jan 14 22:00:38 cl031 kernel: G 0029017f gr 5 rq -1 flg 0 sts 2 node 0 remid 0 lq 0,5

[snip 34980 lines]

Jan 14 22:10:07 cl031 kernel: G 00010165 gr 5 rq -1 flg 0 sts 2 node 0 remid 0 lq 0,5
Jan 14 22:10:07 cl031 kernel: 3 to 3 id 432
Jan 14 22:10:07 cl031 kernel: stripefs updated 350 resources
Jan 14 22:10:07 cl031 kernel: stripefs rebuild locks
Jan 14 22:10:07 cl031 kernel: stripefs rebuilt 0 locks
Jan 14 22:10:07 cl031 kernel: stripefs recover event 6122 done
Jan 14 22:10:07 cl031 kernel: stripefs rcom status f to 3
Jan 14 22:10:07 cl031 kernel: stripefs rcom send 6 to 3 id 433
Jan 14 22:10:07 cl031 kernel: stripefs rcom send 6 to 3 id 434
Jan 14 22:10:07 cl031 kernel: stripefs rcom send 6 to 3 id 435
Jan 14 22:10:07 cl031 kernel: stripefs rcom send 6 to 3 id 436
Jan 14 22:10:07 cl031 kernel: stripefs rcom send 6 to 3 id 437
Jan 14 22:10:07 cl031 kernel: stripefs rcom send 6 to 3 id 438
Jan 14 22:10:07 cl031 kernel: stripefs rcom send 6 to 3 id 439
Jan 14 22:10:07 cl031 kernel: stripefs rcom send 6 to 3 id 440
Jan 14 22:10:07 cl031 kernel: stripefs rcom send 6 to 3 id 441
Jan 14 22:10:07 cl031 kernel: stripefs rcom send 6 to 3 id 442
Jan 14 22:10:07 cl031 kernel: stripefs move flags 0,0,1 ids 6119,6122,6122
Jan 14 22:10:07 cl031 kernel: stripefs process held requests
Jan 14 22:10:07 cl031 kernel: stripefs processed 0 requests
Jan 14 22:10:07 cl031 kernel: stripefs resend marked requests
Jan 14 22:10:07 cl031 kernel: stripefs resent 0 requests
Jan 14 22:10:07 cl031 kernel: stripefs recover event 6122 finished
Jan 14 22:10:07 cl031 kernel: stripefs move flags 1,0,0 ids 6122,6122,6122
Jan 14 22:10:07 cl031 kernel: stripefs add_to_requestq cmd 1 fr 3
Jan 14 22:10:08 cl031 kernel: stripefs move flags 0,0,0 ids 6122,6122,6122
Jan 14 22:10:08 cl031 kernel: stripefs rcom status 0 to 1
Jan 14 22:10:08 cl031 kernel: stripefs move flags 0,1,0 ids 6122,6123,6122
Jan 14 22:10:08 cl031 kernel: stripefs move use event 6123
Jan 14 22:10:08 cl031 kernel: stripefs recover event 6123
Jan 14 22:10:08 cl031 kernel: stripefs add node 1
Jan 14 22:10:08 cl031 kernel: stripefs rcom send 1 to 1 id 443
Jan 14 22:10:08 cl031 kernel: stripefs rcom status 4 to 1
Jan 14 22:10:08 cl031 kernel:
Jan 14 22:10:08 cl031 kernel: DLM: Assertion failed on line 128 of file /Views/redhat-cluster/cluster/dlm-kernel/src/reccomms.c
Jan 14 22:10:08 cl031 kernel: DLM: assertion: "error >= 0"
Jan 14 22:10:08 cl031 kernel: DLM: time = 201619244
Jan 14 22:10:08 cl031 kernel: error = -105
Jan 14 22:10:08 cl031 kernel:

From reccomms.c:

    error = midcomms_send_message(nodeid, (struct dlm_header *) rc, GFP_KERNEL);
    DLM_ASSERT(error >= 0, printk("error = %d\n", error););

cl030
=====
Jan 14 22:00:38 cl030 kernel: CMAN: removing node cl031a from the cluster : No response to messages
Jan 14 22:00:39 cl030 kernel: dlm: stripefs: nodes_init failed -1
Jan 14 22:00:39 cl030 fence_manual: Node cl031a needs to be reset before recovery can procede. Waiting for cl031a to rejoin the cluster or for manual acknowledgement that it has been reset (i.e. fence_ack_manual -s cl031a)

(2 hours and 45 minutes later, Sat Jan 15 00:45:00)

Jan 15 00:50:12 cl030 kernel: CMAN: nmembers in HELLO message from 3 does not match our view (got 1, exp 2)
Jan 15 00:52:57 cl030 kernel: CMAN: too many transition restarts - will die
Jan 15 00:52:57 cl030 kernel: CMAN: we are leaving the cluster. Inconsistent cluster view

cl032
=====
Jan 14 22:00:38 cl032 kernel: CMAN: node cl031a has been removed from the cluster : No response to messages
Jan 14 22:00:39 cl032 kernel: dlm: stripefs: nodes_reconfig failed 1
Jan 14 22:00:39 cl032 fenced[8983]: fencing deferred to 1
Jan 15 00:50:08 cl032 kernel: CMAN: removing node cl030a from the cluster : No response to messages
Jan 15 00:50:08 cl032 kernel: CMAN: quorum lost, blocking activity
Jan 15 00:53:02 cl032 kernel: SM: 00000001 process_recovery_barrier status=-104

Daniel