[Linux-cluster] cluster failed after 53 hours

Daniel McNeil <daniel@xxxxxxxx> · Mon, 17 Jan 2005 17:31:33 -0800

My 3-node cluster ran tests for 53 hours before hitting a problem.

Node cl031 hit the first problem: "CMAN: killed by STARTTRANS or
NOMINATE".  There is also a DLM assert on cl031, but it comes
after a large amount of debug output.  The full logs are at
http://developer.osdl.org/daniel/GFS/test.12jan2005/

Any ideas on what is going on?

Here is simplified output (in the README file):
test started Wed Jan 12 17:18
hung after Fri Jan 14 22:00
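
As a quick check on the elapsed time, the two timestamps can be
subtracted directly.  This is only a sketch: struct stamp and the
helper functions are hypothetical names, both timestamps are assumed
to fall in the same month and time zone, and the end time is taken
from the first CMAN message on cl031 (Jan 14 22:00:38).

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper type: day of month plus wall-clock time. */
struct stamp { int day, hour, min, sec; };

/* Seconds since the start of the month.  Only meaningful when the
 * stamps being compared share a month and time zone (assumed here). */
long seconds_into_month(struct stamp t)
{
    return ((long)t.day * 24 + t.hour) * 3600L + t.min * 60L + t.sec;
}

/* Elapsed seconds from start to end; end must not precede start. */
long elapsed_seconds(struct stamp start, struct stamp end)
{
    long s = seconds_into_month(end) - seconds_into_month(start);
    assert(s >= 0);
    return s;
}

/* Print the elapsed time as hours, minutes and seconds. */
void print_elapsed(struct stamp start, struct stamp end)
{
    long s = elapsed_seconds(start, end);
    printf("%ldh %ldm %lds\n", s / 3600, (s % 3600) / 60, s % 60);
}
```

With start {12, 17, 18, 0} and end {14, 22, 0, 38} this gives
189758 seconds, i.e. 52h 42m 38s.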

cl031 got an error in just under 53 hours.
==========================================
Jan 14 22:00:38 cl031 kernel: CMAN: node cl031a has been removed from the cluster : No response to messages
Jan 14 22:00:38 cl031 kernel: CMAN: killed by STARTTRANS or NOMINATE
Jan 14 22:00:38 cl031 kernel: CMAN: we are leaving the cluster.
Jan 14 22:00:38 cl031 kernel: name "       2          54aef1" flags 2 nodeid 0 ref 1
Jan 14 22:00:38 cl031 kernel: G 0029017f gr 5 rq -1 flg 0 sts 2 node 0 remid 0 lq 0,5
[snip 34980 lines]
Jan 14 22:10:07 cl031 kernel: G 00010165 gr 5 rq -1 flg 0 sts 2 node 0 remid 0 lq 0,5
Jan 14 22:10:07 cl031 kernel:  3 to 3 id 432
Jan 14 22:10:07 cl031 kernel: stripefs updated 350 resources
Jan 14 22:10:07 cl031 kernel: stripefs rebuild locks
Jan 14 22:10:07 cl031 kernel: stripefs rebuilt 0 locks
Jan 14 22:10:07 cl031 kernel: stripefs recover event 6122 done
Jan 14 22:10:07 cl031 kernel: stripefs rcom status f to 3
Jan 14 22:10:07 cl031 kernel: stripefs rcom send 6 to 3 id 433
Jan 14 22:10:07 cl031 kernel: stripefs rcom send 6 to 3 id 434
Jan 14 22:10:07 cl031 kernel: stripefs rcom send 6 to 3 id 435
Jan 14 22:10:07 cl031 kernel: stripefs rcom send 6 to 3 id 436
Jan 14 22:10:07 cl031 kernel: stripefs rcom send 6 to 3 id 437
Jan 14 22:10:07 cl031 kernel: stripefs rcom send 6 to 3 id 438
Jan 14 22:10:07 cl031 kernel: stripefs rcom send 6 to 3 id 439
Jan 14 22:10:07 cl031 kernel: stripefs rcom send 6 to 3 id 440
Jan 14 22:10:07 cl031 kernel: stripefs rcom send 6 to 3 id 441
Jan 14 22:10:07 cl031 kernel: stripefs rcom send 6 to 3 id 442
Jan 14 22:10:07 cl031 kernel: stripefs move flags 0,0,1 ids 6119,6122,6122
Jan 14 22:10:07 cl031 kernel: stripefs process held requests
Jan 14 22:10:07 cl031 kernel: stripefs processed 0 requests
Jan 14 22:10:07 cl031 kernel: stripefs resend marked requests
Jan 14 22:10:07 cl031 kernel: stripefs resent 0 requests
Jan 14 22:10:07 cl031 kernel: stripefs recover event 6122 finished
Jan 14 22:10:07 cl031 kernel: stripefs move flags 1,0,0 ids 6122,6122,6122
Jan 14 22:10:07 cl031 kernel: stripefs add_to_requestq cmd 1 fr 3
Jan 14 22:10:08 cl031 kernel: stripefs move flags 0,0,0 ids 6122,6122,6122
Jan 14 22:10:08 cl031 kernel: stripefs rcom status 0 to 1
Jan 14 22:10:08 cl031 kernel: stripefs move flags 0,1,0 ids 6122,6123,6122
Jan 14 22:10:08 cl031 kernel: stripefs move use event 6123
Jan 14 22:10:08 cl031 kernel: stripefs recover event 6123
Jan 14 22:10:08 cl031 kernel: stripefs add node 1
Jan 14 22:10:08 cl031 kernel: stripefs rcom send 1 to 1 id 443
Jan 14 22:10:08 cl031 kernel: stripefs rcom status 4 to 1
Jan 14 22:10:08 cl031 kernel:
Jan 14 22:10:08 cl031 kernel: DLM:  Assertion failed on line 128 of file /Views/redhat-cluster/cluster/dlm-kernel/src/reccomms.c
Jan 14 22:10:08 cl031 kernel: DLM:  assertion:  "error >= 0"
Jan 14 22:10:08 cl031 kernel: DLM:  time = 201619244
Jan 14 22:10:08 cl031 kernel: error = -105
Jan 14 22:10:08 cl031 kernel:

From reccomms.c:
        error = midcomms_send_message(nodeid, (struct dlm_header *) rc,
                                      GFP_KERNEL);
        DLM_ASSERT(error >= 0, printk("error = %d\n", error););
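
If the value printed by the assertion is a negated Linux errno (the
usual kernel convention, though the log alone does not confirm it),
then -105 would correspond to ENOBUFS.  The sketch below shows that
lookup.  The table and lookup_kernel_error() are illustrative only
and are not part of the DLM source; the numbers are the Linux
definitions, hardcoded because other platforms number them
differently.

```c
#include <stddef.h>
#include <stdio.h>

/* A few errno values as numbered on Linux (hardcoded on purpose). */
struct errno_entry { int code; const char *name; const char *text; };

static const struct errno_entry linux_errnos[] = {
    {  12, "ENOMEM",     "Out of memory" },
    { 104, "ECONNRESET", "Connection reset by peer" },
    { 105, "ENOBUFS",    "No buffer space available" },
};

/* Hypothetical helper: map a kernel-style negative return value
 * (e.g. -105) to its table entry, or NULL if it is not listed. */
const struct errno_entry *lookup_kernel_error(int ret)
{
    size_t i;

    for (i = 0; i < sizeof(linux_errnos) / sizeof(linux_errnos[0]); i++)
        if (linux_errnos[i].code == -ret)
            return &linux_errnos[i];
    return NULL;
}

/* Print a one-line description of a kernel-style return value. */
void print_kernel_error(int ret)
{
    const struct errno_entry *e = lookup_kernel_error(ret);

    if (e)
        printf("%d = -%s (%s)\n", ret, e->name, e->text);
    else
        printf("%d = not in table\n", ret);
}
```

Under the same assumption, a status of -104 would be ECONNRESET.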

cl030
=====
Jan 14 22:00:38 cl030 kernel: CMAN: removing node cl031a from the cluster : No response to messages
Jan 14 22:00:39 cl030 kernel: dlm: stripefs: nodes_init failed -1
Jan 14 22:00:39 cl030 fence_manual: Node cl031a needs to be reset before
recovery can procede.  Waiting for cl031a to rejoin the cluster or for
manual acknowledgement that it has been reset (i.e. fence_ack_manual -s cl031a)
(2 hours and 45 minutes later, at Sat Jan 15 00:45:00)
Jan 15 00:50:12 cl030 kernel: CMAN: nmembers in HELLO message from 3 does not match our view (got 1, exp 2)
Jan 15 00:52:57 cl030 kernel: CMAN: too many transition restarts - will die
Jan 15 00:52:57 cl030 kernel: CMAN: we are leaving the cluster. Inconsistent cluster view

cl032 
=====
Jan 14 22:00:38 cl032 kernel: CMAN: node cl031a has been removed from the cluster : No response to messages
Jan 14 22:00:39 cl032 kernel: dlm: stripefs: nodes_reconfig failed 1
Jan 14 22:00:39 cl032 fenced[8983]: fencing deferred to 1
Jan 15 00:50:08 cl032 kernel: CMAN: removing node cl030a from the cluster : No response to messages
Jan 15 00:50:08 cl032 kernel: CMAN: quorum lost, blocking activity
Jan 15 00:53:02 cl032 kernel: SM: 00000001 process_recovery_barrier status=-104

Daniel