From: David Teigland [mailto:teigland@xxxxxxxxxx]
Sent: den 24 september 2007 18:10
To: Borgström Jonas
Cc: linux clustering
Subject: Re: Possible cman init script race condition

*snip*

> > 1190645596 node "prod-db1" not a cman member, cn 1
> > 1190645597 node "prod-db1" not a cman member, cn 1
> > 1190645598 node "prod-db1" not a cman member, cn 1
> > 1190645599 node "prod-db1" not a cman member, cn 1
> > 1190645600 reduce victim prod-db1
> > 1190645600 delay of 16s leaves 0 victims
> > 1190645600 finish default 1
> > 1190645600 stop default
> > 1190645600 start default 2 members 1 2
> > 1190645600 do_recovery stop 1 start 2 finish 1
>
> I think something has gone wrong here, either in groupd or fenced, that's
> preventing this start from finishing (we don't get a 'finish default 2',
> which we expect). A 'group_tool -v' here should show the state of the
> fence group still in transition. Could you run that, plus a 'group_tool
> dump' at this point, in addition to the 'dump fence' you have? And please
> run those commands on both nodes.

Hi David, thanks for your fast response. Here's the output you requested:

[root@prod-db1 ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010001 JOIN_START_WAIT 2 200020001 1
[1 2]

[root@prod-db2 ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010002 JOIN_START_WAIT 1 100020001 1
[1 2]

I attached the "group_tool dump" output as files (included below), since
it is quite long.

> > 1190645954 client 3: dump    <--- Before killing prod-db1
> > 1190645985 stop default
> > 1190645985 start default 3 members 2
> > 1190645985 do_recovery stop 2 start 3 finish 1
> > 1190645985 finish default 3
> > 1190646008 client 3: dump    <--- After killing prod-db1
>
> Node 1 isn't fenced here because it never completed joining the fence
> group above.
>
> > The scary part is that, as far as I can tell, fenced is the only cman
> > daemon affected by this. So your cluster appears to work fine, but
> > when a node needs to be fenced the operation isn't carried out, and
> > that can cause gfs filesystem corruption.
>
> You shouldn't be able to mount gfs on the node where joining the fence
> group is stuck.

My current setup is very stripped down, so I haven't configured gfs. But
on my original setup, where I first noticed this issue, I had no problem
mounting gfs filesystems, and after a simulated network failure I could
still write to the filesystem from both nodes, since neither node was
fenced. That quickly corrupted the filesystem.

Regards,
Jonas
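P.S. For anyone else hitting this before a proper fix: since the stuck join
shows up as JOIN_START_WAIT in 'group_tool -v', an init script can at least
refuse to proceed to the gfs mounts while the fence group is still in
transition. The following is only a rough sketch: the awk field number is
taken from the 'group_tool -v' column layout shown above, and the *_WAIT
pattern and 60-second timeout are my own assumptions, not anything
guaranteed by group_tool. If your fence_tool supports a wait option on
'fence_tool join', that would be a cleaner way to block until the join
completes.

#!/bin/sh
# Sketch of a workaround check: wait until the fence group has left any
# transitional *_WAIT state (e.g. the JOIN_START_WAIT seen above) before
# continuing to gfs mounts.  The 60s limit is arbitrary.

timeout=60
while [ "$timeout" -gt 0 ]; do
    # The state is the 5th column of the fence line in 'group_tool -v'
    # output, per the examples earlier in this thread.
    state=$(group_tool -v 2>/dev/null | awk '$1 == "fence" { print $5 }')
    # Succeed once the fence group exists and its state does not end
    # in "_WAIT" (i.e. it is no longer in transition).
    if [ -n "$state" ] && [ "$state" = "${state%_WAIT}" ]; then
        exit 0
    fi
    sleep 1
    timeout=$((timeout - 1))
done
echo "fence group still in transition (state=$state), not continuing" >&2
exit 1

In our init order this would run right after joining the fence domain and
before any gfs mounts. It doesn't fix the race itself; it just keeps a node
from mounting gfs while it was never actually protected by fencing.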
group_tool dump from prod-db2:

1190653002 cman: our nodeid 2 name prod-db2 quorum 1
1190653002 setup_cpg groupd_handle 6b8b456700000000
1190653002 groupd confchg total 1 left 0 joined 1
1190653002 client connection 3
1190653002 got client 3 setup
1190653002 setup fence 0
1190653002 client connection 4
1190653002 got client 4 setup
1190653002 setup dlm 1
1190653003 client connection 5
1190653003 got client 5 setup
1190653003 setup gfs 2
1190653004 got client 3 join
1190653004 0:default got join
1190653004 0:default is cpg client 6 name 0_default handle 327b23c600000001
1190653004 0:default cpg_join ok
1190653004 0:default waiting for first cpg event
1190653004 0:default confchg left 0 joined 1 total 1
1190653004 0:default process_node_join 2
1190653004 0:default cpg add node 2 total 1
1190653004 0:default create group id 10002 our_nodeid 2
1190653004 0:default make_event_id 200010001 nodeid 2 memb_count 1 type 1
1190653004 0:default queue join event for nodeid 2
1190653004 0:default process_current_event 200010001 2 JOIN_BEGIN
1190653004 0:default app node init: add 2 total 1
1190653004 0:default process_current_event 200010001 2 JOIN_ALL_STOPPED
1190653004 0:default action for app: setid default 65538
1190653004 0:default action for app: start default 1 2 1 2
1190653004 client connection 7
1190653004 got client 7 get_group
1190653004 client connection 7
1190653004 got client 7 get_group
1190653005 client connection 7
1190653005 got client 7 get_group
1190653006 client connection 7
1190653006 got client 7 get_group
1190653007 client connection 7
1190653007 got client 7 get_group
1190653008 client connection 7
1190653008 got client 7 get_group
1190653009 client connection 7
1190653009 got client 7 get_group
1190653010 client connection 7
1190653010 got client 7 get_group
1190653011 client connection 7
1190653011 got client 7 get_group
1190653012 client connection 7
1190653012 got client 7 get_group
1190653013 client connection 7
1190653013 got client 7 get_group
1190653014 client connection 7
1190653014 got client 7 get_group
1190653015 client connection 7
1190653015 got client 7 get_group
1190653016 client connection 7
1190653016 got client 7 get_group
1190653017 client connection 7
1190653017 got client 7 get_group
1190653018 client connection 7
1190653018 got client 7 get_group
1190653019 client connection 7
1190653019 got client 7 get_group
1190653020 client connection 7
1190653020 got client 7 get_group
1190653021 client connection 7
1190653021 got client 7 get_group
1190653022 client connection 7
1190653022 got client 7 get_group
1190653023 client connection 7
1190653023 got client 7 get_group
1190653024 client connection 7
1190653024 got client 7 get_group
1190653025 client connection 7
1190653025 got client 7 get_group
1190653026 client connection 7
1190653026 got client 7 get_group
1190653027 client connection 7
1190653027 got client 7 get_group
1190653028 client connection 7
1190653028 got client 7 get_group
1190653029 client connection 7
1190653029 got client 7 get_group
1190653030 client connection 7
1190653030 got client 7 get_group
1190653031 client connection 7
1190653031 got client 7 get_group
1190653032 client connection 7
1190653032 got client 7 get_group
1190653033 client connection 7
1190653033 got client 7 get_group
1190653034 client connection 7
1190653034 got client 7 get_group
1190653035 client connection 7
1190653035 got client 7 get_group
1190653036 client connection 7
1190653036 got client 7 get_group
1190653037 client connection 7
1190653037 got client 7 get_group
1190653038 client connection 7
1190653038 got client 7 get_group
1190653039 client connection 7
1190653039 got client 7 get_group
1190653040 client connection 7
1190653040 got client 7 get_group
1190653040 cman: node 1 added
1190653040 groupd confchg total 2 left 0 joined 1
1190653040 0:default confchg left 0 joined 1 total 2
1190653040 0:default process_node_join 1
1190653040 0:default cpg add node 1 total 2
1190653040 0:default make_event_id 100020001 nodeid 1 memb_count 2 type 1
1190653040 0:default queue join event for nodeid 1
1190653041 0:default ignore msg from 1 id 100010001 started
1190653041 0:default ignore msg from 1 id 200020001 stopped
1190653041 0:default ignore msg from 1 id 200020001 started
1190653041 client connection 7
1190653041 got client 7 get_group
1190653041 client connection 7
1190653041 got client 7 get_group
1190653041 got client 3 start_done
1190653041 0:default send started
1190653041 0:default mark node 2 started
1190653041 0:default process_current_event 200010001 2 JOIN_ALL_STARTED
1190653041 0:default action for app: finish default 1
1190653041 0:default purge msg 200020001 from 2 EST_JOIN_BEGIN
1190653041 0:default purge msg 200020001 from 2 EST_JOIN_BEGIN
1190653041 0:default process_current_event 100020001 1 JOIN_BEGIN
1190653041 0:default action for app: stop default
1190653041 0:default app node join: add 1 total 2
1190653041 got client 3 stop_done
1190653041 0:default send stopped
1190653041 0:default waiting for 1 more stopped messages before JOIN_ALL_STOPPED 1
1190653041 0:default mark node 2 stopped
1190653041 0:default process_current_event 100020001 1 JOIN_ALL_STOPPED
1190653041 0:default action for app: start default 2 2 2 1 2
1190653041 got client 3 start_done
1190653041 0:default send started
1190653041 0:default mark node 2 started
1190653042 client connection 7
1190653042 got client 7 get_group
1190653099 client connection 7
1190653099 got client 7 get_groups
1190653143 client connection 7
1190653143 got client 7 dump
group_tool dump from prod-db1:

1190653023 cman: our nodeid 1 name prod-db1 quorum 1
1190653023 setup_cpg groupd_handle 6b8b456700000000
1190653023 groupd confchg total 1 left 0 joined 1
1190653023 client connection 3
1190653023 got client 3 setup
1190653023 setup fence 0
1190653023 client connection 4
1190653023 got client 4 setup
1190653023 setup dlm 1
1190653024 client connection 5
1190653024 got client 5 setup
1190653024 setup gfs 2
1190653025 got client 3 join
1190653025 0:default got join
1190653025 0:default is cpg client 6 name 0_default handle 327b23c600000001
1190653025 0:default cpg_join ok
1190653025 0:default waiting for first cpg event
1190653025 client connection 7
1190653025 0:default waiting for first cpg event
1190653025 got client 7 get_group
1190653025 0:default waiting for first cpg event
1190653025 0:default confchg left 0 joined 1 total 1
1190653025 0:default process_node_join 1
1190653025 0:default cpg add node 1 total 1
1190653025 0:default create group id 10001 our_nodeid 1
1190653025 0:default make_event_id 100010001 nodeid 1 memb_count 1 type 1
1190653025 0:default queue join event for nodeid 1
1190653025 0:default process_current_event 100010001 1 JOIN_BEGIN
1190653025 0:default app node init: add 1 total 1
1190653025 0:default process_current_event 100010001 1 JOIN_ALL_STOPPED
1190653025 0:default action for app: setid default 65537
1190653025 0:default action for app: start default 1 2 1 1
1190653025 client connection 7
1190653025 got client 7 get_group
1190653026 client connection 7
1190653026 got client 7 get_group
1190653027 client connection 7
1190653027 got client 7 get_group
1190653028 client connection 7
1190653028 got client 7 get_group
1190653029 client connection 7
1190653029 got client 7 get_group
1190653030 client connection 7
1190653030 got client 7 get_group
1190653031 client connection 7
1190653031 got client 7 get_group
1190653032 client connection 7
1190653032 got client 7 get_group
1190653033 client connection 7
1190653033 got client 7 get_group
1190653034 client connection 7
1190653034 got client 7 get_group
1190653035 client connection 7
1190653035 got client 7 get_group
1190653036 client connection 7
1190653036 got client 7 get_group
1190653037 client connection 7
1190653037 got client 7 get_group
1190653038 client connection 7
1190653038 got client 7 get_group
1190653039 client connection 7
1190653039 got client 7 get_group
1190653040 client connection 7
1190653040 got client 7 get_group
1190653040 cman: node 2 added
1190653040 groupd confchg total 2 left 0 joined 1
1190653040 0:default confchg left 0 joined 1 total 2
1190653040 0:default process_node_join 2
1190653040 0:default cpg add node 2 total 2
1190653040 0:default make_event_id 200020001 nodeid 2 memb_count 2 type 1
1190653040 0:default queue join event for nodeid 2
1190653041 client connection 7
1190653041 got client 7 get_group
1190653041 client connection 7
1190653041 got client 7 get_group
1190653041 got client 3 start_done
1190653041 0:default send started
1190653041 0:default mark node 1 started
1190653041 0:default process_current_event 100010001 1 JOIN_ALL_STARTED
1190653041 0:default action for app: finish default 1
1190653041 0:default process_current_event 200020001 2 JOIN_BEGIN
1190653041 0:default action for app: stop default
1190653041 0:default app node join: add 2 total 2
1190653041 got client 3 stop_done
1190653041 0:default send stopped
1190653041 0:default waiting for 1 more stopped messages before JOIN_ALL_STOPPED 2
1190653041 0:default mark node 1 stopped
1190653041 0:default process_current_event 200020001 2 JOIN_ALL_STOPPED
1190653041 0:default action for app: start default 2 2 2 2 1
1190653041 got client 3 start_done
1190653041 0:default send started
1190653041 0:default mark node 1 started
1190653041 0:default ignore msg from 2 id 200010001 started
1190653041 0:default ignore msg from 2 id 100020001 stopped
1190653041 0:default ignore msg from 2 id 100020001 started
1190653042 client connection 7
1190653042 got client 7 get_group
1190653080 client connection 7
1190653080 got client 7 get_groups
1190653124 client connection 7
1190653124 got client 7 dump
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster