Classic fence loop. Try this doc:
https://access.redhat.com/site/solutions/272913

tg

On Wed, Sep 11, 2013 at 01:03:07PM +0200, Pascal Ehlert wrote:
> Hi,
>
> I have recently set up an HA cluster with two nodes, IPMI-based fencing
> and no quorum disk. Things worked nicely during the first tests, but to my
> annoyance it blew up last night when I did another test of shutting
> down the network interface on my secondary node (node 2).
>
> The node was fenced as expected and came back online. This however
> resulted in an immediate fencing of the other node.
> Fencing went back and forth until I manually powered off node 2 and gave
> node 1 a few minutes to settle down.
>
> Now when I switch node 2 back on, it looks like it joins the cluster and
> is kicked out immediately again, which again results in fencing of node
> 2. I have purposely set the post_join_delay to a high value, but it
> didn't help.
>
> Below are my cluster.conf and log files. My own guess would be that the
> problem is associated with the fact that the node tries to do a stateful
> merge, when it really should be joining without state after a clean
> reboot (see fence_tool dump, line 9).
>
> --------------
> root@rmg-de-1:~# cat /etc/pve/cluster.conf
> <?xml version="1.0"?>
> <cluster config_version="14" name="rmg-de-cl1">
>   <cman expected_votes="1" keyfile="/var/lib/pve-cluster/corosync.authkey" two_node="1"/>
>   <fencedevices>
>     <fencedevice agent="fence_ipmilan" ipaddr="10.xx.xx.11" login="FENCING" name="fenceNode1" passwd="abc"/>
>     <fencedevice agent="fence_ipmilan" ipaddr="10.xx.xx.12" login="FENCING" name="fenceNode2" passwd="abc"/>
>   </fencedevices>
>   <clusternodes>
>     <clusternode name="rmg-de-1" nodeid="1" votes="1">
>       <fence>
>         <method name="1">
>           <device action="reboot" name="fenceNode1"/>
>         </method>
>       </fence>
>     </clusternode>
>     <clusternode name="rmg-de-2" nodeid="2" votes="1">
>       <fence>
>         <method name="1">
>           <device action="reboot" name="fenceNode2"/>
>         </method>
>       </fence>
>     </clusternode>
>   </clusternodes>
>   <fence_daemon post_join_delay="360" />
>   <rm>
>     <pvevm autostart="1" vmid="101"/>
>     <pvevm autostart="1" vmid="100"/>
>     <pvevm autostart="1" vmid="104"/>
>     <pvevm autostart="1" vmid="103"/>
>     <pvevm autostart="1" vmid="102"/>
>   </rm>
> </cluster>
> --------------
>
> --------------
> root@rmg-de-1:~# fence_tool dump | tail -n 40
> 1378890849 daemon node 1 max 1.1.1.0 run 1.1.1.1
> 1378890849 daemon node 1 join 1378855487 left 0 local quorum 1378855487
> 1378890849 receive_start 1:12 len 152
> 1378890849 match_change 1:12 matches cg 12
> 1378890849 wait_messages cg 12 need 1 of 2
> 1378890850 receive_protocol from 2 max 1.1.1.0 run 1.1.1.1
> 1378890850 daemon node 2 max 0.0.0.0 run 0.0.0.0
> 1378890850 daemon node 2 join 1378890849 left 1378859110 local quorum 1378855487
> 1378890850 daemon node 2 stateful merge
> 1378890850 daemon node 2 kill due to stateful merge
> 1378890850 telling cman to remove nodeid 2 from cluster
> 1378890862 cluster node 2 removed seq 832
> 1378890862 fenced:daemon conf 1 0 1 memb 1 join left 2
> 1378890862 fenced:daemon ring 1:832 1 memb 1
> 1378890862 fenced:default conf 1 0 1 memb 1 join left 2
> 1378890862 add_change cg 13 remove nodeid 2 reason 3
> 1378890862 add_change cg 13 m 1 j 0 r 1 f 1
> 1378890862 add_victims node 2
> 1378890862 check_ringid cluster 832 cpg 1:828
> 1378890862 fenced:default ring 1:832 1 memb 1
> 1378890862 check_ringid done cluster 832 cpg 1:832
> 1378890862 check_quorum done
> 1378890862 send_start 1:13 flags 2 started 6 m 1 j 0 r 1 f 1
> 1378890862 cpg_mcast_joined retried 1 start
> 1378890862 receive_start 1:13 len 152
> 1378890862 match_change 1:13 skip cg 12 already start
> 1378890862 match_change 1:13 matches cg 13
> 1378890862 wait_messages cg 13 got all 1
> 1378890862 set_master from 1 to complete node 1
> 1378890862 delay post_join_delay 360 quorate_from_last_update 0
> 1378891222 delay of 360s leaves 1 victims
> 1378891222 rmg-de-2 not a cluster member after 360 sec post_join_delay
> 1378891222 fencing node rmg-de-2
> 1378891236 fence rmg-de-2 dev 0.0 agent fence_ipmilan result: success
> 1378891236 fence rmg-de-2 success
> 1378891236 send_victim_done cg 13 flags 2 victim nodeid 2
> 1378891236 send_complete 1:13 flags 2 started 6 m 1 j 0 r 1 f 1
> 1378891236 receive_victim_done 1:13 flags 2 len 80
> 1378891236 receive_victim_done 1:13 remove victim 2 time 1378891236 how 1
> 1378891236 receive_complete 1:13 len 152
> --------------
>
> --------------
> root@rmg-de-1:~# tail -n 100 /var/log/cluster/corosync.log
> Sep 11 11:14:09 corosync [CLM ] CLM CONFIGURATION CHANGE
> Sep 11 11:14:09 corosync [CLM ] New Configuration:
> Sep 11 11:14:09 corosync [CLM ] 	r(0) ip(10.xx.xx.1)
> Sep 11 11:14:09 corosync [CLM ] Members Left:
> Sep 11 11:14:09 corosync [CLM ] Members Joined:
> Sep 11 11:14:09 corosync [CLM ] CLM CONFIGURATION CHANGE
> Sep 11 11:14:09 corosync [CLM ] New Configuration:
> Sep 11 11:14:09 corosync [CLM ] 	r(0) ip(10.xx.xx.1)
> Sep 11 11:14:09 corosync [CLM ] 	r(0) ip(10.xx.xx.2)
> Sep 11 11:14:09 corosync [CLM ] Members Left:
> Sep 11 11:14:09 corosync [CLM ] Members Joined:
> Sep 11 11:14:09 corosync [CLM ] 	r(0) ip(10.xx.xx.2)
> Sep 11 11:14:09 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Sep 11 11:14:09 corosync [QUORUM] Members[2]: 1 2
> Sep 11 11:14:09 corosync [QUORUM] Members[2]: 1 2
> Sep 11 11:14:09 corosync [CPG ] chosen downlist: sender r(0) ip(10.xx.xx.1) ; members(old:1 left:0)
> Sep 11 11:14:09 corosync [MAIN ] Completed service synchronization, ready to provide service.
> Sep 11 11:14:20 corosync [TOTEM ] A processor failed, forming new configuration.
> Sep 11 11:14:22 corosync [CLM ] CLM CONFIGURATION CHANGE
> Sep 11 11:14:22 corosync [CLM ] New Configuration:
> Sep 11 11:14:22 corosync [CLM ] 	r(0) ip(10.xx.xx.1)
> Sep 11 11:14:22 corosync [CLM ] Members Left:
> Sep 11 11:14:22 corosync [CLM ] 	r(0) ip(10.xx.xx.2)
> Sep 11 11:14:22 corosync [CLM ] Members Joined:
> Sep 11 11:14:22 corosync [QUORUM] Members[1]: 1
> Sep 11 11:14:22 corosync [CLM ] CLM CONFIGURATION CHANGE
> Sep 11 11:14:22 corosync [CLM ] New Configuration:
> Sep 11 11:14:22 corosync [CLM ] 	r(0) ip(10.xx.xx.1)
> Sep 11 11:14:22 corosync [CLM ] Members Left:
> Sep 11 11:14:22 corosync [CLM ] Members Joined:
> Sep 11 11:14:22 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Sep 11 11:14:22 corosync [CPG ] chosen downlist: sender r(0) ip(10.xx.xx.1) ; members(old:2 left:1)
> Sep 11 11:14:22 corosync [MAIN ] Completed service synchronization, ready to provide service.
> --------------
>
> --------------
> root@rmg-de-1:~# dlm_tool ls
> dlm lockspaces
> name          rgmanager
> id            0x5231f3eb
> flags         0x00000000
> change        member 1 joined 0 remove 1 failed 1 seq 12,13
> members       1
> --------------
>
> Unfortunately I only have the output of the currently operational node,
> as the other one is fenced very quickly and the logs are hard to
> retrieve. If someone has an idea however, I'll do my best to provide
> these as well.
>
> Thanks,
>
> Pascal
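
A common way to break this kind of fence loop in a two_node cman cluster is to give one side a head start in the fence race: fence_ipmilan accepts a delay option, so the node whose fence device carries no delay wins when both nodes try to fence each other at the same time. A minimal sketch against the cluster.conf above; putting the delay on the fencedevice element and the 10-second value are only illustrations, not taken from this thread:

--------------
<fencedevices>
  <!-- delay here means fencing of rmg-de-1 waits 10 seconds, so rmg-de-1
       normally survives when both nodes race to fence each other -->
  <fencedevice agent="fence_ipmilan" ipaddr="10.xx.xx.11" login="FENCING"
               name="fenceNode1" passwd="abc" delay="10"/>
  <fencedevice agent="fence_ipmilan" ipaddr="10.xx.xx.12" login="FENCING"
               name="fenceNode2" passwd="abc"/>
</fencedevices>
--------------

This only addresses the race itself; it does not by itself explain the "stateful merge" kill that fenced logs when node 2 rejoins.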