Hi,

I have recently set up an HA cluster with two nodes, IPMI-based fencing, and no quorum disk. Things worked nicely during the first tests, but to my great annoyance it blew up last night when I ran another test: shutting down the network interface on my secondary node (node 2). Node 2 was fenced as expected and came back online, but this immediately triggered fencing of the other node. Fencing went back and forth until I manually powered off node 2 and gave node 1 a few minutes to settle down. Now whenever I switch node 2 back on, it apparently joins the cluster, is kicked out again immediately, and is then fenced once more. I purposely set post_join_delay to a high value, but that didn't help.

Below are my cluster.conf and log files. My own guess is that the problem is related to node 2 attempting a stateful merge when it should really be joining without state after a clean reboot (see line 9 of the fence_tool dump below: "daemon node 2 stateful merge").

--------------
root@rmg-de-1:~# cat /etc/pve/cluster.conf
<?xml version="1.0"?>
<cluster config_version="14" name="rmg-de-cl1">
  <cman expected_votes="1" keyfile="/var/lib/pve-cluster/corosync.authkey" two_node="1"/>
  <fencedevices>
    <fencedevice agent="fence_ipmilan" ipaddr="10.xx.xx.11" login="FENCING" name="fenceNode1" passwd="abc"/>
    <fencedevice agent="fence_ipmilan" ipaddr="10.xx.xx.12" login="FENCING" name="fenceNode2" passwd="abc"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="rmg-de-1" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device action="reboot" name="fenceNode1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="rmg-de-2" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device action="reboot" name="fenceNode2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fence_daemon post_join_delay="360"/>
  <rm>
    <pvevm autostart="1" vmid="101"/>
    <pvevm autostart="1" vmid="100"/>
    <pvevm autostart="1" vmid="104"/>
    <pvevm autostart="1" vmid="103"/>
    <pvevm autostart="1" vmid="102"/>
  </rm>
</cluster>
--------------

--------------
root@rmg-de-1:~# fence_tool dump | tail -n 40
1378890849 daemon node 1 max 1.1.1.0 run 1.1.1.1
1378890849 daemon node 1 join 1378855487 left 0 local quorum 1378855487
1378890849 receive_start 1:12 len 152
1378890849 match_change 1:12 matches cg 12
1378890849 wait_messages cg 12 need 1 of 2
1378890850 receive_protocol from 2 max 1.1.1.0 run 1.1.1.1
1378890850 daemon node 2 max 0.0.0.0 run 0.0.0.0
1378890850 daemon node 2 join 1378890849 left 1378859110 local quorum 1378855487
1378890850 daemon node 2 stateful merge
1378890850 daemon node 2 kill due to stateful merge
1378890850 telling cman to remove nodeid 2 from cluster
1378890862 cluster node 2 removed seq 832
1378890862 fenced:daemon conf 1 0 1 memb 1 join left 2
1378890862 fenced:daemon ring 1:832 1 memb 1
1378890862 fenced:default conf 1 0 1 memb 1 join left 2
1378890862 add_change cg 13 remove nodeid 2 reason 3
1378890862 add_change cg 13 m 1 j 0 r 1 f 1
1378890862 add_victims node 2
1378890862 check_ringid cluster 832 cpg 1:828
1378890862 fenced:default ring 1:832 1 memb 1
1378890862 check_ringid done cluster 832 cpg 1:832
1378890862 check_quorum done
1378890862 send_start 1:13 flags 2 started 6 m 1 j 0 r 1 f 1
1378890862 cpg_mcast_joined retried 1 start
1378890862 receive_start 1:13 len 152
1378890862 match_change 1:13 skip cg 12 already start
1378890862 match_change 1:13 matches cg 13
1378890862 wait_messages cg 13 got all 1
1378890862 set_master from 1 to complete node 1
1378890862 delay post_join_delay 360 quorate_from_last_update 0
1378891222 delay of 360s leaves 1 victims
1378891222 rmg-de-2 not a cluster member after 360 sec post_join_delay
1378891222 fencing node rmg-de-2
1378891236 fence rmg-de-2 dev 0.0 agent fence_ipmilan result: success
1378891236 fence rmg-de-2 success
1378891236 send_victim_done cg 13 flags 2 victim nodeid 2
1378891236 send_complete 1:13 flags 2 started 6 m 1 j 0 r 1 f 1
1378891236 receive_victim_done 1:13 flags 2 len 80
1378891236 receive_victim_done 1:13 remove victim 2 time 1378891236 how 1
1378891236 receive_complete 1:13 len 152:
--------------

--------------
root@rmg-de-1:~# tail -n 100 /var/log/cluster/corosync.log
Sep 11 11:14:09 corosync [CLM   ] CLM CONFIGURATION CHANGE
Sep 11 11:14:09 corosync [CLM   ] New Configuration:
Sep 11 11:14:09 corosync [CLM   ] r(0) ip(10.xx.xx.1)
Sep 11 11:14:09 corosync [CLM   ] Members Left:
Sep 11 11:14:09 corosync [CLM   ] Members Joined:
Sep 11 11:14:09 corosync [CLM   ] CLM CONFIGURATION CHANGE
Sep 11 11:14:09 corosync [CLM   ] New Configuration:
Sep 11 11:14:09 corosync [CLM   ] r(0) ip(10.xx.xx.1)
Sep 11 11:14:09 corosync [CLM   ] r(0) ip(10.xx.xx.2)
Sep 11 11:14:09 corosync [CLM   ] Members Left:
Sep 11 11:14:09 corosync [CLM   ] Members Joined:
Sep 11 11:14:09 corosync [CLM   ] r(0) ip(10.xx.xx.2)
Sep 11 11:14:09 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 11 11:14:09 corosync [QUORUM] Members[2]: 1 2
Sep 11 11:14:09 corosync [QUORUM] Members[2]: 1 2
Sep 11 11:14:09 corosync [CPG   ] chosen downlist: sender r(0) ip(10.xx.xx.1) ; members(old:1 left:0)
Sep 11 11:14:09 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Sep 11 11:14:20 corosync [TOTEM ] A processor failed, forming new configuration.
Sep 11 11:14:22 corosync [CLM   ] CLM CONFIGURATION CHANGE
Sep 11 11:14:22 corosync [CLM   ] New Configuration:
Sep 11 11:14:22 corosync [CLM   ] r(0) ip(10.xx.xx.1)
Sep 11 11:14:22 corosync [CLM   ] Members Left:
Sep 11 11:14:22 corosync [CLM   ] r(0) ip(10.xx.xx.2)
Sep 11 11:14:22 corosync [CLM   ] Members Joined:
Sep 11 11:14:22 corosync [QUORUM] Members[1]: 1
Sep 11 11:14:22 corosync [CLM   ] CLM CONFIGURATION CHANGE
Sep 11 11:14:22 corosync [CLM   ] New Configuration:
Sep 11 11:14:22 corosync [CLM   ] r(0) ip(10.xx.xx.1)
Sep 11 11:14:22 corosync [CLM   ] Members Left:
Sep 11 11:14:22 corosync [CLM   ] Members Joined:
Sep 11 11:14:22 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 11 11:14:22 corosync [CPG   ] chosen downlist: sender r(0) ip(10.xx.xx.1) ; members(old:2 left:1)
Sep 11 11:14:22 corosync [MAIN  ] Completed service synchronization, ready to provide service.
--------------

--------------
root@rmg-de-1:~# dlm_tool ls
dlm lockspaces
name          rgmanager
id            0x5231f3eb
flags         0x00000000
change        member 1 joined 0 remove 1 failed 1 seq 12,13
members       1
--------------

Unfortunately I only have output from the currently operational node, as the other one gets fenced very quickly and its logs are hard to retrieve. If someone has an idea, however, I'll do my best to provide those as well. In the meantime, two things I intend to try myself are sketched in the PS and PPS below.

Thanks,
Pascal
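
PS: Here is what I plan to capture on node 2 right after its next boot, before it gets fenced again (assuming I can log in quickly enough). These are just the standard cman/fenced inspection commands, nothing exotic:

--------------
# How node 2 sees cluster membership and quorum
cman_tool status
cman_tool nodes

# Fence domain membership from node 2's point of view
fence_tool ls -n

# The fence daemon's own debug view of the join/merge
fence_tool dump | tail -n 40
--------------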
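
PPS: Independently of the stateful-merge question, I am considering breaking the mutual fence race by giving node 1 a head start. If I understand the fence_ipmilan man page correctly, the agent accepts a delay attribute, so a sketch (untested on my setup; the 30 seconds is an arbitrary value I picked) would be:

--------------
<fencedevice agent="fence_ipmilan" delay="30" ipaddr="10.xx.xx.11" login="FENCING" name="fenceNode1" passwd="abc"/>
--------------

With that, any attempt to fence node 1 would wait 30 seconds, so in a mutual fencing situation node 1 should always win. It wouldn't explain the stateful merge, but it might at least stop the back-and-forth rebooting.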