Re: Two-node cluster: Node attempts stateful merge after clean reboot

Hello Pascal,

To disable startup fencing you need clean_start="1" in the fence_daemon tag. I also saw in your previous mail that you are using expected_votes="1"; with this setting each node is quorate on its own, so after a split the cluster partitions into two single-node clusters that operate independently. I recommend using a quorum disk with the master_wins parameter.
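
For reference, a minimal sketch of what those two changes might look like in cluster.conf (values such as config_version and the quorum disk label are placeholders, and a quorum disk setup normally replaces two_node="1" and raises expected_votes, so treat this as an illustration rather than a drop-in config):

--------------
<!-- Sketch only: startup fencing disabled via clean_start, plus a quorum disk
     with master_wins. Placeholder values; expected_votes="3" assumes two node
     votes plus one qdisk vote, and two_node="1" is dropped. -->
<cluster config_version="15" name="rmg-de-cl1">
  <cman expected_votes="3" keyfile="/var/lib/pve-cluster/corosync.authkey"/>
  <fence_daemon clean_start="1" post_join_delay="360"/>
  <quorumd label="rmg-qdisk" master_wins="1" votes="1"/>
  <!-- fencedevices, clusternodes and rm sections unchanged from your config -->
</cluster>
--------------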


2013/9/11 Pascal Ehlert <pascal@xxxxxxxxxxxx>
Hi,

I have recently set up an HA cluster with two nodes, IPMI-based fencing
and no quorum disk. Things worked nicely during the first tests, but to my
very annoyance it blew up last night when I did another test of shutting
down the network interface on my secondary node (node 2).

The node was fenced as expected and came back online. This, however,
resulted in an immediate fencing of the other node.
Fencing went back and forth until I manually powered off node 2 and gave
node 1 a few minutes to settle down.

Now when I switch node 2 back on, it looks like it joins the cluster and
is kicked out again immediately, which results in node 2 being fenced once
more. I have purposely set post_join_delay to a high value, but it didn't
help.

Below are my cluster.conf and log files. My own guess is that the problem
is related to the node attempting a stateful merge when it really should be
joining without state after a clean reboot (see fence_tool dump, line 9).

--------------
root@rmg-de-1:~# cat /etc/pve/cluster.conf
<?xml version="1.0"?>
<cluster config_version="14" name="rmg-de-cl1">
  <cman expected_votes="1" keyfile="/var/lib/pve-cluster/corosync.authkey" two_node="1"/>
  <fencedevices>
    <fencedevice agent="fence_ipmilan" ipaddr="10.xx.xx.11" login="FENCING" name="fenceNode1" passwd="abc"/>
    <fencedevice agent="fence_ipmilan" ipaddr="10.xx.xx.12" login="FENCING" name="fenceNode2" passwd="abc"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="rmg-de-1" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device action="" name="fenceNode1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="rmg-de-2" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device action="" name="fenceNode2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fence_daemon post_join_delay="360" />
  <rm>
    <pvevm autostart="1" vmid="101"/>
    <pvevm autostart="1" vmid="100"/>
    <pvevm autostart="1" vmid="104"/>
    <pvevm autostart="1" vmid="103"/>
    <pvevm autostart="1" vmid="102"/>
  </rm>
</cluster>
--------------

--------------
root@rmg-de-1:~# fence_tool dump | tail -n 40
1378890849 daemon node 1 max 1.1.1.0 run 1.1.1.1
1378890849 daemon node 1 join 1378855487 left 0 local quorum 1378855487
1378890849 receive_start 1:12 len 152
1378890849 match_change 1:12 matches cg 12
1378890849 wait_messages cg 12 need 1 of 2
1378890850 receive_protocol from 2 max 1.1.1.0 run 1.1.1.1
1378890850 daemon node 2 max 0.0.0.0 run 0.0.0.0
1378890850 daemon node 2 join 1378890849 left 1378859110 local quorum 1378855487
1378890850 daemon node 2 stateful merge
1378890850 daemon node 2 kill due to stateful merge
1378890850 telling cman to remove nodeid 2 from cluster
1378890862 cluster node 2 removed seq 832
1378890862 fenced:daemon conf 1 0 1 memb 1 join left 2
1378890862 fenced:daemon ring 1:832 1 memb 1
1378890862 fenced:default conf 1 0 1 memb 1 join left 2
1378890862 add_change cg 13 remove nodeid 2 reason 3
1378890862 add_change cg 13 m 1 j 0 r 1 f 1
1378890862 add_victims node 2
1378890862 check_ringid cluster 832 cpg 1:828
1378890862 fenced:default ring 1:832 1 memb 1
1378890862 check_ringid done cluster 832 cpg 1:832
1378890862 check_quorum done
1378890862 send_start 1:13 flags 2 started 6 m 1 j 0 r 1 f 1
1378890862 cpg_mcast_joined retried 1 start
1378890862 receive_start 1:13 len 152
1378890862 match_change 1:13 skip cg 12 already start
1378890862 match_change 1:13 matches cg 13
1378890862 wait_messages cg 13 got all 1
1378890862 set_master from 1 to complete node 1
1378890862 delay post_join_delay 360 quorate_from_last_update 0
1378891222 delay of 360s leaves 1 victims
1378891222 rmg-de-2 not a cluster member after 360 sec post_join_delay
1378891222 fencing node rmg-de-2
1378891236 fence rmg-de-2 dev 0.0 agent fence_ipmilan result: success
1378891236 fence rmg-de-2 success
1378891236 send_victim_done cg 13 flags 2 victim nodeid 2
1378891236 send_complete 1:13 flags 2 started 6 m 1 j 0 r 1 f 1
1378891236 receive_victim_done 1:13 flags 2 len 80
1378891236 receive_victim_done 1:13 remove victim 2 time 1378891236 how 1
1378891236 receive_complete 1:13 len 152:
--------------

--------------
root@rmg-de-1:~# tail -n 100 /var/log/cluster/corosync.log
Sep 11 11:14:09 corosync [CLM   ] CLM CONFIGURATION CHANGE
Sep 11 11:14:09 corosync [CLM   ] New Configuration:
Sep 11 11:14:09 corosync [CLM   ]     r(0) ip(10.xx.xx.1)
Sep 11 11:14:09 corosync [CLM   ] Members Left:
Sep 11 11:14:09 corosync [CLM   ] Members Joined:
Sep 11 11:14:09 corosync [CLM   ] CLM CONFIGURATION CHANGE
Sep 11 11:14:09 corosync [CLM   ] New Configuration:
Sep 11 11:14:09 corosync [CLM   ]     r(0) ip(10.xx.xx.1)
Sep 11 11:14:09 corosync [CLM   ]     r(0) ip(10.xx.xx.2)
Sep 11 11:14:09 corosync [CLM   ] Members Left:
Sep 11 11:14:09 corosync [CLM   ] Members Joined:
Sep 11 11:14:09 corosync [CLM   ]     r(0) ip(10.xx.xx.2)
Sep 11 11:14:09 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 11 11:14:09 corosync [QUORUM] Members[2]: 1 2
Sep 11 11:14:09 corosync [QUORUM] Members[2]: 1 2
Sep 11 11:14:09 corosync [CPG   ] chosen downlist: sender r(0) ip(10.xx.xx.1) ; members(old:1 left:0)
Sep 11 11:14:09 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Sep 11 11:14:20 corosync [TOTEM ] A processor failed, forming new configuration.
Sep 11 11:14:22 corosync [CLM   ] CLM CONFIGURATION CHANGE
Sep 11 11:14:22 corosync [CLM   ] New Configuration:
Sep 11 11:14:22 corosync [CLM   ]     r(0) ip(10.xx.xx.1)
Sep 11 11:14:22 corosync [CLM   ] Members Left:
Sep 11 11:14:22 corosync [CLM   ]     r(0) ip(10.xx.xx.2)
Sep 11 11:14:22 corosync [CLM   ] Members Joined:
Sep 11 11:14:22 corosync [QUORUM] Members[1]: 1
Sep 11 11:14:22 corosync [CLM   ] CLM CONFIGURATION CHANGE
Sep 11 11:14:22 corosync [CLM   ] New Configuration:
Sep 11 11:14:22 corosync [CLM   ]     r(0) ip(10.xx.xx.1)
Sep 11 11:14:22 corosync [CLM   ] Members Left:
Sep 11 11:14:22 corosync [CLM   ] Members Joined:
Sep 11 11:14:22 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 11 11:14:22 corosync [CPG   ] chosen downlist: sender r(0) ip(10.xx.xx.1) ; members(old:2 left:1)
Sep 11 11:14:22 corosync [MAIN  ] Completed service synchronization, ready to provide service.
--------------

--------------
root@rmg-de-1:~# dlm_tool ls
dlm lockspaces
name          rgmanager
id            0x5231f3eb
flags         0x00000000
change        member 1 joined 0 remove 1 failed 1 seq 12,13
members       1
--------------

Unfortunately I only have the output of the currently operational node,
as the other one is fenced very quickly and the logs are hard to
retrieve. If someone has an idea, however, I'll do my best to provide
these as well.
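
For what it's worth, one thing I could try is dumping the state from node 2 right after it boots, before it gets fenced, roughly like this (the output path is just an example):

--------------
# Run on node 2 immediately after boot, before it is fenced, to capture its
# view of the membership for comparison with node 1 (example path only):
(fence_tool dump; cman_tool status; dlm_tool ls) > /root/node2-state.$(date +%s) 2>&1
--------------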

Thanks,

Pascal



--
this is my life and I live it for as long as God wills
-- 
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
