Re: Rejoin cluster after failure without reboot?

Oyvind Albrigtsen <oalbrigt@xxxxxxxxxx> · Thu, 26 Nov 2015 09:57:02 +0100

On 26/11/15 09:39, Christine Caulfield wrote:
On 25/11/15 15:22, Jonathan Davies wrote:
Hi,

I'm experimenting with corosync+dlm+gfs2 (approximately following
http://people.redhat.com/teigland/cluster4-gfs2-dlm.txt) and am trying
to establish whether it meets my requirements. I have a query about a
node rejoining a cluster after failure, and want to make sure I'm not
overlooking something.

I have a three-node cluster and deliberately cause token loss by
firewalling one of them (call it node A) out of the network for longer
than the token timeout. At this point, the other two hosts (B and C)
decide that A has disappeared and continue with quorum. That is fine.
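
The reason B and C keep quorum while A loses it is simple majority arithmetic. A minimal sketch, assuming one vote per node and a plain majority rule (the function names are hypothetical, and special votequorum options such as two_node are not modelled):

```python
def quorum_threshold(expected_votes: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return expected_votes // 2 + 1


def is_quorate(votes_present: int, expected_votes: int) -> bool:
    """True if the partition holds at least a strict majority of votes."""
    return votes_present >= quorum_threshold(expected_votes)


if __name__ == "__main__":
    expected = 3  # nodes A, B and C, one vote each (assumption)
    print("threshold:", quorum_threshold(expected))
    print("B+C quorate:", is_quorate(2, expected))
    print("A alone quorate:", is_quorate(1, expected))
```

With three votes the threshold is two, so the B+C partition stays quorate and the isolated node A does not.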

When I unfirewall node A, dlm tries to reconnect to its peers on B and
C. But then I see the following on host B:

16:29:25.823496 nodeb dlm_controld[6548]: 908 daemon node 85 stateful merge
16:29:25.823529 nodeb dlm_controld[6548]: 908 daemon node 85 kill due to stateful merge
16:29:25.823543 nodeb dlm_controld[6548]: 908 tell corosync to remove nodeid 85 from cluster
16:29:25.823696 nodeb corosync[6536]:  [CFG   ] request to kill node 85(us=83): xxx

and then the following on node A:

16:29:25.828547 nodea corosync[3896]:  [CFG   ] Killed by node 83: dlm_controld
16:29:25.828575 nodea corosync[3896]:  [MAIN  ] Corosync Cluster Engine exiting with status -1 at cfg.c:530.
16:29:25.834828 nodea dlm_controld[3466]: 1183 process_cluster_cfg cfg_dispatch 2
16:29:25.834871 nodea dlm_controld[3466]: 1183 cluster is down, exiting
16:29:25.834886 nodea dlm_controld[3466]: 1183 process_cluster quorum_dispatch 2
16:29:25.834903 nodea dlm_controld[3466]: 1183 daemon cpg_dispatch error 2
16:29:25.834917 nodea dlm_controld[3466]: 1183 cpg_dispatch error 2
16:29:25.837152 nodea dlm_controld[3466]: 1183 abandoned lockspace mygfs2

resulting in both corosync and dlm_controld exiting on node A.
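
To spot these kill requests in a longer log, something like the following could be used. This is only a sketch: it assumes each message sits on a single line (mail-wrapped lines rejoined first) in the format shown in this thread, and the names REMOVE_RE and find_kill_requests are made up for illustration.

```python
import re

# Matches the "tell corosync to remove nodeid N" message as it appears in
# the dlm_controld output quoted in this thread (format is an assumption
# for any other version or logging setup).
REMOVE_RE = re.compile(
    r"^(?P<time>\S+) (?P<host>\S+) dlm_controld\[(?P<pid>\d+)\]: "
    r"\d+ tell corosync to remove nodeid (?P<nodeid>\d+) from cluster$"
)


def find_kill_requests(lines):
    """Return (time, host, nodeid) for every removal request found."""
    events = []
    for line in lines:
        m = REMOVE_RE.match(line.strip())
        if m:
            events.append((m["time"], m["host"], int(m["nodeid"])))
    return events


SAMPLE = [
    "16:29:25.823496 nodeb dlm_controld[6548]: 908 daemon node 85 stateful merge",
    "16:29:25.823543 nodeb dlm_controld[6548]: 908 tell corosync to remove nodeid 85 from cluster",
    "16:32:08.392477 nodea dlm_controld[20483]: 2872 tell corosync to remove nodeid 85 from cluster",
]

if __name__ == "__main__":
    for event in find_kill_requests(SAMPLE):
        print(event)
```

The sample lines are copied from the logs in this message; the first one is deliberately a non-matching line.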

Later, if I try to manually restart corosync and dlm on node A, I see
the following:

16:32:08.382871 nodea dlm_controld[20483]: 2872 dlm_controld 4.0.2 started
16:32:08.392453 nodea dlm_controld[20483]: 2872 found uncontrolled lockspace mygfs2
16:32:08.392477 nodea dlm_controld[20483]: 2872 tell corosync to remove nodeid 85 from cluster
16:32:08.394965 nodea corosync[20456]:  [CFG   ] request to kill node 85(us=85): xxx
16:32:08.394998 nodea corosync[20456]:  [CFG   ] Killed by node 85: dlm_controld

The only way of making A rejoin the cluster is to reboot.

Yes. You need to implement fencing, so that the node will automatically
be restarted when it leaves the cluster.

Chrissie

You'll probably have to use this patch to make fencing work as expected:
https://github.com/ClusterLabs/pacemaker/pull/839

I would be grateful if you could confirm the following statements:
   (a) The "stateful merge" is unavoidable when node A leaves the cluster
for longer than the token timeout then tries to rejoin.
   (b) Killing corosync on node A is unavoidable when node B sees the
"stateful merge".
   (c) dlm exiting is unavoidable when corosync dies.
   (d) Restarting corosync then dlm on node A will necessarily result in
"found uncontrolled lockspace".
   (e) The only way to recover from "found uncontrolled lockspace" (for a
gfs2 lockspace) is to reboot.
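
For statement (d), one way to check in advance whether a restart would hit "found uncontrolled lockspace" is to look for lockspaces the kernel still holds. The sketch below assumes that leftover lockspaces show up as directories under /sys/kernel/dlm; that path is my assumption and is not stated anywhere in this thread, and leftover_lockspaces is a hypothetical helper name.

```python
import os


def leftover_lockspaces(sysfs_dir="/sys/kernel/dlm"):
    """List lockspace directories still present in the kernel.

    The default path is an assumption about where the kernel dlm
    module exposes its lockspaces.
    """
    try:
        entries = os.listdir(sysfs_dir)
    except FileNotFoundError:
        # dlm module not loaded, or nothing exposed at this path
        return []
    return sorted(
        e for e in entries if os.path.isdir(os.path.join(sysfs_dir, e))
    )


if __name__ == "__main__":
    names = leftover_lockspaces()
    if names:
        print("lockspaces still held by the kernel:", ", ".join(names))
    else:
        print("no leftover lockspaces found")
```

If this lists mygfs2 after dlm_controld has exited, the abandoned lockspace is presumably still in the kernel, which would match the behaviour in the restart log above.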

I'm hoping that I'm overlooking something and that at least one of
(a)--(e) is false! I'm not comfortable with a reboot being the only
means of recovery when the token timeout is exceeded.

Thanks,
Jonathan

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster