Re: Rejoin cluster after failure without reboot?

Christine Caulfield <ccaulfie@xxxxxxxxxx> · Thu, 26 Nov 2015 08:39:03 +0000

On 25/11/15 15:22, Jonathan Davies wrote:
> Hi,
> 
> I'm experimenting with corosync+dlm+gfs2 (approximately following
> http://people.redhat.com/teigland/cluster4-gfs2-dlm.txt) and am trying
> to establish whether it meets my requirements. I have a query about a
> node rejoining a cluster after failure, and want to make sure I'm not
> overlooking something.
> 
> I have a three-node cluster and deliberately cause token loss by
> firewalling one of them (call it node A) out of the network for longer
> than the token timeout. At this point, the other two hosts (B and C)
> decide that A has disappeared and continue with quorum. That is fine.
> 
> When I unfirewall node A, dlm tries to reconnect to its peers on B and
> C. But then I see the following on host B:
> 
> 16:29:25.823496 nodeb dlm_controld[6548]: 908 daemon node 85 stateful merge
> 16:29:25.823529 nodeb dlm_controld[6548]: 908 daemon node 85 kill due to stateful merge
> 16:29:25.823543 nodeb dlm_controld[6548]: 908 tell corosync to remove nodeid 85 from cluster
> 16:29:25.823696 nodeb corosync[6536]:  [CFG   ] request to kill node 85(us=83): xxx
> 
> and then the following on node A:
> 
> 16:29:25.828547 nodea corosync[3896]:  [CFG   ] Killed by node 83: dlm_controld
> 16:29:25.828575 nodea corosync[3896]:  [MAIN  ] Corosync Cluster Engine exiting with status -1 at cfg.c:530.
> 16:29:25.834828 nodea dlm_controld[3466]: 1183 process_cluster_cfg cfg_dispatch 2
> 16:29:25.834871 nodea dlm_controld[3466]: 1183 cluster is down, exiting
> 16:29:25.834886 nodea dlm_controld[3466]: 1183 process_cluster quorum_dispatch 2
> 16:29:25.834903 nodea dlm_controld[3466]: 1183 daemon cpg_dispatch error 2
> 16:29:25.834917 nodea dlm_controld[3466]: 1183 cpg_dispatch error 2
> 16:29:25.837152 nodea dlm_controld[3466]: 1183 abandoned lockspace mygfs2
> 
> resulting in both corosync and dlm_controld exiting on node A.
> 
> Later, if I try to manually restart corosync and dlm on node A, I see
> the following:
> 
> 16:32:08.382871 nodea dlm_controld[20483]: 2872 dlm_controld 4.0.2 started
> 16:32:08.392453 nodea dlm_controld[20483]: 2872 found uncontrolled lockspace mygfs2
> 16:32:08.392477 nodea dlm_controld[20483]: 2872 tell corosync to remove nodeid 85 from cluster
> 16:32:08.394965 nodea corosync[20456]:  [CFG   ] request to kill node 85(us=85): xxx
> 16:32:08.394998 nodea corosync[20456]:  [CFG   ] Killed by node 85: dlm_controld
> 
> The only way of making A rejoin the cluster is to reboot.
> 

Yes. You need to implement fencing, so that the node will automatically
be restarted when it leaves the cluster.
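
Until fencing is in place, it may help to at least spot the condition
quickly. The following is only a rough sketch, not part of dlm or
corosync: a small Python helper that scans log text for the two
dlm_controld messages quoted in this thread. The function name, the
regular expression and the wording of the explanations are my own
assumptions, and it assumes each log message sits on a single line
(text re-wrapped by a mail client would need joining first).

```python
import re

# Message fragments copied from the dlm_controld output quoted in this thread.
# The explanations on the right are informal summaries, not official wording.
MARKERS = {
    "stateful merge": "node rejoined after being declared gone; peers kill it",
    "found uncontrolled lockspace": "old lockspace still present; reboot needed",
}

# Assumed layout, based on the quoted logs: "<time> <host> <daemon>[<pid>]: <message>"
LINE_RE = re.compile(
    r"^(?P<time>\S+)\s+(?P<host>\S+)\s+(?P<daemon>\w+)\[(?P<pid>\d+)\]:\s+(?P<msg>.*)$"
)


def scan_dlm_log(lines):
    """Return (host, marker, explanation) for each matching dlm_controld line."""
    findings = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None or match.group("daemon") != "dlm_controld":
            continue  # skip corosync and anything that does not parse
        for marker, explanation in MARKERS.items():
            if marker in match.group("msg"):
                findings.append((match.group("host"), marker, explanation))
    return findings
```

Feeding it the lines from your mail would flag nodeb for the stateful
merge and nodea for the uncontrolled lockspace. It only reports the
condition; recovery still needs fencing or a reboot.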

Chrissie

> I would be grateful if you could confirm the following statements:
>   (a) The "stateful merge" is unavoidable when node A leaves the cluster
> for longer than the token timeout then tries to rejoin.
>   (b) Killing corosync on node A is unavoidable when node B sees the
> "stateful merge".
>   (c) dlm exiting is unavoidable when corosync dies.
>   (d) Restarting corosync then dlm on node A will necessarily result in
> "found uncontrolled lockspace".
>   (e) The only way to recover from "found uncontrolled lockspace" (for a
> gfs2 lockspace) is to reboot.
> 
> I'm hoping that I'm overlooking something and that at least one of
> (a)--(e) is false! I'm not comfortable with a reboot being the only
> means of recovery when the token timeout is exceeded.
> 
> Thanks,
> Jonathan
> 

-- 
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster