Rejoin cluster after failure without reboot?

Hi,

I'm experimenting with corosync+dlm+gfs2 (approximately following http://people.redhat.com/teigland/cluster4-gfs2-dlm.txt) and am trying to establish whether it meets my requirements. I have a query about a node rejoining a cluster after failure, and want to make sure I'm not overlooking something.
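For context, the per-node setup is roughly the following (the cluster name, device path and systemd unit names are just illustrative of what I have; the lockspace name mygfs2 is the one that appears in the logs below):

# on each node: bring up the membership and lock manager layers
systemctl start corosync
systemctl start dlm                       # runs dlm_controld
# on one node only: create the filesystem (cluster name and device illustrative)
mkfs.gfs2 -p lock_dlm -t mycluster:mygfs2 -j 3 /dev/vg0/shared
# on every node:
mount -t gfs2 /dev/vg0/shared /mnt/gfs2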

I have a three-node cluster and deliberately cause token loss by firewalling one of them (call it node A) out of the network for longer than the token timeout. At this point, the other two hosts (B and C) decide that A has disappeared and continue with quorum. That is fine.
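Concretely, the firewalling on node A is something like this (peer addresses are illustrative; the point is simply to block all traffic to and from B and C for longer than the totem token timeout):

# on node A: cut off traffic to/from nodes B and C
iptables -A INPUT  -s 192.168.1.2 -j DROP
iptables -A INPUT  -s 192.168.1.3 -j DROP
iptables -A OUTPUT -d 192.168.1.2 -j DROP
iptables -A OUTPUT -d 192.168.1.3 -j DROP
# ...wait for longer than the token timeout, then remove the rules again
iptables -D INPUT  -s 192.168.1.2 -j DROP
iptables -D INPUT  -s 192.168.1.3 -j DROP
iptables -D OUTPUT -d 192.168.1.2 -j DROP
iptables -D OUTPUT -d 192.168.1.3 -j DROP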

When I unfirewall node A, dlm tries to reconnect to its peers on B and C. But then I see the following on host B:

16:29:25.823496 nodeb dlm_controld[6548]: 908 daemon node 85 stateful merge
16:29:25.823529 nodeb dlm_controld[6548]: 908 daemon node 85 kill due to stateful merge
16:29:25.823543 nodeb dlm_controld[6548]: 908 tell corosync to remove nodeid 85 from cluster
16:29:25.823696 nodeb corosync[6536]: [CFG ] request to kill node 85(us=83): xxx

and then the following on node A:

16:29:25.828547 nodea corosync[3896]: [CFG ] Killed by node 83: dlm_controld
16:29:25.828575 nodea corosync[3896]: [MAIN ] Corosync Cluster Engine exiting with status -1 at cfg.c:530.
16:29:25.834828 nodea dlm_controld[3466]: 1183 process_cluster_cfg cfg_dispatch 2
16:29:25.834871 nodea dlm_controld[3466]: 1183 cluster is down, exiting
16:29:25.834886 nodea dlm_controld[3466]: 1183 process_cluster quorum_dispatch 2
16:29:25.834903 nodea dlm_controld[3466]: 1183 daemon cpg_dispatch error 2
16:29:25.834917 nodea dlm_controld[3466]: 1183 cpg_dispatch error 2
16:29:25.837152 nodea dlm_controld[3466]: 1183 abandoned lockspace mygfs2

resulting in both corosync and dlm_controld exiting on node A.

Later, if I try to manually restart corosync and dlm on node A, I see the following:

16:32:08.382871 nodea dlm_controld[20483]: 2872 dlm_controld 4.0.2 started
16:32:08.392453 nodea dlm_controld[20483]: 2872 found uncontrolled lockspace mygfs2
16:32:08.392477 nodea dlm_controld[20483]: 2872 tell corosync to remove nodeid 85 from cluster
16:32:08.394965 nodea corosync[20456]: [CFG ] request to kill node 85(us=85): xxx
16:32:08.394998 nodea corosync[20456]: [CFG ] Killed by node 85: dlm_controld
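(The restart attempt itself is nothing more than something like the following on node A, using the unit names as packaged on my distribution:)

# on node A, after being killed as above
systemctl start corosync
systemctl start dlm        # dlm_controld finds the old lockspace and asks corosync to kill us again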

The only way I have found of making A rejoin the cluster is to reboot it.

I would be grateful if you could confirm the following statements:
(a) The "stateful merge" is unavoidable when node A leaves the cluster for longer than the token timeout then tries to rejoin. (b) Killing corosync on node A is unavoidable when node B sees the "stateful merge".
  (c) dlm exiting is unavoidable when corosync dies.
(d) Restarting corosync then dlm on node A will necessarily result in "found uncontrolled lockspace". (e) The only way to recover from "found uncontrolled lockspace" (for a gfs2 lockspace) is to reboot.

I'm hoping that I'm overlooking something and that at least one of (a)--(e) is false! I'm not comfortable with a reboot being the only means of recovery when the token timeout is exceeded.

Thanks,
Jonathan

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
