Re: Rejoin cluster after failure without reboot?

On Wed, Nov 25, 2015 at 03:22:18PM +0000, Jonathan Davies wrote:
> 16:32:08.392453 nodea dlm_controld[20483]: 2872 found uncontrolled
> lockspace mygfs2

> The only way of making A rejoin the cluster is to reboot.

That's expected because we don't have the ability to clear the dlm and
gfs2 kernel state that was left behind.  Reboot is the only way to clear
that.
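
For anyone who wants to check for that leftover state before restarting dlm, here is a minimal sketch in Python. It assumes the kernel dlm exposes one directory per lockspace under /sys/kernel/dlm, which is where dlm_controld appears to look when it reports an uncontrolled lockspace. The function name leftover_lockspaces is made up for illustration and is not part of any dlm tool.

```python
import os

# Assumption: the kernel dlm exposes one directory per lockspace here.
SYS_DLM_DIR = "/sys/kernel/dlm"


def leftover_lockspaces(sys_dlm_dir=SYS_DLM_DIR):
    """Return the sorted names of lockspace directories under sys_dlm_dir."""
    try:
        entries = os.listdir(sys_dlm_dir)
    except FileNotFoundError:
        # The directory is absent (e.g. dlm module not loaded): nothing left behind.
        return []
    return sorted(
        name for name in entries
        if os.path.isdir(os.path.join(sys_dlm_dir, name))
    )


if __name__ == "__main__":
    found = leftover_lockspaces()
    if found:
        # Kernel state from a previous dlm_controld instance is still present.
        print("leftover lockspaces (reboot needed):", ", ".join(found))
    else:
        print("no leftover lockspaces found")
```

This only detects the condition. It does not clear anything, since reboot remains the only way to do that.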

> I would be grateful if you could confirm the following statements:
>   (a) The "stateful merge" is unavoidable when node A leaves the
> cluster for longer than the token timeout then tries to rejoin.

correct

>   (b) Killing corosync on node A is unavoidable when node B sees the
> "stateful merge".

correct

>   (c) dlm exiting is unavoidable when corosync dies.

correct

>   (d) Restarting corosync then dlm on node A will necessarily result
> in "found uncontrolled lockspace".

correct

>   (e) The only way to recover from "found uncontrolled lockspace"
> (for a gfs2 lockspace) is to reboot.

correct
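
To summarize (a) through (e) as one chain, here is a rough sketch in Python. It is only a model of the statements confirmed above, not of corosync's actual membership logic, and the function name and the millisecond parameters are hypothetical.

```python
def rejoin_outcome(outage_ms, token_timeout_ms):
    """Return the ordered consequences for node A, following statements (a)-(e).

    Simplifying assumption: the only input that matters is whether the
    outage lasted longer than the token timeout.
    """
    if outage_ms <= token_timeout_ms:
        return ["node A was never declared lost; no recovery needed"]
    return [
        "(a) stateful merge when node A tries to rejoin",
        "(b) corosync is killed on node A",
        "(c) dlm exits because corosync died",
        "(d) restarted dlm reports 'found uncontrolled lockspace'",
        "(e) reboot node A to clear dlm and gfs2 kernel state",
    ]


if __name__ == "__main__":
    for step in rejoin_outcome(outage_ms=5000, token_timeout_ms=1000):
        print(step)
```

Each step follows from the one before it, which is why the chain cannot be broken once the token timeout has been exceeded.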

> I'm hoping that I'm overlooking something and that at least one of
> (a)--(e) is false! I'm not comfortable with a reboot being the only
> means of recovery when the token timeout is exceeded.

It's the nature of the beast, I'm afraid -- an effect of the extremely
tight coupling of components that's needed to make gfs2 semantics as near
as possible to those of a local fs.  File systems willing to diverge a
little more from local fs behavior are generally more forgiving.

Dave

-- 
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster


