Re: Clusterbehaviour if one node is not reachable & fenceable any longer?

On 30/01/14 07:00 AM, Nicolas Kukolja wrote:
Digimer <lists <at> alteeve.ca> writes:

And this is the fundamental problem of stretch/geo-clusters.

I am loath to recommend this, because it's soooo easy to screw it up in
the heat of the moment, so please only ever do this after you are 100%
sure the other node is dead:

If you log into the two remaining nodes that are blocked (because of the
inability to fence), you can run 'fence_ack_manual'. That will tell the
cluster that you have manually confirmed the lost node is powered off.
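
For example (a sketch only; the exact invocation differs between cluster
releases, and "lost-node-name" is a placeholder, so check
fence_ack_manual(8) on your nodes first):

    # Run this on one of the surviving, blocked nodes, and ONLY after
    # you have physically confirmed the lost node is powered off:
    fence_ack_manual lost-node-name
    # Depending on the version, it may prompt you to confirm before it
    # clears the pending fence action.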

Again, USE THIS VERY CAREFULLY!

It's tempting to make assumptions when you've got users and managers
yelling at you to get services back up. So much so that Red Hat dropped
'fence_manual' entirely in RHEL 6 because it was too easy to blow things
up. I cannot stress enough how critical it is that you confirm
that the remote location is truly off before doing this. If it's still
on and you clear the fence action, then really bad things could happen
when the link returns.

digimer

Thanks a lot for your support and explanations... So I will try to explain
it to my stakeholders...

One little question is still in my mind:
If, in a three-node scenario, one node is not reachable and not fenceable,
but the two other nodes are still alive and able to communicate with each
other, where is the risk of a "split-brain" situation?

Depending on what happened at the far end, the node could be in a state where it tries to provide or access HA services before realizing it has lost quorum. Quorum only works when the node is behaving in an expected manner. If the node isn't responding, you have to assume it has entered an undefined state, in which case quorum may or may not save you.
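
As a sanity check, you can ask the surviving nodes what they think the
membership and vote counts are (assuming the cman/RHEL 6 stack; a
corosync 2.x stack would use 'corosync-quorumtool -s' instead):

    # On a surviving node:
    cman_tool status    # shows expected votes, total votes and the quorum
    cman_tool nodes     # shows membership as this node currently sees it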

A classic example, though I suspect it doesn't cleanly apply here, would be a node that froze mid-write to shared storage. It's not dead, it's just hung. The other nodes decide it's dead, recover the shared FS and go about their business. At some point later, the hung node recovers, has no idea that time has passed, so it has no reason to think its locks are invalid or to check quorum, and finishes the write it was in the middle of. You now have a corrupted FS.
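
If it helps to picture it, here is a toy illustration of that race in
plain shell; nothing cluster-specific in it, the file just stands in for
shared storage and the backgrounded subshell for the hung node:

    ( sleep 2; echo "stale, half-done write" > shared.db ) &  # the hung node
    echo "recovered state" > shared.db  # survivors assume it's dead, recover
    wait                                # the hung node wakes up, finishes...
    cat shared.db                       # ...and its stale write clobbers the recovery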

Again, this probably doesn't map to your setup, but there are other scenarios where things can get equally messed up in the window between a node recovering and realizing it has lost quorum. The only safe protection is fencing, as it puts the node into a clean state (off or a fresh boot).

The "lost" third node will, if it is still running but not accessable from
the others, disable the service because it has no contact to any other
nodes, right?
So if two nodes are connected, isn't it guaranteed, that the third node is
no longer providing the service?

Nope, the only guarantee is to put it into a known state.

Quorum == useful when nodes are in a defined state.
Fencing == useful when a node is in an undefined state.

hth

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster



