Stephan von Krawczynski wrote:
On Thu, 25 Mar 2010 09:56:24 +0000
Gordan Bobic <gordan@xxxxxxxxxx> wrote:
If I have your mentioned scenario right, including what you believe
should happen:
* First node goes down. Simple enough.
* Second node has new file operations performed on it that the first
  node does not get.
* First node comes up. It is completely fenced from all other
  machines to get itself in sync with the second node.
* Second node goes down. Is it before/after first node is synced?
    o If it is before then you have a fully isolated FS that is
      not accessible.
    o If it is after then you don't have a problem.
I would suggest writing a script and performing some firewalling to
perform the fencing.
This is not really good enough - you need an out-of-band fencing device
that you can use to forcibly down the node that disconnected, e.g.
remote power-off by power management (e.g. UPS or a network controllable
power bar) or remote server management (Dell DRAC, Raritan eRIC G4, HP
iLO, Sun LOM, etc.). When the node gets rebooted, it has to notice there
are other nodes already up and specifically set itself into such a mode
that it will lose any contest on being the source node for resync until
it has fully checked all the files' metadata against its peers.
I believe you can run ls -R on the file-system to
get it in sync. You would need to mount glfs locally on the first node,
get it in sync, then open the firewall ports afterward. Is that an
appropriate solution?
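As a rough illustration of the "ls -R" idea, the sketch below walks a locally mounted volume and stats every entry. It assumes, as the poster believes, that looking up each file is what prompts the resync; the function name and the mount path in the usage note are hypothetical.

```python
import os


def touch_all(mount_point):
    """Walk mount_point and lstat every entry, much like `ls -lR`.

    Returns the number of entries that were stat'ed successfully.
    Assumption: a lookup on each entry is enough to prompt a resync.
    """
    count = 0
    for root, dirs, files in os.walk(mount_point):
        for name in dirs + files:
            try:
                os.lstat(os.path.join(root, name))
                count += 1
            except OSError:
                # Entry vanished or is unreadable; skip it and carry on.
                pass
    return count
```

On the first node this would be called as touch_all("/mnt/glfs") (a made-up mount point) before the firewall ports are reopened. Note that the run time grows with the number of entries on the volume.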
The problem is that firewalling would have to be applied by every node
other than the node that dropped off, and this would need to be
communicated to all the other nodes, and they would have to confirm
before the fencing action is deemed to have succeeded. This is a lot
more complex and error prone compared to just using a single point of
fencing for each node such as a network controlled power bar.
(e.g. http://www.linuxfordevices.com/c/a/News/Entrylevel-4port-IP-power-switch-runs-Linux/)
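The difference in complexity can be sketched roughly as follows. Both functions are hypothetical illustrations, not part of glusterfs: acks stands for whatever confirmation messages the surviving nodes would exchange, and power_off stands for a call to a power bar or management card.

```python
def firewall_fence_succeeded(survivors, acks):
    """Firewall-based fencing: every surviving node must confirm.

    survivors: iterable of node names that should have blocked the peer.
    acks: dict mapping node name -> True once that node has confirmed.
    A single missing or negative confirmation means fencing has failed.
    """
    return all(acks.get(node, False) for node in survivors)


def power_fence_succeeded(power_off, node):
    """Single-point fencing: one device gives one answer.

    power_off: hypothetical callable that cuts power to `node` and
    returns a truthy value on success.
    """
    return bool(power_off(node))
```

With firewalling, the number of confirmations to collect grows with the cluster size; with a power bar there is only ever one.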
Let me add some thoughts here:
First, it looks obvious to me that fencing is not needed for glusterfs in the
described cases. If your first node comes up again it will not deliver data
that is not in sync with the second node; that is what glusterfs is all about.
Not quite - there are a lot of failure modes that involve network
partitioning that WILL cause split-brain and unhealable files.
Now, when your second node goes down while the first is not completely synced
you only have these choices:
1. Blow up the setup and deliver nothing
2. Deliver what the first node actually has.
It looks obvious that the second choice is preferred because whatever the
out-of-sync data is, there is likely in-sync data too to be served. And so you
are at least partly saved.
But you are opening yourself to the prospect of having files that cannot be
healed. I can think of plenty of cases where this is a worse
scenario than just blocking/fencing.
You are also forgetting that the failure mode you are describing
involves a previous failure, too. If A isn't in sync with B and B goes
down, that means A went down first, but came back up.
The real hot topic here is how the time between the first node coming back and
the second node going down is used for an optimal self heal procedure. The
risk of split brain is lower the faster the self heal procedure works.
I'd say that any risk of split brain needs to be suitably addressed. A
solution that includes fencing (to prevent split-brain from occurring in
the first place) plus keeping a separate list of files that are "dirty"
so they can be resynced explicitly before a node is allowed to fully
re-join might be a reasonable way to go. This is similar to what DRBD
does (it keeps a bitmap of dirty blocks for fast resync).
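A minimal sketch of such a dirty list, kept by the surviving node, might look like this. The class and method names are invented for illustration; this is neither glusterfs nor DRBD code, and it tracks whole files where DRBD tracks blocks.

```python
class DirtyLog:
    """Records files written while the peer was down (hypothetical sketch)."""

    def __init__(self):
        self._dirty = set()

    def record_write(self, path, peer_up):
        # Only writes the peer could not have seen need a later resync.
        if not peer_up:
            self._dirty.add(path)

    def pending(self):
        """Files that must be resynced explicitly, in a stable order."""
        return sorted(self._dirty)

    def mark_synced(self, path):
        self._dirty.discard(path)

    def peer_may_rejoin(self):
        # The returning node may fully re-join only when nothing is dirty.
        return not self._dirty
```

The returning node would stay fenced until peer_may_rejoin() is true, so only the files in pending() need to be touched rather than the whole tree.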
It is obvious that the optimal strategy has to know exactly what files to
heal. And I just made a proposal for that in another post.
Running ls -lR is not a good strategy, simply for runtime reasons, if you have
large amounts of data.
I agree, although I'm pretty sure there can be failure modes where it is
necessary. Then again, if you have that big a data set, you should be
partitioning it into smaller RAID1 mirrors with RAID0 striping on top. That
way the time to resync any server to its peer is kept manageable.
Simply running a 100TB mirror isn't sensible. Keeping 100 1TB mirrors is
much more workable come resync time.
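To put rough numbers on that, here is a back-of-the-envelope sketch. The 100 MB/s sustained resync rate is purely an assumption for illustration, and TB is taken as 10^12 bytes; real throughput depends on the network, the disks and the file sizes.

```python
def resync_hours(size_tb, mb_per_s):
    """Hours needed to copy size_tb terabytes at mb_per_s megabytes/second.

    Uses decimal units: 1 TB = 1e12 bytes, 1 MB = 1e6 bytes.
    """
    seconds = (size_tb * 1e12) / (mb_per_s * 1e6)
    return seconds / 3600.0


# Assumed rate: 100 MB/s (hypothetical).
full_mirror = resync_hours(100, 100)  # one 100TB mirror
one_slice = resync_hours(1, 100)      # one 1TB mirror out of 100
```

At that assumed rate a full 100TB mirror needs about 278 hours (roughly 11.6 days), while a single 1TB mirror needs under 3 hours.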
Gordan