Re: solutions for split brain situation

On 09/17/2009 06:47 PM, Stephan von Krawczynski wrote:
> Way above in this discussion I told that we only talk about the first/primary
> subvolume/backend for simplicity. It makes no sense to check a journal if I
> can stat the real file which I have to do anyway if an open/create arrives -
> and we are talking exactly about that. So please explain where is your assumed
> race? Really only a braindead implementation can race on an open. You can
> delay a flush on close (like writebehind), but you can obviously not delay an
> open neither r,rw nor create because you have to know if the file is a)
> existing and b) can be created if not. As long as you don't touch the backend
> you will not find out if a create may fail for disk-full or the like. It may
> as well fail because of access-privileges. whatever it is, you will not find a
> trusted answer without asking the backend, no journal will save you.

Like most backend storage, the backend storage here includes the data pages, the metadata, AND the journal. Saying "without asking the backend" and "no journal will save you" shows a failure to understand that the backend *includes* the journal.

A scenario which should make this clear: let's say the file a.c is removed from a 2-node replication cluster. Something like the following should occur:

1. Lock the resource.
2. Record the intent to remove on each node.
3. Remove on each node.
4. Clear the intent from each node.
5. Unlock the resource.

Now, let's say that one node is not accessible during this process and it comes back up later. After it comes back up, a process may happen to see that the file does not exist on node 1, but does exist on node 2. Should the file exist or not? I don't know if GlusterFS even does this correctly - but if it does, the file should NOT exist. There should be sufficient information, probably in the journal, to show that the file was *removed*, and therefore, even if one node still has the file, the journal tells us that the file was removed. The self-heal operation should remove the file from the node that was down as soon as the discrepancy is detected.
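To make the idea concrete, here is a minimal sketch of that sequence and the self-heal decision that falls out of it. This is not GlusterFS code; the names (Node, pending, replicated_remove, self_heal) are invented for illustration, and locking is elided:

# Illustration only: a toy intent journal for a replicated remove.
# None of these names correspond to real GlusterFS internals.

class Node:
    def __init__(self, name):
        self.name = name
        self.files = set()
        self.pending = []      # intents a peer still owes, e.g. ("remove", path, peer)
        self.online = True

def replicated_remove(path, nodes):
    # Step 1: lock the resource (elided in this sketch).
    live = [n for n in nodes if n.online]
    down = [n for n in nodes if not n.online]
    # Step 2: record the intent to remove; for an unreachable node the
    # intent is kept on the live nodes as "pending against that peer".
    for n in live:
        for d in down:
            n.pending.append(("remove", path, d.name))
    # Step 3: remove on each reachable node.
    for n in live:
        n.files.discard(path)
    # Step 4: clear the intent for nodes where the remove actually happened
    # (it stays pending for the node that was down).
    # Step 5: unlock the resource (elided).

def self_heal(path, healthy, returned):
    # When the two copies disagree, consult the journal instead of guessing:
    # a recorded remove wins over the stale copy on the node that was down.
    entry = ("remove", path, returned.name)
    if entry in healthy.pending:
        returned.files.discard(path)
        healthy.pending.remove(entry)
        return "removed stale copy of %s from %s" % (path, returned.name)
    return "no journal entry for %s; fall back to ordinary resolution" % path

# Example: a.c is removed while node2 is down; when node2 returns, the
# journal on node1 says the file should not exist, so self-heal removes it.
node1, node2 = Node("node1"), Node("node2")
node1.files.add("a.c"); node2.files.add("a.c")
node2.online = False
replicated_remove("a.c", [node1, node2])
node2.online = True
print(self_heal("a.c", node1, node2))

The point of the sketch is only that the intent record survives the outage, so the resolution is a lookup, not a guess.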

The point here is that the journal SHOULD be consulted. If you think otherwise, then I think you are not looking for a reliable replication cluster that implements POSIX guarantees.

I think GlusterFS doesn't provide all of these guarantees as well as it should, but I have not done the full testing to expose how correct or incorrect it is in various cases. As it is, I just received a report of a problem where a Java program trying to use file locking failed on a GlusterFS mount point, but succeeded in /var/tmp. So although I still think GlusterFS has potential, I'm slowly backing down on what production data I am willing to store in it. It's unfortunate that this solution space seems so immature. I'm still switching back and forth between wondering whether I should push / help GlusterFS into solving all of these problems, or just write my own solution.
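For what it's worth, that kind of failure is easy to reproduce outside of Java. A quick standalone check of POSIX record locking on a given path, sketched in Python (Java's FileChannel.lock() maps to fcntl() record locks on Linux, so this exercises roughly the same path; the paths below are just examples):

# locktest.py - try to take a non-blocking exclusive POSIX record lock on
# each path given on the command line, and report success or failure.
import fcntl, os, sys

def try_lock(path):
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        print("lock acquired on %s" % path)
        fcntl.lockf(fd, fcntl.LOCK_UN)
    except OSError as err:
        print("lock FAILED on %s: %s" % (path, err))
    finally:
        os.close(fd)

if __name__ == "__main__":
    # e.g.: python locktest.py /mnt/glusterfs/locktest /var/tmp/locktest
    for p in sys.argv[1:]:
        try_lock(p)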

My favourite solution is a mostly asynchronous master-master approach, where each node can fall out of date with respect to the other as long as they touch different data, while changes that do touch the same data become serialized. Unfortunately, this also requires the most clever implementation strategy, and clever can take time or exceptional talent.
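A rough sketch of what I mean, with invented names (not an existing implementation): updates to unrelated keys proceed independently, while two updates that touch the same key are forced through a per-key lock and a version check, so the later one has to re-read and retry:

# Illustration only: serialize conflicting updates, leave unrelated ones alone.
import threading

class Replica:
    def __init__(self):
        self.data = {}                 # key -> (version, value)
        self.key_locks = {}            # key -> lock taken only for that key
        self.guard = threading.Lock()  # protects the lock table itself

    def _lock_for(self, key):
        with self.guard:
            return self.key_locks.setdefault(key, threading.Lock())

    def read(self, key):
        return self.data.get(key, (0, None))

    def apply(self, key, value, expected_version):
        # Different keys use different locks, so writers to unrelated data
        # never wait on each other. Writers to the same key serialize here,
        # and a stale expected_version means the update was based on data
        # that has since changed, so the caller must re-read and retry.
        with self._lock_for(key):
            current_version, _ = self.data.get(key, (0, None))
            if current_version != expected_version:
                raise RuntimeError("conflict on %r: stale version" % key)
            self.data[key] = (current_version + 1, value)
            return current_version + 1

# Example: two changes to different keys commit independently; a second
# change to the same key based on a stale read is rejected and must retry.
r = Replica()
r.apply("a.c", "first draft", expected_version=0)
r.apply("b.c", "unrelated file", expected_version=0)
try:
    r.apply("a.c", "based on a stale read", expected_version=0)
except RuntimeError as e:
    print(e)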

> Read again: I said "and not going over glusterfs for some unknown reason."
> "unknown reason" means that I can think of some for myself but tend to believe
> there may be lots of others. My personal reason nr 1 is the soft migration
> situation.
>> See my comment about writing a program to set up the xattr metadata for you
> How about using the code that is there - inside glusterfsd.
> It must be there, else you would not be able to mount an already populated
> backend for the first time. Did you try? I did.


This could mean that GlusterFS is too lax with regard to consistency guarantees. If files can appear behind its back and magically be shown, this indicates that GlusterFS is not enforcing access through the mount point, which introduces the potential for inconsistent or faulty results. You are asking it to guess what you want, without seeing that what you are asking for is incompatible with any guarantee of a consistent view. That "it works" is actually more concerning to me than it is a justification of your position. To me, it says there is one more potential problem that I might hit in the future. A file that should have been removed magically re-appears - how is this a good thing?

Cheers,
mark

--
Mark Mielke <mark@xxxxxxxxx>




