Re: solutions for split brain situation

Stephan von Krawczynski <skraw@xxxxxxxxxx> · Fri, 18 Sep 2009 00:47:13 +0200

On Thu, 17 Sep 2009 16:42:47 +0100
Gordan Bobic <gordan@xxxxxxxxxx> wrote:

> 
> >> 1) There is a race condition in what you describe. Since you mentioned 30 years in development, I assume you know what that means. Consider this:
> >> You are "locally feeding" file "x" on server1. During this, the same file gets created via the mountpoint on server2. What would you expect to happen, in a fs that aims for full posix compliance on atomic operations?
> 
> > Sorry for a simple logic I took for granted: if I create a file on a fs and
> > the fs finds I can do that, I feel ok. If glusterfsd creates a file on a fs
> > and the fs tells it it is ok, it should. We cannot meet at the same time,
> > because there is no "same time" for fs requests. First request will win.
> > My hope is that glusterfs is not "creating" files on the client side without
> > having checked the backend storage for their existence or absence.
> > Is that assumption incorrect?
>  
> Which backend? There could be many servers. Gluster will check parent directory journal to make sure it is consistent, lock, write, unlock. Checking and comparing the whole directory content would be prohibitively expensive, especially for large directories over a WAN.

Far earlier in this discussion I said that, for simplicity, we are only talking
about the first/primary subvolume/backend. It makes no sense to check a journal
if I can stat the real file, which I have to do anyway when an open/create
arrives - and that is exactly the case we are talking about. So please explain
where your assumed race is. Really, only a braindead implementation can race on
an open. You can delay a flush on close (like writebehind), but you obviously
cannot delay an open, whether r, rw or create, because you have to know a)
whether the file exists and b) whether it can be created if it does not. As
long as you don't touch the backend you will not find out whether a create may
fail because the disk is full or the like. It may just as well fail because of
access privileges. Whatever it is, you will not get a trustworthy answer
without asking the backend; no journal will save you.
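
To make the "first request will win" point concrete, here is a minimal sketch
in Python. It is not glusterfs code, and I am not claiming glusterfsd does it
this way; it only shows that a local filesystem answers an exclusive create
atomically, so two creators of the same name cannot both succeed. The helper
name try_create and the temporary directory standing in for the backend are
made up for illustration.

```python
import errno
import os
import tempfile


def try_create(path):
    """Attempt an exclusive create; return (created, reason)."""
    try:
        # O_CREAT | O_EXCL makes "does it exist?" and "create it" one atomic
        # step on the backend filesystem: exactly one caller can succeed.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False, "exists"
    except PermissionError:
        return False, "access denied"
    except OSError as exc:
        if exc.errno == errno.ENOSPC:
            return False, "disk full"
        raise
    os.close(fd)
    return True, "created"


# A temporary directory stands in for the backend export directory.
with tempfile.TemporaryDirectory() as backend:
    target = os.path.join(backend, "x")
    first = try_create(target)   # e.g. the file fed locally on server1
    second = try_create(target)  # e.g. the create arriving afterwards
    print(first, second)  # (True, 'created') (False, 'exists')
```

Disk-full and permission failures come back through the same call, which is
why only the backend itself can give the answer.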

> > > 2) The example you give doesn't, in any way, provide justification for not copying the file in via the mountpoint in the first place.
> 
> > Read again: I said "and not going over glusterfs for some unknown reason." 
> > "unknown reason" means that I can think of some myself but tend to believe
> > there may be lots of others. My personal reason nr 1 is the soft migration
> > situation.
> 
> See my comment about writing a program to set up the xattr metadata for you

How about using the code that is already there, inside glusterfsd?
It must be there, or else you would not be able to mount an already populated
backend for the first time. Did you try? I did.
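
If you want to check this yourself, here is a small sketch, assuming Linux and
Python, that dumps whatever extended attributes are visible on a file. Run it
on a file in the populated backend before the first mount, and again after the
file has been accessed through the mountpoint, then compare. The function name
backend_xattrs is made up, and I make no assumption here about which attribute
names glusterfs sets. Note that attributes in the trusted.* namespace are only
visible to root.

```python
import os
import tempfile


def backend_xattrs(path):
    """Return {name: value} for the extended attributes visible on path."""
    if not hasattr(os, "listxattr"):  # os.listxattr is only available on Linux
        return {}
    try:
        names = os.listxattr(path)
    except OSError:
        # Filesystem without xattr support, or no permission to look.
        return {}
    attrs = {}
    for name in names:
        try:
            attrs[name] = os.getxattr(path, name)
        except OSError:
            attrs[name] = None  # listed, but the value could not be read
    return attrs


# Demonstration on a scratch file; point it at a real backend file instead.
with tempfile.NamedTemporaryFile() as scratch:
    result = backend_xattrs(scratch.name)
    print(result)
```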

-- 
Regards,
Stephan