Re: solutions for split brain situation

Stephan von Krawczynski <skraw@xxxxxxxxxx> · Mon, 14 Sep 2009 17:17:34 +0200

On Mon, 14 Sep 2009 10:25:40 -0400
Mark Mielke <mark@xxxxxxxxxxxxxx> wrote:

> On 09/14/2009 08:06 AM, Stephan von Krawczynski wrote:
> > we have seen several split brain situations and think that the most common
> > option for the situation is simply missing. You can define a favourite child,
> > but you cannot specify that the latest file copy is the definitive one. Why not?
> > Isn't it a logical approach to say that the latest copy of a file based on
> > mtime must be the most up-to-date and should therefore be used in split brain
> > recovery?
> >    
> 
> Latest is *a* resolution, but it's probably not 100% the right answer 
> for everybody. I don't think I would use it. If the file system is 
> forked - and one client is doing one thing, and another is doing another 
> thing - there is no clear answer. Split brain in general is bad. My 
> personal conclusion on the matter is:
>      1) I want to make sure that only one server is modifying one file 
> at one time, and only cut over if the master goes down, *or*
>      2) I want to lock a majority of the servers before allowing a 
> transaction to start, such that split brain should not occur. For a 
> 3-node cluster, this means requiring 2 locks.
> 
> I don't think I would rely on self-healing of split-brain for a 
> production service. Just my opinion.
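
To make the arithmetic in 2) concrete: a strict majority of n servers is
n/2 + 1 (integer division), so two disjoint partitions can never both hold it.
A minimal sketch in C; majority_needed() and have_quorum() are hypothetical
names for illustration, not glusterfs functions:

```c
#include <stdbool.h>

/* Hypothetical helpers for illustration only, not glusterfs API. */

/* Smallest number of locks that is a strict majority of n_servers. */
unsigned majority_needed(unsigned n_servers)
{
    return n_servers / 2 + 1;
}

/* True if the locks held form a strict majority of the servers. */
bool have_quorum(unsigned locks_held, unsigned n_servers)
{
    return n_servers > 0 && locks_held >= majority_needed(n_servers);
}
```

Note that with only two servers the majority is both of them, so under this
rule a 2-server setup cannot keep accepting writes while one server is down.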

Generally you are right. We thought about this type of situation especially
because glusterfs is somewhat client-driven. So you might end up in a weird
situation where client 1 thinks server A (of two servers, A and B) is down,
whereas client 2 sees both servers up. In that case your glusterfs is in good
shape for files touched by client 2, whereas it is broken for files touched
only by client 1. This, by the way, is the counterexample to your proposal of
longest-up-is-favorite-child: as a (single) client you cannot see the
"big picture" of the glusterfs as a whole. In this kind of situation only
latest-mtime saves you.

> If I did want to make a "best choice", though - I think I would choose 
> "volume associated with the longest running glusterfsd including being 
> actively ping accessible". It's not perfect either, but at least it 
> maximizes the chance that this is the one that most people using it would 
> have seen and made their decisions based upon.

See above.

> > Currently it seems that there is no real choice besides a defined favourite
> > child, the file action is only distributed between the children, which means
> > you just get a subset of old file copies.
> > I'd say the solution has to be placed somewhere at
> > xlators/cluster/afr/src/afr-self-heal-data.c lines 855 ff.
> > I have no idea though how to find out what the latest copy is ...
> > Comments?
> >    
> 
> Look at the stat() results for each of the files, and track the latest 
> mtime. But for two processes actively writing, this is still rolling a 
> die. In fact, just because it's latest now, doesn't mean it is latest 2 
> seconds from now...

Do you have an idea how to code that?

> Cheers,
> mark

-- 
Regards,
Stephan