Re: solutions for split brain situation

Stephan von Krawczynski <skraw@xxxxxxxxxx> · Mon, 14 Sep 2009 17:34:00 +0200

On Mon, 14 Sep 2009 10:33:13 -0400
Mark Mielke <mark@xxxxxxxxxxxxxx> wrote:

> On 09/14/2009 10:18 AM, Stephan von Krawczynski wrote:
> > Our "split brain" is no real split brain and looks like this: Logfiles are
> > written every 5 mins. If you add a secondary server that has 14-day-old
> > logfiles on it you notice that about half of your data vanishes while no
> > successful self heal is performed, because the old logfiles read from the
> > secondary server overwrite the new logfiles on your primary while new data is
> > added to them. This is a very simple and solvable situation. All you have to
> > do to resolve it 100% is to compare the files' mtime.
> >    
> 
> I'd argue that "longest running glusterfsd" is the right self-heal 
> solution here as well (as opposed to "latest mtime"). The secondary 
> server should come up with zero uptime, and have the lowest precedence.

See my other post for why this is no good in situations with multiple
clients...

> > Whereas a true split brain is rare, our situation arises every time you add a
> > server, maybe because you made a kernel update or needed a reboot for some
> > reason. Your secondary comes back and kicks your ass. Even better, it is
> > completely irrelevant which server gets re-added, as soon as you have old data
> > on it you are busted.
> >    
> 
> This is interesting. I wondered about this reading the documentation. I 
> came to the conclusion that there is supposed to be some sort of version 
> attribute attached to the file that will resolve this situation. Are you 
> doing something special - such as removing the volume from your 
> configuration, and then re-adding it? I don't know how it works - but if 
> the system simply goes down for a period - for kernel update or reboot - 
> I am led to believe that everything should be fine. Have you tried this?

I have, and it failed. File access is simply distributed among the servers,
and every file that is read from the older fileset gets damaged.
This is a "bad thing"(tm).
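
To illustrate the effect, here is a minimal Python sketch. It is not
glusterfs code. It assumes that reads simply alternate between the two
replicas per file (the real distribution policy is not spelled out in this
thread), and the names Copy, distributed_read and count_stale_reads are made
up for the example.

```python
from dataclasses import dataclass


@dataclass
class Copy:
    server: str
    mtime: int  # modification time, seconds since the epoch


def distributed_read(file_index, copies):
    """Assumed policy: alternate between the replicas by file index."""
    return copies[file_index % len(copies)]


def count_stale_reads(num_files, now, stale_age):
    """Count files whose read is served from an outdated copy.

    Every file has a current copy on the primary and a copy on the
    re-added secondary that is stale_age seconds old.
    """
    stale = 0
    for i in range(num_files):
        copies = [Copy("primary", now), Copy("secondary", now - stale_age)]
        chosen = distributed_read(i, copies)
        if chosen.mtime < max(c.mtime for c in copies):
            stale += 1
    return stale


if __name__ == "__main__":
    fourteen_days = 14 * 24 * 3600
    # With 10 files and two replicas, every second read hits the old copy.
    print(count_stale_reads(10, now=1_252_942_440, stale_age=fourteen_days))
```

Under that assumption the script prints 5: half of the files are served from
the 14-day-old copy, which matches the "about half of your data vanishes"
observation.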

> > You might argue to prevent that by simply deleting everything on a newly added
> > server. But if you deal with TBs of data you really do not want to spend the
> > time and network bandwidth to heal the data, when most of it is actually in
> > good shape and only some MBs or GBs are outdated.
> > Btw I know this is not what you call "split brain", but glusterfs thinks it
> > is, and that is part of the problem. It cannot distinguish the cases.
> > Your argument is broken anyways because in your situation you will lose the
> > data no matter if you keep the current implementation or create a new "option
> > favorite-child mtime" option. In the current implementation you will lose
> > about every other file content summing up to 100% of the files being damaged,
> > iff in a true split brain both servers get new data for their respective
> > fileset and are mixed together later on. If the file comes from server A you
> > lost all data added on server B during split brain and vice versa.
> > Thinking about it, it sounds as if the current implementation is the worst
> > possible. There is really no good reason for distributing file access in a
> > split brain detect situation. At least it should then choose the same child
> > for following file access to prevent the 100% loss.
> > Another idea would be switching split-brain files to read-only access. This
> > would be the conservative approach of not losing already written data - only
> > new writes get lost this way.
> >    
> 
> I think you want the "hot add / hot remove" functionality on the roadmap 
> for 2.2. If you are removing the volume while the system is down, then 
> you are using it outside the design case for the solution at present.
> 
> I agree it should work - eventually - but I think your use is outside of 
> intended scope at the moment.

I currently do not see a big problem here. In fact, if you really have the
option to choose the latest file copy for self-heal, you are safe in a lot of
strange cases, including the one where you export parts of the
gluster-exported files from the server by (kernel-)nfs to other clients. If
they fiddle around with the files on one server, the next glusterfs client
touching them self-heals them over the gluster infrastructure quite
automagically. So you can do that without harming your data. This only sounds
like an absurd config at first sight, but if you think of migration from nfs
to gluster you will run into this kind of situation almost immediately.
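
The "choose the latest file copy" rule can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not how
glusterfs implements self-heal: Replica, pick_heal_source and heal_targets
are hypothetical names, and the comparison only makes sense if the servers'
clocks are reasonably in sync.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    server: str
    mtime: float  # modification time of this server's copy


def pick_heal_source(replicas):
    """Return the copy with the newest mtime (first one wins on a tie)."""
    if not replicas:
        raise ValueError("no replicas to choose from")
    return max(replicas, key=lambda r: r.mtime)


def heal_targets(replicas):
    """Return the copies that are older than the source and need healing."""
    source = pick_heal_source(replicas)
    return [r for r in replicas if r.mtime < source.mtime]


if __name__ == "__main__":
    copies = [
        Replica("primary", mtime=1_252_942_440.0),
        # Copy left over on the re-added server, 14 days older.
        Replica("secondary", mtime=1_252_942_440.0 - 14 * 24 * 3600),
    ]
    print("heal from:", pick_heal_source(copies).server)
    print("overwrite:", [r.server for r in heal_targets(copies)])
```

With this rule the re-added server never wins against current data. It does
not help in a true split brain where both sides received writes: whatever was
added to the older copy is still lost.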

> Now, if your system is going down - no removal of the volume - and it 
> comes back up with the behaviour you describe, then I am very concerned 
> as well.

I can tell you for sure that this is the case. We have fallen into this hole
twice already and lost around 14 days of logs (half of them, of course,
because of the distributed file access).
This is really a hotspot that should be dealt with as soon as possible. The
only current solution is to delete everything before re-adding volumes.

> My opinion, anyways.
> 
> Cheers,
> mark

-- 
Regards,
Stephan