On Mon, 14 Sep 2009 14:01:50 -0500 (CDT) Anand Avati <avati@xxxxxxxxxxx> wrote:

> > the situation was really very simple. We drove a simple replicate setup
> > with one client and two servers, with one server down for testing.
> > This went on for about 10 days. In the background we rsync'ed the second
> > server from an old backup (some TB), hoping that self-heal would go a lot
> > faster if only the few new files had to be replicated.
> > Then we switched on glusterfsd and read this in the client logs:
> >
> > [2009-09-12 16:59:48] N [client-protocol.c:5559:client_setvolume_cbk] remote2:
> > Connected to 192.168.82.2:6996, attached to remote volume 'p3user'.
> > [2009-09-12 16:59:48] N [client-protocol.c:5559:client_setvolume_cbk] remote2:
> > Connected to 192.168.82.2:6996, attached to remote volume 'p3user'.
> > [2009-09-12 17:00:03] E [afr-self-heal-data.c:858:afr_sh_data_fix] replicate:
> > Unable to self-heal contents of 'XXX' (possible split-brain). Please delete
> > the file from all but the preferred subvolume.
> > [2009-09-12 17:00:03] E [afr-self-heal-data.c:858:afr_sh_data_fix] replicate:
> > Unable to self-heal contents of 'YYY' (possible split-brain). Please delete
> > the file from all but the preferred subvolume.
> > [2009-09-12 17:00:03] E [afr-self-heal-data.c:858:afr_sh_data_fix] replicate:
> > Unable to self-heal contents of 'ZZZ' (possible split-brain). Please delete
> > the file from all but the preferred subvolume.
> >
> > And so on...
> >
> > These were files that were written at this point, and their content got
> > overwritten by older versions of the same files that resided on server
> > remote2. As told, this happened to us before, but we did not fully
> > understand the facts back then. Now we know: no way around deleting
> > before adding...
>
> Was this "old backup" one of the subvolumes of replicate previously? Which
> was probably populated with some data when the other subvolume was down?
>
> Avati

I have problems understanding exactly the heart of your question. Since the
main reason for rsyncing the data was to take a backup of the primary server,
so that self-heal would have less to do, it is obvious (to me) that it has
been a subvolume of the replicate. In fact it was a backup of the _only_
subvolume (remember, we configured a replicate with two servers, where one of
them was actually not there until we fed it offline with the active server's
data and then tried to switch it online in glusterfs). Which means: of course
the "old backup" was a subvolume, and of course it was populated with some
data when the other subvolume was down.

Let me again describe it step by step:

1) setup design: one client, two servers with replicate subvolumes (a sketch
   of the client volfile follows below this list)
2) switch on server 1 and the client
3) copy data from the client to the glusterfs-exported server dir
4) rsync the exported dir from server 1 to server 2 (server 2 is not running
   glusterfsd at that time; see the rsync sketch below this list)
5) copy some more data from the client to the glusterfs-exported server dir
   (server 2 still offline). "More data" means files with the same names as
   in step 3) but different content.
6) bring server 2, with its partially old data, online by starting glusterfsd
7) the client sees server 2 for the first time
8) read in the "more data" from step 5) and therefore get split-brain error
   messages in the client's log (see the check below this list)
9) write back the "more data" again and then watch the content =>
10) depending on where the glusterfs client read the data from (server 1 or
    server 2), the file content is from step 5) (server 1 was read) or from
    step 3) (server 2 was read).
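To make step 1) concrete, the client-side volfile for such a setup looks
roughly like the sketch below. This is only a sketch: the remote2 / p3user /
6996 values are the ones from the log above, while the address of server 1
and the name remote1 are placeholders:

  volume remote1
    type protocol/client
    option transport-type tcp
    option remote-host 192.168.82.1   # placeholder for server 1
    option remote-port 6996
    option remote-subvolume p3user
  end-volume

  volume remote2
    type protocol/client
    option transport-type tcp
    option remote-host 192.168.82.2   # server 2, as seen in the log above
    option remote-port 6996
    option remote-subvolume p3user
  end-volume

  volume replicate
    type cluster/replicate
    subvolumes remote1 remote2
  end-volume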
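Step 4) boils down to something like this (hostname and path are examples,
not our real ones). One thing worth noting: a plain rsync -a does not copy
extended attributes, and glusterfs keeps its replication metadata
(trusted.afr.*) in exactly those xattrs on the backend files; --xattrs / -X
(rsync >= 3.0) would be needed to carry them over:

  # server 2 is NOT running glusterfsd at this point
  rsync -a /data/export/p3user/ server2:/data/export/p3user/

  # the same copy including xattrs (whether that helps or hurts the later
  # self-heal I cannot say):
  rsync -aX /data/export/p3user/ server2:/data/export/p3user/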
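For completeness, this is how the two copies of one of the affected files can
be inspected on the backends, and how the advice from the log ("delete the
file from all but the preferred subvolume") can be followed by hand. The
backend path /data/export/p3user and the mount point /mnt/p3user are
examples, and the exact trusted.afr.* attribute names depend on the volume
names in the volfile. A plain stat already shows which copy is newer - which
is exactly the mtime comparison I mean below:

  # on server 1 and on server 2, directly on the backend export:
  stat -c '%y  %s  %n' /data/export/p3user/XXX  # server 1 shows the newer mtime
  getfattr -d -m trusted.afr -e hex /data/export/p3user/XXX

  # following the log's advice: remove the stale copy on the non-preferred
  # subvolume (server 2 in our case) ...
  rm /data/export/p3user/XXX                    # on server 2 only!

  # ... then access the file through the client mount so self-heal recreates
  # it from the copy on server 1:
  stat /mnt/p3user/XXX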
Result: the data from step 3) is outdated, because glusterfs failed to notice
that the files with the same names existing on server 1 and server 2 are in
fact new (server 1) and old (server 2), and that therefore _only_ the files
from server 1 should have been the preferred copies. glusterfs could have
noticed this by simply comparing the mtimes of the file copies. But it thinks
this was split-brain - which it was not, it was simply a secondary server
being brought up for the very first time with a backed-up fileset - and it
damages the data by distributing the reads between server 1 and server 2.

-- 
Regards,
Stephan