> the situation was really very simple. We ran a simple replicate setup with
> one client and two servers, with one server down for testing.
> This went on for about 10 days. In the background we rsync'ed the second server
> from an old backup (some TB), hoping that self-heal would go a lot faster if
> only the few new files had to be replicated.
> Then we switched on glusterfsd and read this in the client logs:
>
> [2009-09-12 16:59:48] N [client-protocol.c:5559:client_setvolume_cbk] remote2:
> Connected to 192.168.82.2:6996, attached to remote volume 'p3user'.
> [2009-09-12 16:59:48] N [client-protocol.c:5559:client_setvolume_cbk] remote2:
> Connected to 192.168.82.2:6996, attached to remote volume 'p3user'.
> [2009-09-12 17:00:03] E [afr-self-heal-data.c:858:afr_sh_data_fix] replicate:
> Unable to self-heal contents of 'XXX' (possible split-brain). Please delete
> the file from all but the preferred subvolume.
> [2009-09-12 17:00:03] E [afr-self-heal-data.c:858:afr_sh_data_fix] replicate:
> Unable to self-heal contents of 'YYY' (possible split-brain). Please delete
> the file from all but the preferred subvolume.
> [2009-09-12 17:00:03] E [afr-self-heal-data.c:858:afr_sh_data_fix] replicate:
> Unable to self-heal contents of 'ZZZ' (possible split-brain). Please delete
> the file from all but the preferred subvolume.
>
> And so on...
>
> These were files that were being written at this point, and their content got
> overwritten by older versions of the same files that resided on server remote2.
> As mentioned, this happened to us before, but we did not fully understand the
> facts back then. Now we know: there is no way around deleting before adding...

Was this "old backup" one of the subvolumes of replicate previously? Which was
probably populated with some data when the other subvolume was down?

Avati
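A side note for anyone who has to act on the log's advice ("delete the file from all but the preferred subvolume"): the first step is collecting the list of affected paths. Below is a minimal sketch, not part of GlusterFS, that assumes only the client-log format quoted above and prints the paths flagged as possible split-brain; the script name and structure are illustrative.

    #!/usr/bin/env python3
    """Hypothetical helper: list files reported as possible split-brain
    in a glusterfs client log, so they can be removed by hand from all
    but the preferred subvolume before self-heal is retried."""
    import re
    import sys

    # Matches the afr-self-heal-data error lines quoted above, e.g.:
    # ... E [afr-self-heal-data.c:858:afr_sh_data_fix] replicate:
    #     Unable to self-heal contents of 'XXX' (possible split-brain). ...
    PATTERN = re.compile(r"Unable to self-heal contents of '([^']+)'")

    def split_brain_paths(log_path):
        """Return the unique paths reported as possible split-brain."""
        paths = []
        with open(log_path, errors="replace") as log:
            for line in log:
                match = PATTERN.search(line)
                if match and match.group(1) not in paths:
                    paths.append(match.group(1))
        return paths

    if __name__ == "__main__":
        for path in split_brain_paths(sys.argv[1]):
            print(path)

Running it as "python3 list_split_brain.py /var/log/glusterfs/client.log" (path assumed, adjust to your setup) would print XXX, YYY, ZZZ for the excerpt above.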