Ed, Chad, Stephen,

We believe we have fixed all (known) problems with self-heal in the latest releases, so we would be very interested in diagnostics if you can reproduce the problem or still see it frequently. Please collect the logs by running the client and servers at log level TRACE while reproducing the problem. Also collect the backend extended attributes of the file on both servers before self-heal is triggered. This command can be used to get that information:

# getfattr -d -m '.*' -e hex <filename>

Thank you for your help in debugging any new problem that shows up. Feel free to ask if you have any queries about this.

Regards,
Tejas.

----- Original Message -----
From: "Ed W" <lists at wildgooses.com>
To: "Gluster Users" <gluster-users at gluster.org>
Sent: Monday, March 8, 2010 5:22:40 PM GMT +05:30 Chennai, Kolkata, Mumbai, New Delhi
Subject: Re: How to re-sync

On 07/03/2010 16:02, Chad wrote:
> Is there a gluster developer out there working on this problem specifically?
> Could we add some kind of "sync done" command that has to be run manually, and until it is, the failed node is not used?
> The bottom line for me is that I would much rather run on a performance-degraded array until a sysadmin intervenes than lose any data.

I'm only in evaluation mode at the moment, but resolving split brain is something that is terrifying me, and I have been giving some thought to how it needs to be done with various solutions.

In the case of gluster it really does seem very important to figure out a reliable way to know when the system is fully synced again after an outage. For example, a not-unrealistic situation if you were doing a bunch of upgrades would be:

- Turn off server 1 (S1) and upgrade it; server 2 (S2) now deviates from S1
- Turn on server 1 and expect it to sync all changes made while it was down - the key expectation here is that S1 only receives changes from S2 and never sends changes
- Some event marks the sync complete, so that we can turn off S2 and upgrade it

The problem otherwise, if you don't wait for the sync, is that you turn off S2 and now S1 doesn't know about changes made while it was off, and serves up incomplete information. Split brain can occur when a file is changed on both servers while they couldn't talk to each other, and then changes must be lost...

I suppose a really cool translator could be written to track changes made to an AFR group while one member is missing; the list of out-of-sync files could then be resupplied once that member came back online, in order to speed up replication... Kind of a lot of work for a small improvement, but it could be interesting to create...

Perhaps some dev has other suggestions on a "procedure" to follow to avoid split brain in the situation where we need to turn off the servers one by one in an AFR group?

Thanks

Ed W

_______________________________________________
Gluster-users mailing list
Gluster-users at gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
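[Editor's note: as background for the getfattr step Tejas asks for above, the trusted.afr.* changelog attributes that AFR stores on each replica encode, as I understand it, three big-endian 32-bit pending-operation counters (data, metadata, entry); non-zero counters on both servers for the same file are the signature of a potential split brain. The sketch below decodes one such hex value; the value itself is a made-up illustration, not output from this thread:]

```shell
# Decode a trusted.afr changelog value as printed by `getfattr -e hex`
# into its three big-endian 32-bit counters: data, metadata, entry.
# The value below is illustrative only.
val="0x000000020000000100000000"
hex=${val#0x}                  # strip the 0x prefix
data=$((16#${hex:0:8}))        # first 4 bytes: pending data operations
meta=$((16#${hex:8:8}))        # next 4 bytes: pending metadata operations
entry=$((16#${hex:16:8}))      # last 4 bytes: pending entry (directory) operations
echo "data=$data metadata=$meta entry=$entry"
# prints: data=2 metadata=1 entry=0
```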