How to re-sync

lists at wildgooses.com (Ed W) · Wed, 10 Mar 2010 17:41:36 +0000

Hi

> 1. I think when a server goes down it should be "flagged as faulty" 
> and send out notifications (in this case to all the clients notifying 
> them not to use the downed server [I hope this removes the small hang 
> delay for clients 2-N], as well as via e-mail to the sysadmin).

So when a server dies suddenly, it should tell people that it has just 
died suddenly??  A dead server can't send anything, so something else 
would have to notice.  Hmm, this needs some more thought...

However, I guess it could be useful to have a way to take servers out of 
service for scheduled reasons?  Perhaps this is what you meant?

> 3. Then when the down server comes back and starts glusterfsd it 
> remains "faulty" and no client can use it.

I agree with where you are going, but if the autoheal works as 
advertised then there is no reason to stop any client using the 
returned server - it will simply self-heal as soon as someone requests 
a file which is stale (this is at least what it's claimed to do...)

> 5. A sysadmin changes the "faulty flag" to a "resync flag" ("resync 
> flag" tells the clients to write to the machine, but not read from it 
> while it recovers).
> 6. A sysadmin then runs a re-sync (ls -alR).
> 7. Once the re-sync completes a sysadmin runs a "re-add" command 
> removing the "faulty flag" and the clients can begin using the server 
> again.

I do agree that it would be very helpful to have an idea of whether 
servers are properly in sync or not though.

Consider the scenario of upgrading a cluster, i.e. take down S1, upgrade 
it, bring it up again, then take down S2, upgrade it and bring it up 
again.  If S1 is not fully resynced from S2 before S2 goes down, then 
you have a split brain situation which risks data loss...
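
To make that concrete, here is a toy model in Python.  It is only a 
sketch of the reasoning, not how glusterfs tracks changes: each replica 
is just a set of write IDs, and the names (rolling_upgrade, w1..w3) are 
made up for illustration.

```python
def rolling_upgrade(sync_in_middle):
    """Toy model: each replica is the set of write IDs it has seen.

    Returns True if S1 and S2 end up in split brain, i.e. each holds
    a write that the other never received.
    """
    s1, s2 = set(), set()

    # Both servers up: the write lands on both replicas.
    s1.add("w1")
    s2.add("w1")

    # S1 is down for its upgrade: the write only lands on S2.
    s2.add("w2")

    # S1 is back.  Optionally bring it fully up to date from S2.
    if sync_in_middle:
        s1 |= s2

    # S2 is down for its upgrade: the write only lands on S1.
    s1.add("w3")

    # S2 is back.  Split brain means BOTH sides have unseen writes.
    return bool(s1 - s2) and bool(s2 - s1)


if __name__ == "__main__":
    print("no sync in the middle:", rolling_upgrade(False))  # True
    print("sync in the middle:   ", rolling_upgrade(True))   # False
```

With the sync in the middle, S2 is merely stale when it returns and can 
be healed from S1.  Without it, neither copy is a superset of the other.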

Perhaps the ls -alR really is sufficient to guarantee that the entire 
filesystem is synced, but split brain IS the major fear with clustered 
systems and it would be nice to have even stronger guarantees of 
consistency...
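
For what it's worth, the ls -alR trick amounts to a stat of every entry 
under the mount point.  A rough Python equivalent that also reports 
what it could not stat might look like the sketch below.  The function 
name is made up, and whether a stat is enough to trigger a heal is 
exactly the claim I can't vouch for, so treat it as an assumption.

```python
import os


def walk_and_stat(root):
    """lstat every entry below root, roughly what `ls -alR` does.

    Returns (entries_seen, failed_paths).  A non-empty failed_paths
    means the walk did NOT cover the whole tree.
    """
    seen = 0
    failed = []

    def on_error(err):
        # os.walk calls this when a directory cannot be listed.
        failed.append(err.filename)

    for dirpath, dirnames, filenames in os.walk(root, onerror=on_error):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                # Assumption: the stat is what prompts the self-heal check.
                os.lstat(path)
                seen += 1
            except OSError:
                failed.append(path)

    return seen, failed
```

Unlike a bare ls -alR whose output nobody reads, this at least gives a 
script something to check before declaring the server resynced.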

> I feel that this method removes the chance that a server goes down, 
> gets out of sync, recovers on its own (or through automated tools), and 
> starts providing services with some old data.
> In the middle of the night if the server goes down, and nagios trips a 
> reboot, then the server comes up, no sysadmin is logged in to run the 
> "ls -alR" to get the server to re-sync.

Yeah, I agree that this scenario is scary.  Actually you left out an 
implied step: if the *other* server dies before the resync happens, 
then you risk split brain.

Arguably it's not necessary to fence the recovering server during the 
recovery, but you definitely want to fence it if it cannot completely 
resync for some reason...
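
The flags you describe boil down to a small state machine.  Here is a 
sketch in Python, purely to pin down the rules being discussed: the 
"faulty" and "resync" flags are your proposal, not an existing 
glusterfs feature as far as I know, and all the names are hypothetical.

```python
from enum import Enum


class State(Enum):
    ACTIVE = "active"   # normal service
    FAULTY = "faulty"   # fenced: no reads, no writes
    RESYNC = "resync"   # recovering: writes only


# Transitions allowed in this sketch (hypothetical, from the proposal).
ALLOWED = {
    (State.ACTIVE, State.FAULTY),  # server goes down
    (State.FAULTY, State.RESYNC),  # sysadmin starts the recovery
    (State.RESYNC, State.ACTIVE),  # "re-add" after a complete resync
    (State.RESYNC, State.FAULTY),  # resync failed: fence it again
}


def transition(current, target):
    """Return target if the move is allowed, else raise ValueError."""
    if (current, target) not in ALLOWED:
        raise ValueError(f"{current.value} -> {target.value} not allowed")
    return target


def can_read(state):
    return state is State.ACTIVE


def can_write(state):
    return state in (State.ACTIVE, State.RESYNC)


def finish_resync(state, failed_paths):
    """Re-add the server only if nothing failed to resync."""
    target = State.FAULTY if failed_paths else State.ACTIVE
    return transition(state, target)
```

The point is that there is no path from "faulty" straight back to 
"active": a server rebooted by nagios in the night stays fenced until 
a resync has actually completed.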

Ed W