> When the server dies suddenly it should tell people it dies suddenly??
> Hmm, this needs some more thought...

Not the server that went down, the first client that notices (times out
on the server) should send out the notifications. I thought this note
made that clear, my bad.

>> (in this case to all the clients notifying them not to use the downed
>> server [I hope this removes the small hang delay for clients 2-N], as
>> well as via e-mail to the sysadmin).

^C

Ed W wrote:
> Hi
>
>> 1. I think when a server goes down it should be "flagged as faulty"
>> and send out notifications (in this case to all the clients notifying
>> them not to use the downed server [I hope this removes the small hang
>> delay for clients 2-N], as well as via e-mail to the sysadmin).
>
> When the server dies suddenly it should tell people it dies suddenly??
> Hmm, this needs some more thought...
>
> However, I guess it could be useful to have a way to take servers out of
> service for scheduled reasons? Perhaps this is what you meant?
>
>> 3. Then when the down server comes back and starts glusterfsd it
>> remains "faulty" and no client can use it.
>
> I agree with where you are going, but if the autoheal works as
> advertised then there is no reason to stop any client using it - it will
> simply self heal as soon as someone requests a file which is stale (this
> is at least what it's claimed to do...)
>
>> 5. A sysadmin changes the "faulty flag" to a "resync flag" ("resync
>> flag" tells the clients to write to the machine, but not read from it
>> while it recovers).
>> 6. A sysadmin then runs a re-sync (ls -alR).
>> 7. Once the re-sync completes a sysadmin runs a "re-add" command
>> removing the "faulty flag" and the clients can begin using the server
>> again.
>
> I do agree that it would be very helpful to have an idea of whether
> servers are properly in sync or not though.
>
> Consider the scenario of upgrading a cluster, i.e. take down S1, upgrade
> it, then bring it up again, take down S2, upgrade it, then bring it up
> again. If you don't fully sync S1 and S2 in the middle then you have a
> split brain situation which must lead to data loss...
>
> Perhaps the ls -alR is 100% sufficient to guarantee the entire
> filesystem is synced and hence is completely sufficient, but split brain
> IS the major fear with clustered systems and it would be nice to have
> even stronger guarantees of consistency...
>
>
>> I feel that this method removes the chance that a server goes down,
>> gets out of sync, recovers on its own (or through automated tools), and
>> starts providing services with some old data.
>> In the middle of the night if the server goes down, and nagios trips a
>> reboot, then the server comes up, no sysadmin is logged in to run the
>> "ls -alR" to get the server to re-sync.
>
> Yeah, I agree that this scenario is scary. Actually you missed out an
> implied step which is if the *other* server dies before the resync
> happens then you have a risk of split brain.
>
> Arguably it's not necessary to fence the recovering server during the
> recovery, but you definitely want to fence it if it cannot completely
> resync for some reason...
>
>
> Ed W
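
For anyone following the thread, the flagging scheme in steps 1-7 above can be read as a small state machine. What follows is only a rough sketch in Python under stated assumptions: the names ServerState, ServerFlag, ALLOWED, can_read and can_write are made up for illustration and are not GlusterFS API or configuration. The RESYNC -> FAULTY transition is also an assumption, added to reflect Ed's point about fencing a server that cannot complete its resync.

```python
from enum import Enum


class ServerState(Enum):
    HEALTHY = "healthy"  # normal service: clients read and write
    FAULTY = "faulty"    # flagged after a client timeout: no client uses it
    RESYNC = "resync"    # recovering: clients write to it but do not read


# Allowed transitions, following the numbered steps in the proposal.
# All names here are hypothetical; nothing below is a GlusterFS interface.
ALLOWED = {
    (ServerState.HEALTHY, ServerState.FAULTY),  # step 1: a client times out
    (ServerState.FAULTY, ServerState.RESYNC),   # step 5: sysadmin sets resync flag
    (ServerState.RESYNC, ServerState.HEALTHY),  # step 7: sysadmin re-adds server
    (ServerState.RESYNC, ServerState.FAULTY),   # assumption: fence again if resync fails
}


class ServerFlag:
    """Hypothetical per-server flag as seen by every client."""

    def __init__(self):
        self.state = ServerState.HEALTHY

    def transition(self, new_state):
        # Reject anything outside the proposal, e.g. FAULTY -> HEALTHY,
        # so a server that reboots on its own stays fenced (step 3).
        if (self.state, new_state) not in ALLOWED:
            raise ValueError(
                f"illegal transition {self.state.name} -> {new_state.name}"
            )
        self.state = new_state

    def can_read(self):
        return self.state is ServerState.HEALTHY

    def can_write(self):
        return self.state in (ServerState.HEALTHY, ServerState.RESYNC)
```

The point of the sketch is that there is no direct FAULTY -> HEALTHY edge: a server rebooted by nagios in the middle of the night stays fenced until someone moves it through RESYNC, which is the behaviour the proposal asks for.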