Hi

> 1. I think when a server goes down it should be "flagged as faulty"
> and send out notifications (in this case to all the clients notifying
> them not to use the downed server [I hope this removes the small hang
> delay for clients 2-N], as well as via e-mail to the sysadmin).

When the server dies suddenly it should tell people it dies suddenly?? Hmm, this needs some more thought...

However, I guess it could be useful to have a way to take servers out of service for scheduled reasons? Perhaps this is what you meant?

> 3. Then when the down server comes back and starts glusterfsd it
> remains "faulty" and no client can use it.

I agree with where you are going, but if the autoheal works as advertised then there is no reason to stop any client using it - it will simply self heal as soon as someone requests a file which is stale (this is at least what it's claimed to do...)

> 5. A sysadmin changes the "faulty flag" to a "resync flag" ("resync
> flag" tells the clients to write to the machine, but not read from it
> while it recovers).
> 6. A sysadmin then runs a re-sync (ls -alR).
> 7. Once the re-sync completes a sysadmin runs a "re-add" command
> removing the "faulty flag" and the clients can begin using the server
> again.

I do agree that it would be very helpful to have an idea of whether servers are properly in sync or not though. Consider the scenario of upgrading a cluster, ie take down S1, upgrade it, then bring it up again, take down S2, upgrade it, then bring it up again. If you don't fully sync S1 and S2 in the middle then you have a split brain situation which must lead to data loss...

Perhaps the ls -alR is 100% sufficient to guarantee the entire filesystem is synced and hence is completely sufficient, but split brain IS the major fear with clustered systems and it would be nice to have even stronger guarantees of consistency...
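To make the flag scheme in the quoted steps concrete, here is a rough sketch of it as a small state machine. This is purely illustrative and is not how glusterfs works today: the names `ServerState`, `Server`, `TRANSITIONS` and `readable_servers` are all made up for this example, and the rule that a resyncing server may drop back to faulty is my own assumption.

```python
from enum import Enum


class ServerState(Enum):
    IN_SERVICE = "in_service"  # no flag set: reads and writes allowed
    FAULTY = "faulty"          # fenced: clients must not use the server
    RESYNC = "resync"          # clients write to it but do not read from it


# Allowed transitions, following the quoted steps 1-7.
# RESYNC -> FAULTY is an assumption: a resync that fails re-fences the server.
TRANSITIONS = {
    ServerState.IN_SERVICE: {ServerState.FAULTY},
    ServerState.FAULTY: {ServerState.RESYNC},
    ServerState.RESYNC: {ServerState.IN_SERVICE, ServerState.FAULTY},
}


class Server:
    def __init__(self, name):
        self.name = name
        self.state = ServerState.IN_SERVICE

    def set_state(self, new_state):
        # Refuse shortcuts such as FAULTY -> IN_SERVICE without a resync.
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(
                f"{self.name}: {self.state.name} -> {new_state.name} not allowed"
            )
        self.state = new_state

    def can_read(self):
        return self.state is ServerState.IN_SERVICE

    def can_write(self):
        return self.state in (ServerState.IN_SERVICE, ServerState.RESYNC)


def readable_servers(servers):
    """Names of the servers a client may read from right now."""
    return [s.name for s in servers if s.can_read()]
```

With two servers, marking S1 faulty leaves only S2 readable. If S2 then fails while S1 is still in resync, `readable_servers` returns an empty list, so in this model clients get no data at all rather than possibly stale data.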
> I feel that this method removes the chance that a server goes down,
> gets out of sync, recovers on its own (or though automated tools), and
> starts providing services with some old data.

> In the middle of the night if the server goes down, and nagios trips a
> reboot, then the server comes up, no sysadmin is logged in to run the
> "ls -alR" to get the server to re-sync.

Yeah, I agree that this scenario is scary. Actually you missed out an implied step, which is that if the *other* server dies before the resync happens then you have a risk of split brain.

Arguably it's not necessary to fence the recovering server during the recovery, but you definitely want to fence it if it cannot completely resync for some reason...

Ed W