--- Krishna Srinivas <krishna@xxxxxxxxxxxxx> wrote:
> > I am curious, is client side AFR susceptible
> > to race conditions on writes? If not, how is this
> > mitigated?
> This is a known issue with the client side AFR.

Ah, OK. Perhaps it is already documented somewhere, but I can't help
but think that the AFR translator deserves a page dedicated to some of
the design trade-offs made and the impact they have. With enough
thought it is possible to deduce or guess at some of the potential
problems, such as split brain and race conditions, but for most of us
it remains a guess until we ask on the list. Perhaps with the help of
others I will set up a wiki page for this. That kind of documented
information would probably help in situations like the one with
Garreth, where he felt misled by the glusterfs documentation.

> We can solve this by locking but there will be
> performance hit. Of course if applications lock
> themselves then all will be fine. I feel we can have
> it as an option to disable the locking
> in case users are more concerned about performance.
>
> Do you have any suggestions?

I haven't given it a lot of thought, but how would the locking work?
Would you be doing:

SubA         AFR      application     SubB
 |            |            |            |
 |            |<---write---|            |
 |            |            |            |
 |<---lock----|-----------lock--------->|
 |---locked-->|<---------locked---------|
 |            |            |            |
 |<--write----|----------write--------->|
 |--written-->|<--------written---------|
 |            |            |            |
 |<--unlock---|----------unlock-------->|
 |--unlocked->|<--------unlocked--------|
 |            |            |            |
 |            |---written->|            |

because that does seem to be a rather large three-roundtrip latency
versus the current single roundtrip, not counting all the lock
contention performance hits! This solution also has the problem of
lock recovery if a client dies.

If instead a rank (which could be configurable or random) were given
to each subvolume on startup, one alternative would be to always write
to the highest-ranking subvolume first (here A has a higher rank than
B):

SubA        AFR        Application        SubB
 |           |               |               |
 |           |<----write-----|               |
 |<--write---|               |               |
 |--version->|               |               |
 |           |----written--->|               |
 |           |               |               |
 |           |----------(quick)heal--------->|
 |           |<------------healed------------|

The quick heal would essentially be the write, but knowing/enforcing
the version number returned from the SubA write. Since all clients
would always have to write to SubA first, SubA's ordering would be
reflected on every subvolume. While this solution leaves a potentially
larger window during which SubB is unsynced, it should maintain the
single-roundtrip latency from an application's standpoint and avoid
any lock contention performance hits. If a client dies in this
scenario, any other client could always heal SubB from SubA, so there
are no lock recovery problems. (A rough sketch of this idea follows
below.)

Both of these solutions could probably be greatly enhanced with a
write-ahead log translator or some form of buffering above each
subvolume; that would decrease the latency by allowing the write data
to be transferred before or while the lock/ordering information is
synchronized. But that may be rather complicated. As is, though, they
both seem like fairly simple solutions without too much of a design
change.

The non-locking approach seems a little odd at first and may be more
of a conceptual change to the current AFR method, but the more I think
about it, the more appealing it seems. Perhaps it would not actually
even be a big coding change? I can't help but think that this method
could also potentially eliminate more split-brain situations, but I
haven't worked that out yet.
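To make the ordering idea a little more concrete, here is a minimal,
stand-alone C sketch (not GlusterFS code; the names subvol,
write_primary, quick_heal, and afr_write are made up purely for
illustration). Every write goes to the highest-ranking subvolume
first, which hands back a version number; the lower-ranking subvolume
then applies the same write tagged with that version, so both end up
with the same ordering without taking any lock:

/*
 * Hypothetical sketch only (not GlusterFS code): two in-memory
 * "subvolumes" that each keep a version counter for a single toy file.
 * All writers hit the higher-ranked subvolume first; the version it
 * hands back is then enforced on the lower-ranked one, so both apply
 * writes in the same order without any locking.
 */
#include <stdio.h>

struct subvol {
        const char *name;
        int         rank;       /* higher rank is written first       */
        int         version;    /* last version applied to the "file" */
        char        data[64];   /* toy file contents                  */
};

/* Write to the authoritative (highest-ranked) subvolume; it assigns
 * the next version number and returns it to the caller. */
static int write_primary(struct subvol *sv, const char *buf)
{
        sv->version += 1;
        snprintf(sv->data, sizeof(sv->data), "%s", buf);
        return sv->version;
}

/* "Quick heal": apply the same write to a lower-ranked subvolume, but
 * only if it does not already have this or a newer version, so a
 * delayed or replayed heal cannot reorder the file. */
static void quick_heal(struct subvol *sv, const char *buf, int version)
{
        if (version <= sv->version)
                return;         /* already have it (or something newer) */
        sv->version = version;
        snprintf(sv->data, sizeof(sv->data), "%s", buf);
}

/* One application write as seen by AFR: primary first, heal second. */
static void afr_write(struct subvol *a, struct subvol *b, const char *buf)
{
        struct subvol *hi = (a->rank > b->rank) ? a : b;
        struct subvol *lo = (a->rank > b->rank) ? b : a;
        int v = write_primary(hi, buf); /* single round trip for the app */
        quick_heal(lo, buf, v);         /* may lag; order is fixed by v  */
}

int main(void)
{
        struct subvol A = { "SubA", 2, 0, "" };
        struct subvol B = { "SubB", 1, 0, "" };

        afr_write(&A, &B, "write from client 1");
        afr_write(&A, &B, "write from client 2");

        printf("%s: v%d \"%s\"\n", A.name, A.version, A.data);
        printf("%s: v%d \"%s\"\n", B.name, B.version, B.data);
        return 0;
}

In a real translator the quick heal would presumably be asynchronous
and carry the offset/length of the write, but the version check is the
part that keeps SubB's ordering identical to SubA's.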
There is a somewhat subtle reason for this, but it makes sense that
the locking solution is slower, since locking enforces serialization
across all of the writes. That serialization is not really what is
needed; we only need to ensure that the (potentially unserialized)
ordering of writes ends up the same on both subvolumes.

Thoughts?

-Martin

P.S. Simple ascii diagrams generated with:
http://www.theficks.name/test/Content/pmwiki.php?n=Sdml.HomePage