sorry, forgot to CC to gluster-devel list

---------- Forwarded message ----------
From: Alexey Filin <alexey.filin@xxxxxxxxx>
Date: Oct 23, 2007 5:11 PM
Subject: Re: proposals to afr
To: Krishna Srinivas <krishna@xxxxxxxxxxxxx>

Hi Krishna,

On 10/23/07, Krishna Srinivas <krishna@xxxxxxxxxxxxx> wrote:
>
> Alexy,
>
> I could not exactly understand the algorithm and what it is fixing :(
> can you explain once again?

yes, of course, thanks for the answer :)

> AFR does not differentiate between its children as master/slaves
> when it writes the data, it sends write() operation simultaneously
> to all the children.
> Lets call it first child, 2nd child etc instead of master/slave.

I call the afr xlator node the master (it may be glfs-client- or
glfs-server-side) and its children the slaves, because each operation
goes through only one afr xlator node and many children. The writes are
simultaneous only if the children's latencies (which exist already)
don't matter. Under high load and/or for geographically dispersed
children the latencies have to be considered.

> more inline....
>
> On 10/22/07, Kevan Benson <kbenson@xxxxxxxxxxxxxxx> wrote:
> > Alexey Filin wrote:
> > > Hi,
> > >
> > > may I propose some ideas to be implemented inside afs to increase
> > > its reliability?
> > >
> > > * First idea: an extra extended attribute named e.g.
> > > afr_op_counter provides info about operations performed currently
> > > over file, so operations changing a file's (meta)data are done in
> > > a way:
> > >
> > > 1) afr_master.increase_afr_op_counter <for file in namespace>
> > > 2) real operation over file (meta)data
> > > 3) afr_master.start_op -> afr_slave.increase_afr_op_counter <for file on a slave>
> > > 4) loop over all slaves by 2)-3)
> > >
> > > during close():
> > >
> > > 1) afr_master.zero_op -> afr_slave.zero_afr_op_counter <for file on a slave>
> > > 2) loop over all slaves by 1)
> > > 3) afr_master.zero_afr_op_counter <for file in namespace>
> > >
> > > with the scheme all operations finished incorrectly are disclosed
> > > in a simple and fast way (with non-zero counter), that scheme is
> > > not a replacement for the afr version xattr, it is a complement
> > > allowing to find inconsistent replicas when close() doesn't update
> > > the xattr on slaves due to afr master crash
> > >
> >
> > Hmm, sort of like a trusted_afr_version minor number, that gets set
> > while in an operation. Essentially equivalent to taking a file with
> > an afr version of 3 and making it 3.5 for the duration of the
> > operation, and 4 on close. Any files on slaves that show they are in
> > an op but no operation is actually in place need to be self-healed.
> > Sounds good to me, but then again, I'm not a GlusterFS dev. ;)
>
> What situation are we trying to handle here that is not handled in
> the way it works now?
>
> There is one situation, suppose first write fails on 1st child (crash)
> and succeeds on the 2nd child (second write will not happen on 1st
> child), second write fails on the 2nd child (crash), now close will
> not increment version on both the children so the data is
> inconsistent but the version number remains the same. we can handle
> this, when a write fails from one of the child, increment version on
> all the other children so that during the next open() we sync it.

correctly, if the afr xlator node doesn't crash during writing. If it
crashes, close() is not issued at all and the version attribute is not
updated (according to your description). If the children update the
version without assistance as a workaround after the afr xlator node
crashes, the new versions of the replicas are equal but the data can be
different (because operations are queued in the interconnect and issued
sequentially as datagrams come out). Such a situation can't occur if
every operation is atomic relative to the version attribute, i.e. the
attribute is updated immediately after every operation.

I'll be happy if there is something I don't know that helps to handle
the situation correctly in the current implementation.

Regards, Alexey.
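
A minimal sketch of the afr_op_counter idea discussed in this thread,
written as a toy in-memory Python model. It is only an illustration
under stated assumptions: the names Replica, AfrNode, unfinished and
crash_after are made up here and are not GlusterFS APIs, the xattrs are
plain dictionary entries, and the order of steps follows the numbered
list in the quoted proposal.

```python
# Toy in-memory model; every name here is hypothetical, not a GlusterFS API.

class Replica:
    """One AFR child ("slave"): file contents plus extended attributes."""

    def __init__(self, name):
        self.name = name
        self.data = b""
        self.xattr = {"afr_version": 0, "afr_op_counter": 0}


class AfrNode:
    """The afr xlator node ("master"); it drives every child."""

    def __init__(self, children):
        self.children = children
        self.ns_op_counter = 0  # counter kept for the file in the namespace

    def write(self, chunk, crash_after=None):
        """One write; crash_after=N simulates the node dying after N children."""
        self.ns_op_counter += 1                  # step 1: mark file in namespace
        for done, child in enumerate(self.children):
            if crash_after is not None and done >= crash_after:
                raise RuntimeError("afr node crashed mid-operation")
            child.data += chunk                  # step 2: the real operation
            child.xattr["afr_op_counter"] += 1   # step 3: mark file on the child

    def close(self):
        """Clean close: zero the counters and bump the version on every child."""
        for child in self.children:
            child.xattr["afr_op_counter"] = 0
            child.xattr["afr_version"] += 1
        self.ns_op_counter = 0


def unfinished(afr):
    """Names of children whose counter is non-zero although no file is open."""
    return [c.name for c in afr.children if c.xattr["afr_op_counter"] != 0]


a, b = Replica("child1"), Replica("child2")
afr = AfrNode([a, b])
afr.write(b"one")
afr.close()
try:
    afr.write(b"two", crash_after=1)  # node dies before child2; close() never runs
except RuntimeError:
    pass
# Versions are still equal although the data differs; the counters reveal it.
print(a.xattr["afr_version"] == b.xattr["afr_version"], a.data == b.data)
print(unfinished(afr), afr.ns_op_counter)
```

In this model the version xattr alone cannot tell the two replicas
apart after the simulated crash, while the non-zero counters on child1
and in the namespace flag the file as a candidate for self-heal.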