----- Original Message -----
> From: "Xavier Hernandez" <xhernandez@xxxxxxxxxx>
> To: "Raghavendra G" <raghavendra@xxxxxxxxxxx>, "Nithya Balachandran" <nbalacha@xxxxxxxxxx>
> Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>, "Mohit Agrawal" <moagrawa@xxxxxxxxxx>
> Sent: Thursday, September 15, 2016 4:54:25 PM
> Subject: Re: Query regards to heal xattr heal in dht
>
> On 15/09/16 11:31, Raghavendra G wrote:
> >
> > On Thu, Sep 15, 2016 at 12:02 PM, Nithya Balachandran <nbalacha@xxxxxxxxxx> wrote:
> >
> > On 8 September 2016 at 12:02, Mohit Agrawal <moagrawa@xxxxxxxxxx> wrote:
> >
> > Hi All,
> >
> > I have another solution to heal user xattrs, but before implementing it I
> > would like to discuss it with you.
> >
> > Can I call a function (dht_dir_xattr_heal, which internally calls
> > syncop_setxattr) to heal the xattrs at the end of dht_getxattr_cbk, after
> > making sure we have a valid xattr? In dht_dir_xattr_heal it would either
> > blindly copy all user xattrs onto all subvolumes, or I can compare each
> > subvolume's xattrs with the valid xattrs and call syncop_setxattr only if
> > there is a mismatch; otherwise there is no need to call syncop_setxattr.
> >
> > This can be problematic if a particular xattr is being removed - it might
> > still exist on some subvols. IIUC, the heal would go and reset it again?
> >
> > One option is to use the hashed subvol of the dir as the source - perform the
> > xattr op on the hashed subvol first and on the others only if it succeeds on
> > the hashed one. This does have the problem of being unable to set xattrs if
> > the hashed subvol is unavailable. That might not be such a big deal for
> > distributed-replicate or distributed-disperse volumes, but it will affect
> > pure distribute. However, this way we can at least be reasonably certain of
> > correctness (leaving rebalance out of the picture).
> >
> > * What is the behavior of getxattr when the hashed subvol is down? Should we
> > succeed with values from the non-hashed subvols, or should we fail getxattr?
> > With the hashed subvol as the source of truth, it's difficult to determine
> > the correctness of xattrs and their values when it is down.
> >
> > * setxattr is an inode operation (as opposed to an entry operation), so we
> > cannot calculate the hashed subvol: in (get)(set)xattr the parent layout and
> > "basename" are not available. This forces us to store the hashed subvol in
> > the inode-ctx. Now, when the hashed subvol changes, we need to update these
> > inode-ctxs too.
> >
> > What do you think about a quorum-based solution to this problem?
> >
> > 1. setxattr succeeds only if it is successful on at least (n/2 + 1) subvols.
> > 2. getxattr succeeds only if it is successful, and the values match, on at
> > least (n/2 + 1) subvols.
> >
> > The flip side of this solution is that we increase the probability of failure
> > of (get)(set)xattr operations compared to the hashed-subvol-as-source-of-truth
> > solution. Or do we - how do we compare the probability of the hashed subvol
> > going down with the probability of (n/2 + 1) subvols going down
> > simultaneously? Is it 1/n vs (1/n * 1/n * ... (n/2 + 1 times))? And is 1/n the
> > correct probability for _a specific subvol (the hashed subvol)_ going down
> > (as opposed to _any one subvol_ going down)?
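
(To state that comparison precisely: write p for the probability that a given
subvolume fails during the period considered and assume failures are
independent - the same framing Xavi uses below. His exact expression is
attached as an image and is not reproduced here, but the two quantities being
compared have the standard binomial form

    P(hashed subvol fails) = p

    P(N/2 + 1 or more of N subvols fail) =
        \sum_{k \geq N/2 + 1} \binom{N}{k} p^{k} (1 - p)^{N-k}

Here p is whatever per-subvolume failure probability the model assigns; it
need not be 1/n.)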
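
(And to make the getxattr half of the proposed quorum rule concrete, below is
a minimal standalone sketch of the "same value on at least n/2 + 1 subvols"
check. This is plain C, not actual DHT code; every name in it is invented for
illustration only.)

/* Sketch only (assumed names, not GlusterFS API): check whether a single
 * xattr value was returned identically by at least (n/2 + 1) subvolumes.
 * NULL entries represent subvolumes where getxattr failed. */
#include <stdio.h>
#include <string.h>

static int
xattr_value_has_quorum (const char *values[], int n, const char **winner)
{
        int quorum = n / 2 + 1;
        int i = 0;
        int j = 0;
        int count = 0;

        for (i = 0; i < n; i++) {
                if (!values[i])
                        continue;  /* this subvol was down or had no value */

                count = 0;
                for (j = 0; j < n; j++) {
                        if (values[j] && strcmp (values[i], values[j]) == 0)
                                count++;
                }

                if (count >= quorum) {
                        *winner = values[i];
                        return 1;  /* quorum met: use this value */
                }
        }

        return 0;  /* no value reached quorum: fail the getxattr */
}

int
main (void)
{
        /* 4 subvols, one unreachable, three agree => 3 >= (4/2 + 1). */
        const char *values[] = { "v1", "v1", NULL, "v1" };
        const char *winner = NULL;

        if (xattr_value_has_quorum (values, 4, &winner))
                printf ("getxattr succeeds with value: %s\n", winner);
        else
                printf ("getxattr fails: no quorum\n");

        return 0;
}

(The setxattr side would be the simpler case of counting successful replies
against the same n/2 + 1 threshold.)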
>
> If we suppose p to be the probability of failure of a subvolume in a period
> of time (a year, for example), all subvolumes have the same probability, and
> we have N subvolumes, then:
>
> Probability of failure of hashed-subvol: p
> Probability of failure of N/2 + 1 or more subvols: <attached as an image>

Thanks Xavi. That was quick :).

> Note that this probability says how probable it is that N/2 + 1 subvols or
> more fail in the specified period of time, but not necessarily
> simultaneously. If we suppose that subvolumes are recovered as fast as
> possible, the real probability of simultaneous failure will be much smaller.
>
> In the worst case (not recovering the failed subvolumes in the given period
> of time), if p < 0.5 or N = 2 (and p != 1), then it's always better to check
> N/2 + 1 subvolumes. Otherwise, it's better to check the hashed-subvol.
>
> I think that p should always be much smaller than 0.5 for small periods of
> time where subvolume recovery could not be completed before other failures,
> so checking half plus one subvols should always be the best option in terms
> of probability. Performance can suffer though if some kind of synchronization
> is needed.

For this problem, no synchronization is needed. We need to wind the
(get)(set)xattr call to all subvols though. What I didn't think through is
rollback/rollforward during setxattr if the op fails on too many subvols to
reach quorum. One problem with the rollback approach is that we may never get
a chance to roll back at all, and how do we handle racing setxattrs on the
same key from different clients/apps (coupled with rollback etc.)? I need to
think more about this.

> Xavi
>
> > Let me know if this approach is suitable.
> >
> > Regards
> > Mohit Agrawal
> >
> > On Wed, Sep 7, 2016 at 10:27 PM, Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx> wrote:
> >
> > On Wed, Sep 7, 2016 at 9:46 PM, Mohit Agrawal <moagrawa@xxxxxxxxxx> wrote:
> >
> > Hi Pranith,
> >
> > In the current approach I am getting the list of xattrs from the first up
> > subvolume and updating the user attributes from that xattr onto all other
> > subvolumes.
> >
> > I have assumed the first up subvol is the source and the rest of them are
> > sinks, as we do the same in dht_dir_attr_heal.
> >
> > I think the first up subvol is different for different mounts, as per my
> > understanding; I could be wrong.
> >
> > Regards
> > Mohit Agrawal
> >
> > On Wed, Sep 7, 2016 at 9:34 PM, Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx> wrote:
> >
> > hi Mohit,
> > How does dht find which subvolume has the correct list of xattrs? i.e. how
> > does it determine which subvolume is the source and which is the sink?
> >
> > On Wed, Sep 7, 2016 at 2:35 PM, Mohit Agrawal <moagrawa@xxxxxxxxxx> wrote:
> >
> > Hi,
> >
> > I am trying to find a solution to one problem in dht specific to user xattr
> > healing. I tried to correct it in the same way as we do for healing dir
> > attributes, but I feel it is not the best solution.
> >
> > To find the right way to heal the xattrs I want to discuss it with you, in
> > case anyone has a better solution to correct it.
> >
> > Problem:
> > In a distributed volume environment, the custom extended attribute value
> > for a directory does not show the correct value after stopping/starting the
> > brick.
> > If any extended attribute value is set for a directory while the brick is
> > stopped, the attribute value is not updated on that brick after the brick
> > is started again.
> >
> > Current approach:
> > 1) A function set_user_xattr stores the user extended attributes in a
> >    dictionary.
> > 2) The function dht_dir_xattr_heal calls syncop_setxattr to update the
> >    attributes on all subvolumes.
> > 3) The function (dht_dir_xattr_heal) is called for every directory lookup
> >    in dht_lookup_revalidate_cbk.
> >
> > Pseudocode for the function dht_dir_xattr_heal is like below:
> >
> > 1) First it fetches the attributes from the first up subvolume and stores
> >    them into xattr.
> > 2) It loops over all subvolumes and fetches the existing attributes from
> >    every subvolume.
> > 3) It replaces the user attributes in the current attributes with the user
> >    attributes from xattr.
> > 4) It sets the latest extended attributes (current + old user attributes)
> >    onto the subvolume.
> >
> > The problems with this current approach are:
> >
> > 1) It calls the heal function (dht_dir_xattr_heal) for every directory
> >    lookup, without comparing the xattrs.
> > 2) The function internally calls syncop setxattr for every subvolume, which
> >    would be an expensive operation.
> >
> > I have another way, like below, to correct it, but it has a dependency on
> > time (I am not sure whether time is in sync on all bricks or not):
> >
> > 1) At the time of setting an extended attribute (setxattr), update the
> >    change time in the metadata on the server side.
> > 2) Compare the change time before calling the healing function in
> >    dht_revalidate_cbk.
> >
> > Please share your input on this. Appreciate your input.
> >
> > Regards
> > Mohit Agrawal
> >
> > --
> > Pranith
> >
> > --
> > Pranith
> >
> > --
> > Raghavendra G

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel