Re: Query regards to heal xattr heal in dht

Xavier Hernandez <xhernandez@xxxxxxxxxx> · Thu, 15 Sep 2016 13:24:25 +0200

On 15/09/16 11:31, Raghavendra G wrote:

On Thu, Sep 15, 2016 at 12:02 PM, Nithya Balachandran
<nbalacha@xxxxxxxxxx <mailto:nbalacha@xxxxxxxxxx>> wrote:

    On 8 September 2016 at 12:02, Mohit Agrawal <moagrawa@xxxxxxxxxx
    <mailto:moagrawa@xxxxxxxxxx>> wrote:

        Hi All,

           I have one another solution to heal user xattr but before
        implement it i would like to discuss with you.

           Can i call function (dht_dir_xattr_heal internally it is
        calling syncop_setxattr) to heal xattr in dht_getxattr_cbk in last
           after make sure we have a valid xattr.
           In function(dht_dir_xattr_heal) it will copy blindly all user
        xattr on all subvolume or i can compare subvol xattr with valid
        xattr if there is any mismatch then i will call syncop_setxattr
        otherwise no need to call. syncop_setxattr.

    This can be problematic if a particular xattr is being removed - it
    might still exist on some subvols. IIUC, the heal would go and reset
    it again?

    One option is to use the hash subvol for the dir as the source - so
    perform xattr op on hashed subvol first and on the others only if it
    succeeds on the hashed. This does have the problem of being unable
    to set xattrs if the hashed subvol is unavailable. This might not be
    such a big deal in case of distributed replicate or distribute
    disperse volumes but will affect pure distribute. However, this way
    we can at least be reasonably certain of the correctness (leaving
    rebalance out of the picture).

* What is the behavior of getxattr when hashed subvol is down? Should we
succeed with values from non-hashed subvols or should we fail getxattr?
With hashed-subvol as source of truth, its difficult to determine
correctness of xattrs and their values when it is down.

* setxattr is an inode operation (as opposed to entry operation). So, we
cannot calculate hashed-subvol as in (get)(set)xattr, parent layout and
"basename" is not available. This forces us to store hashed subvol in
inode-ctx. Now, when the hashed-subvol changes we need to update these
inode-ctxs too.

What do you think about a Quorum based solution to this problem?

1. setxattr succeeds only if it is successful on at least (n/2 + 1)
number of subvols.
2. getxattr succeeds only if it is successful and values match on at
least (n/2 + 1) number of subvols.

The flip-side of this solution is we are increasing the probability of
failure of (get)(set)xattr operations as opposed to the hashed-subvol as
source of truth solution. Or are we - how do we compare probability of
hashed-subvol going down with probability of (n/2 + 1) nodes going down
simultaneously? Is it 1/n vs (1/n*1/n*... (n/2+1 times)?. Is 1/n correct
probability for _a specific subvol (hashed-subvol)_ going down (as
opposed to _any one subvol_ going down)?

If we suppose p to be the probability of failure of a subvolume in a 
period of time (a year for example), all subvolumes have the same 
probability, and we have N subvolumes, then:

Probability of failure of hashed-subvol: p
Probability of failure of N/2 + 1 or more subvols: <attached as an image>

Note that this probability says how much probable is that N/2 + 1 
subvols or more fail in the specified period of time, but not 
necessarily simultaneously. If we suppose that subvolumes are recovered 
as fast as possible, the real probability of simultaneous failure will 
be much smaller.

In worst case (not recovering the failed subvolumes in the given period 
of time), if p < 0.5 or N = 2 (and p != 1), then it's always better to 
check N/2 + 1 subvolumes. Otherwise, it's better to check the hashed-subvol.

I think that p should always be much smaller than 0.5 for small periods 
of time where subvolume recovery could no be completed before other 
failures, so checking half plus one subvols should always be the best 
option in terms of probability. Performance can suffer though if some 
kind of synchronization is needed.

Xavi

           Let me know if this approach is suitable.

        Regards
        Mohit Agrawal

        On Wed, Sep 7, 2016 at 10:27 PM, Pranith Kumar Karampuri
        <pkarampu@xxxxxxxxxx <mailto:pkarampu@xxxxxxxxxx>> wrote:

            On Wed, Sep 7, 2016 at 9:46 PM, Mohit Agrawal
            <moagrawa@xxxxxxxxxx <mailto:moagrawa@xxxxxxxxxx>> wrote:

                Hi Pranith,

                In current approach i am getting list of xattr from
                first up volume and update the user attributes from that
                xattr to
                all other volumes.

                I have assumed first up subvol is source and rest of
                them are sink as we are doing same in dht_dir_attr_heal.

            I think first up subvol is different for different mounts as
            per my understanding, I could be wrong.

                Regards
                Mohit Agrawal

                On Wed, Sep 7, 2016 at 9:34 PM, Pranith Kumar Karampuri
                <pkarampu@xxxxxxxxxx <mailto:pkarampu@xxxxxxxxxx>> wrote:

                    hi Mohit,
                           How does dht find which subvolume has the
                    correct list of xattrs? i.e. how does it determine
                    which subvolume is source and which is sink?

                    On Wed, Sep 7, 2016 at 2:35 PM, Mohit Agrawal
                    <moagrawa@xxxxxxxxxx <mailto:moagrawa@xxxxxxxxxx>>
                    wrote:

                        Hi,

                          I am trying to find out solution of one
                        problem in dht specific to user xattr healing.
                          I tried to correct it in a same way as we are
                        doing for healing dir attribute but i feel it is
                        not best solution.

                          To find a right way to heal xattr i want to
                        discuss with you if anyone does have better
                        solution to correct it.

                          Problem:
                           In a distributed volume environment custom
                        extended attribute value for a directory does
                        not display correct value after stop/start the
                        brick. If any extended attribute value is set
                        for a directory after stop the brick the
                        attribute value is not updated on brick after
                        start the brick.

                          Current approach:
                            1) function set_user_xattr to store user
                        extended attribute in dictionary
                            2) function dht_dir_xattr_heal call
                        syncop_setxattr to update the attribute on all
                        volume
                            3) Call the function (dht_dir_xattr_heal)
                        for every directory lookup in
                        dht_lookup_revalidate_cbk

                          Psuedocode for function dht_dir_xatt_heal is
                        like below

                           1) First it will fetch atttributes from first
                        up volume and store into xattr.
                           2) Run loop on all subvolume and fetch
                        existing attributes from every volume
                           3) Replace user attributes from current
                        attributes with xattr user attributes
                           4) Set latest extended attributes(current +
                        old user attributes) inot subvol.

                           In this current approach problem is

                           1) it will call heal
                        function(dht_dir_xattr_heal) for every directory
                        lookup without comparing xattr.
                            2) The function internally call syncop xattr
                        for every subvolume that would be a expensive
                        operation.

                           I have one another way like below to correct
                        it but again in this one it does have dependency
                        on time (not sure time is synch on all bricks or
                        not)

                           1) At the time of set extended
                        attribute(setxattr) change time in metadata at
                        server side
                           2) Compare change time before call healing
                        function in dht_revalidate_cbk

                            Please share your input on this.
                            Appreciate your input.

                        Regards
                        Mohit Agrawal

                        _______________________________________________
                        Gluster-devel mailing list
                        Gluster-devel@xxxxxxxxxxx
                        <mailto:Gluster-devel@xxxxxxxxxxx>
                        http://www.gluster.org/mailman/listinfo/gluster-devel
                        <http://www.gluster.org/mailman/listinfo/gluster-devel>

                    --
                    Pranith

            --
            Pranith

        _______________________________________________
        Gluster-devel mailing list
        Gluster-devel@xxxxxxxxxxx <mailto:Gluster-devel@xxxxxxxxxxx>
        http://www.gluster.org/mailman/listinfo/gluster-devel
        <http://www.gluster.org/mailman/listinfo/gluster-devel>

    _______________________________________________
    Gluster-devel mailing list
    Gluster-devel@xxxxxxxxxxx <mailto:Gluster-devel@xxxxxxxxxxx>
    http://www.gluster.org/mailman/listinfo/gluster-devel
    <http://www.gluster.org/mailman/listinfo/gluster-devel>

--
Raghavendra G

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel

Attachment:
prob.png

Description: PNG image
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel