Re: dht mkdir preop check, afr and (non-)readable afr subvols

Xavier Hernandez <xhernandez@xxxxxxxxxx> · Wed, 1 Jun 2016 09:20:08 +0200

Hi,

On 01/06/16 08:53, Raghavendra Gowdappa wrote:

----- Original Message -----
From: "Xavier Hernandez" <xhernandez@xxxxxxxxxx>
To: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>, "Raghavendra G" <raghavendra@xxxxxxxxxxx>
Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
Sent: Wednesday, June 1, 2016 11:57:12 AM
Subject: Re:  dht mkdir preop check, afr and (non-)readable afr subvols

Oops, you are right. For entry operations the current version of the
parent directory is not checked, precisely to avoid this problem.

This means that mkdir will be sent to all alive subvolumes. However, it
still selects the group of answers whose size is at least
#bricks - redundancy, so the result should still be valid.

What if the quorum is met on "bad" subvolumes, and mkdir was successful
on those bad subvolumes? Do we consider mkdir successful? If yes, even EC
suffers from the problem described in bz
https://bugzilla.redhat.com/show_bug.cgi?id=1341429.

I don't understand the real problem. How could a subvolume of EC be in a
bad state from the point of view of DHT?

If you use xattrs to configure something in the parent directories, you
must have used setxattr or xattrop to do that. These operations do
consider good/bad bricks because they touch inode metadata. They only
succeed if enough bricks (a quorum) have processed the request
successfully. If quorum is met but on an error answer, an error is
reported to DHT and the majority of bricks are left in the old state
(these should be considered the good subvolumes). Any brick on which the
operation succeeded is considered bad and will be healed. If no quorum is
met (even for an error answer), EIO is returned and the state of the
directory should be considered unknown/damaged.
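
The three outcomes described above can be sketched as follows. This is only an illustration, not the EC translator's code: the function name, the dictionary keys and the simplification that every failing brick returns the same error are all assumptions made for the example.

```python
def classify_xattr_update(results, bricks, redundancy):
    """Classify a setxattr/xattrop result across EC bricks.

    results maps a brick index to True (success) or False (error).
    Hypothetical simplification: all error answers are treated as equal.
    """
    quorum = bricks - redundancy
    ok = sorted(b for b, r in results.items() if r)
    failed = sorted(b for b, r in results.items() if not r)

    if len(ok) >= quorum:
        # Quorum on success: the minority that failed must be healed.
        return {"reported": "success", "good": ok, "heal": failed}
    if len(failed) >= quorum:
        # Quorum on an error: the majority keeps the old state and is good;
        # the bricks where the operation succeeded are the bad ones.
        return {"reported": "error", "good": failed, "heal": ok}
    # No quorum for any answer: state is unknown/damaged.
    return {"reported": "EIO", "good": [], "heal": []}
```

With 6 bricks and redundancy 2 (quorum 4), one brick succeeding while five fail gives a reported error, with the single "successful" brick queued for heal.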

If a later mkdir checks this value in storage/posix and succeeds on
enough bricks, it necessarily means that it has succeeded on good
bricks, because there cannot be enough bricks with the bad xattr value.

Note that quorum is always > #bricks/2, so the good bricks and the bad
bricks cannot both form a quorum at the same time.
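
A small check of that arithmetic, as a sketch: it assumes quorum = #bricks - redundancy as stated earlier in the thread, and the function name is made up for the example. It tries every split of the bricks into a good set and a bad set and reports whether both sides could reach quorum.

```python
def both_sides_reach_quorum(bricks, redundancy):
    """Return True if some good/bad split lets both sides reach quorum."""
    quorum = bricks - redundancy
    for good in range(bricks + 1):
        bad = bricks - good  # good and bad bricks are disjoint
        if good >= quorum and bad >= quorum:
            return True
    return False


# Quorum above half of the bricks: never possible.
print(both_sides_reach_quorum(6, 2))   # quorum 4 of 6 -> False
# Hypothetical layout with quorum exactly half: the guarantee is lost.
print(both_sides_reach_quorum(4, 2))   # quorum 2 of 4 -> True
```

The second call is there only to show why the bound has to be strictly greater than #bricks/2.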

Xavi


On 01/06/16 06:51, Pranith Kumar Karampuri wrote:
Xavi,
        But if we keep winding only to good subvolumes, there is a case
where bad subvolumes will never catch up, right? I.e. if we keep creating
files in the same directory, then every time self-heal completes, the
mounts will already have created more entries on the good subvolumes
alone. I think I must have missed this in the reviews if this is the
current behavior. It was not so in the earlier releases, right?

Pranith

On Tue, May 31, 2016 at 2:17 PM, Raghavendra G <raghavendra@xxxxxxxxxxx> wrote:

    On Tue, May 31, 2016 at 12:37 PM, Xavier Hernandez
    <xhernandez@xxxxxxxxxx> wrote:

        Hi,

        On 31/05/16 07:05, Raghavendra Gowdappa wrote:

            +gluster-devel, +Xavi

            Hi all,

            The context is [1], where bricks do pre-operation checks
            before doing a fop and proceed with fop only if pre-op check
            is successful.

            @Xavi,

            We need your inputs on behavior of EC subvolumes as well.

        If I understand correctly, EC shouldn't have any problems here.

        EC sends the mkdir request to all subvolumes that are currently
        considered "good" and tries to combine the answers. Answers that
        match in return code, errno (if necessary) and xdata contents
        (except for some special xattrs that are ignored for combination
        purposes), are grouped.

        Then it takes the group with the most members/answers. If that
        group has at least #bricks - redundancy members, it is considered
        the good answer. Otherwise EIO is returned because bricks are in
        an inconsistent state.

        If there's any answer in another group, it's considered bad and
        gets marked so that self-heal will repair it using the good
        information from the majority of bricks.

        xdata is combined and returned even if return code is -1.
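
The grouping and selection steps described above can be sketched like this. It is a simplified model, not the EC translator's implementation: the `Answer` type and `combine_answers` are hypothetical names, xdata is reduced to a plain tuple, and the special xattrs that EC ignores when comparing answers are not modelled.

```python
import errno
from collections import defaultdict
from typing import NamedTuple


class Answer(NamedTuple):
    brick: int
    op_ret: int
    op_errno: int
    xdata: tuple  # simplified stand-in for the xdata dictionary


def combine_answers(answers, bricks, redundancy):
    """Return (op_ret, op_errno, bad_bricks) for one fop."""
    groups = defaultdict(list)
    for a in answers:
        # errno is only compared when the fop failed.
        key = (a.op_ret, a.op_errno if a.op_ret < 0 else 0, a.xdata)
        groups[key].append(a.brick)
    if not groups:
        return (-1, errno.EIO, [])

    # Pick the group with the most matching answers.
    key, members = max(groups.items(), key=lambda kv: len(kv[1]))
    if len(members) < bricks - redundancy:
        # No group is large enough: bricks are inconsistent.
        return (-1, errno.EIO, [])

    # Every answer outside the winning group marks its brick for heal.
    bad = sorted(a.brick for a in answers if a.brick not in members)
    return (key[0], key[1], bad)
```

For example, with 6 bricks and redundancy 2, five successful answers and one EEXIST yield a successful mkdir with the odd brick marked bad; a 3/3 split yields EIO.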

        Is that enough to cover the needed behavior ?

    Thanks Xavi. That's sufficient for the feature in question. One of
    the main cases I was interested in was what the behaviour would be
    if mkdir succeeds on a "bad" subvolume and fails on a "good"
    subvolume. Since you never wind mkdir to "bad" subvolume(s), this
    situation never arises.

        Xavi

            [1] http://review.gluster.org/13885

            regards,
            Raghavendra

            ----- Original Message -----

                From: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>
                To: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
                Cc: "team-quine-afr" <team-quine-afr@xxxxxxxxxx>,
                "rhs-zteam" <rhs-zteam@xxxxxxxxxx>
                Sent: Tuesday, May 31, 2016 10:22:49 AM
                Subject: Re: dht mkdir preop check, afr and
                (non-)readable afr subvols

                I think you should start a discussion on gluster-devel
                so that Xavi gets a chance to respond on the mails as
                well.

                On Tue, May 31, 2016 at 10:21 AM, Raghavendra Gowdappa
                <rgowdapp@xxxxxxxxxx> wrote:

                    Also note that we plan to extend this pre-op check
                    to all dentry operations that also depend on the
                    parent layout. So, the discussion needs to cover
                    all dentry operations like:

                    1. create
                    2. mkdir
                    3. rmdir
                    4. mknod
                    5. symlink
                    6. unlink
                    7. rename

                    We also plan to have similar checks in the lock
                    codepath for directories too (planning to use
                    hashed-subvolume as lock-subvolume for directories).
                    So, more fops :)
                    8. lk (posix locks)
                    9. inodelk
                    10. entrylk

                    regards,
                    Raghavendra

                    ----- Original Message -----

                        From: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
                        To: "team-quine-afr" <team-quine-afr@xxxxxxxxxx>
                        Cc: "rhs-zteam" <rhs-zteam@xxxxxxxxxx>
                        Sent: Tuesday, May 31, 2016 10:15:04 AM
                        Subject: dht mkdir preop check, afr and
                        (non-)readable afr subvols

                        Hi all,

                        I have some queries related to the behavior of
                        afr_mkdir with respect to
                        readable subvols.

                        1. While winding mkdir to subvols, does afr check
                        whether the subvolume is good/readable? Or does
                        it wind to all subvols irrespective of whether a
                        subvol is good/bad? In the latter case, what if
                           a. mkdir succeeds on a non-readable subvolume
                           b. fails on a readable subvolume

                        What is the result reported to higher layers in
                        the above scenario? If mkdir is reported as
                        failed, is it cleaned up on the non-readable
                        subvolume where it failed?

                        I am interested in this case as the dht-preop
                        check relies on layout xattrs, and I assume
                        layout xattrs in particular (and all xattrs in
                        general) are guaranteed to be correct only on a
                        readable subvolume of afr. So, in essence, we
                        shouldn't be winding mkdir down to non-readable
                        subvols, as whatever decision the brick makes as
                        part of the pre-op check is inherently flawed.

                        regards,
                        Raghavendra

                --
                Pranith

        _______________________________________________
        Gluster-devel mailing list
        Gluster-devel@xxxxxxxxxxx
        http://www.gluster.org/mailman/listinfo/gluster-devel

    --
    Raghavendra G


--
Pranith
