Re: dht mkdir preop check, afr and (non-)readable afr subvols


 



Hi Raghavendra,

On 06/06/16 10:54, Raghavendra G wrote:


On Wed, Jun 1, 2016 at 12:50 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:

    Hi,

    On 01/06/16 08:53, Raghavendra Gowdappa wrote:



        ----- Original Message -----

            From: "Xavier Hernandez" <xhernandez@xxxxxxxxxx>
            To: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>,
            "Raghavendra G" <raghavendra@xxxxxxxxxxx>
            Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
            Sent: Wednesday, June 1, 2016 11:57:12 AM
            Subject: Re: dht mkdir preop check, afr and
            (non-)readable afr subvols

            Oops, you are right. For entry operations the current
            version of the parent directory is not checked, precisely
            to avoid this problem.

            This means that mkdir will be sent to all alive subvolumes.
            However, it still selects the group of answers that has a
            quorum equal to or greater than #bricks - redundancy, so it
            should still be valid.


        What if quorum is met on "bad" subvolumes and mkdir succeeds
        there? Do we consider the mkdir successful? If yes, even EC
        suffers from the problem described in bz
        https://bugzilla.redhat.com/show_bug.cgi?id=1341429.


    I don't understand the real problem. How could a subvolume of EC be
    in a bad state from the point of view of DHT?

    If you use xattrs to configure something in the parent directories,
    you must have used setxattr or xattrop to do that. These
    operations do consider good/bad bricks because they touch inode
    metadata. They will only succeed if enough (quorum) bricks have
    successfully processed them. If quorum is met but on an error answer,
    an error will be reported to DHT and the majority of bricks will be
    left in the old state (these should be considered the good
    subvolumes). If some brick has succeeded, it will be considered bad
    and will be healed. If no quorum is met (not even on an error answer),
    EIO will be returned and the state of the directory should be
    considered unknown/damaged.


Yes. Ideally, dht should use a getxattr for the layout xattr. But for
performance reasons we thought of overloading mkdir by introducing
pre-operations (done by bricks). With plain dht it is a simple
comparison between the xattrs passed as arguments and the xattrs stored
on disk. But I failed to include afr and EC in the picture.
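To make the brick-side comparison concrete, here is a minimal sketch, assuming the pre-op simply compares the layout xattr the client sends in xdata with the one stored on the parent directory. The xattr name follows DHT convention; the helper and its signature are illustrative, not the actual posix xlator code:

```python
# Hypothetical sketch of the brick-side pre-op check for mkdir: the
# client sends its cached view of the parent's layout xattr, and the
# brick proceeds only if it matches what is stored on disk.

LAYOUT_XATTR = "trusted.glusterfs.dht"  # DHT layout xattr name

def preop_check_mkdir(parent_xattrs, client_layout):
    """parent_xattrs: dict of the parent's on-disk xattrs.
    client_layout: bytes the client passed along with mkdir.

    Returns True only if the client's cached layout matches the
    brick's, i.e. the client is not operating on a stale layout."""
    stored = parent_xattrs.get(LAYOUT_XATTR)
    return stored is not None and stored == client_layout
```

If the check fails, the brick fails the mkdir, giving DHT a chance to refresh the parent layout and retry.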

I still miss something. Looking at the patch that implements this (http://review.gluster.org/13885), it seems that mkdir fails if the parent xattr is not correctly set, so it's not possible to create a directory on a "bad" brick.

If the majority of the subvolumes of ec fail, the whole request will fail and this failure will be reported to DHT. If the majority succeed, success will be reported to DHT, even if some of the subvolumes have failed.

Maybe if you give me a specific example I may see the real problem.

Xavi

Hence this issue. How difficult would it be for EC and AFR to add this
kind of check? Is it even possible for afr and EC to implement this
kind of pre-op check with reasonable complexity?


    If a later mkdir checks this value in storage/posix and succeeds on
    enough bricks, it necessarily means that it has succeeded on good
    bricks, because there cannot be enough bricks with the bad xattr value.

    Note that quorum is always > #bricks/2, so we cannot have a quorum
    with good and bad bricks at the same time.
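This majority argument is a simple pigeonhole fact, and it can be checked exhaustively for small cluster sizes. The following self-contained snippet is purely illustrative, not EC code:

```python
# Since quorum > #bricks / 2, two disjoint groups of bricks cannot
# both reach quorum: together they would need more than #bricks members.

def quorum(n_bricks):
    # Smallest integer strictly greater than n_bricks / 2.
    return n_bricks // 2 + 1

def two_disjoint_quorums_fit(n_bricks):
    # Could a "good" group and a "bad" group both reach quorum?
    return 2 * quorum(n_bricks) <= n_bricks

# Exhaustive check over small cluster sizes: it is never possible.
assert not any(two_disjoint_quorums_fit(n) for n in range(1, 1000))
```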

    Xavi




            Xavi

            On 01/06/16 06:51, Pranith Kumar Karampuri wrote:

                Xavi,
                        But if we keep winding only to good subvolumes,
                there is a case
                where bad subvolumes will never catch up right? i.e. if
                we keep creating
                files in same directory and everytime self-heal
                completes there are more
                entries mounts would have created on the good subvolumes
                alone. I think
                I must have missed this in the reviews if this is the
                current behavior.
                It was not in the earlier releases. Right?

                Pranith

                On Tue, May 31, 2016 at 2:17 PM, Raghavendra G
                <raghavendra@xxxxxxxxxxx> wrote:



                    On Tue, May 31, 2016 at 12:37 PM, Xavier Hernandez
                    <xhernandez@xxxxxxxxxx> wrote:

                        Hi,

                        On 31/05/16 07:05, Raghavendra Gowdappa wrote:

                            +gluster-devel, +Xavi

                            Hi all,

                            The context is [1], where bricks do
                            pre-operation checks before doing a fop and
                            proceed with the fop only if the pre-op
                            check is successful.

                            @Xavi,

                            We need your inputs on behavior of EC
                subvolumes as well.


                        If I understand correctly, EC shouldn't have
                        any problems here.

                        EC sends the mkdir request to all subvolumes
                        that are currently considered "good" and tries
                        to combine the answers. Answers that match in
                        return code, errno (if necessary) and xdata
                        contents (except for some special xattrs that
                        are ignored for combination purposes) are
                        grouped.

                        Then it takes the group with the most
                        members/answers. If that group has a minimum
                        size of #bricks - redundancy, it is considered
                        the good answer. Otherwise EIO is returned
                        because the bricks are in an inconsistent
                        state.

                        If there's any answer in another group, it's
                        considered bad and gets marked so that
                        self-heal will repair it using the good
                        information from the majority of bricks.

                        xdata is combined and returned even if the
                        return code is -1.

                        Is that enough to cover the needed behavior?
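The combination logic described above can be sketched as follows. This is a minimal illustration that groups answers only by (ret, errno); the real EC code also matches xdata contents, and all names here are hypothetical:

```python
# Illustrative sketch (not the actual EC code) of combining per-brick
# answers: matching answers are grouped, the largest group wins if it
# has at least #bricks - redundancy members, otherwise the bricks are
# inconsistent and EIO is returned.

import errno

def combine_answers(answers, n_bricks, redundancy):
    """answers: dict brick_name -> (ret, errno).
    Returns (accepted_answer, bad_bricks)."""
    groups = {}
    for brick, answer in answers.items():
        groups.setdefault(answer, []).append(brick)
    best_answer, members = max(groups.items(), key=lambda kv: len(kv[1]))
    if len(members) < n_bricks - redundancy:
        return ((-1, errno.EIO), [])   # no consistent quorum of answers
    # Minority answers are "bad"; those bricks get marked for self-heal.
    bad = [b for ans, bs in groups.items() if ans != best_answer for b in bs]
    return (best_answer, bad)
```

For a 3-brick dispersed volume with redundancy 1, two matching successes are enough to accept the answer, and the odd brick out is queued for heal.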


                    Thanks Xavi. That's sufficient for the feature in
                    question. One of the main cases I was interested in
                    was what the behaviour would be if mkdir succeeds
                    on a "bad" subvolume and fails on a "good"
                    subvolume. Since you never wind mkdir to "bad"
                    subvolume(s), this situation never arises.




                        Xavi



                            [1] http://review.gluster.org/13885

                            regards,
                            Raghavendra

                            ----- Original Message -----

                                From: "Pranith Kumar Karampuri"
                                <pkarampu@xxxxxxxxxx>
                                To: "Raghavendra Gowdappa"
                                <rgowdapp@xxxxxxxxxx>
                                Cc: "team-quine-afr"
                                <team-quine-afr@xxxxxxxxxx>,
                                "rhs-zteam" <rhs-zteam@xxxxxxxxxx>
                                Sent: Tuesday, May 31, 2016 10:22:49 AM
                                Subject: Re: dht mkdir preop check, afr and
                                (non-)readable afr subvols

                                I think you should start a discussion
                                on gluster-devel so that Xavi gets a
                                chance to respond to the mails as well.

                                On Tue, May 31, 2016 at 10:21 AM,
                                Raghavendra Gowdappa
                                <rgowdapp@xxxxxxxxxx> wrote:

                                    Also note that we have plans to
                                    extend this pre-op check to all
                                    dentry operations that also depend
                                    on the parent layout. So the
                                    discussion needs to cover all
                                    dentry operations, like:

                                    1. create
                                    2. mkdir
                                    3. rmdir
                                    4. mknod
                                    5. symlink
                                    6. unlink
                                    7. rename

                                    We also plan to have similar checks
                                    in the lock codepath for
                                    directories too (planning to use
                                    the hashed-subvolume as the
                                    lock-subvolume for directories).
                                    So, more fops :)
                                    8. lk (posix locks)
                                    9. inodelk
                                    10. entrylk

                                    regards,
                                    Raghavendra

                                    ----- Original Message -----

                                        From: "Raghavendra Gowdappa"
                                        <rgowdapp@xxxxxxxxxx>
                                        To: "team-quine-afr"
                                        <team-quine-afr@xxxxxxxxxx>
                                        Cc: "rhs-zteam"
                                        <rhs-zteam@xxxxxxxxxx>
                                        Sent: Tuesday, May 31, 2016
                                        10:15:04 AM
                                        Subject: dht mkdir preop check,
                                        afr and (non-)readable afr
                                        subvols

                                        Hi all,

                                        I have some queries related to
                                        the behavior of afr_mkdir with
                                        respect to readable subvols.

                                        1. While winding mkdir to
                                        subvols, does afr check whether
                                        the subvolume is good/readable?
                                        Or does it wind to all subvols
                                        irrespective of whether a
                                        subvol is good/bad? In the
                                        latter case, what if
                                           a. mkdir succeeds on a
                                        non-readable subvolume
                                           b. it fails on a readable
                                        subvolume

                                          What is the result reported
                                        to higher layers in the above
                                        scenario? If the mkdir is
                                        failed, is it cleaned up on the
                                        non-readable subvolume where it
                                        succeeded?

                                        I am interested in this case as
                                        the dht pre-op check relies on
                                        layout xattrs, and I assume
                                        that layout xattrs in
                                        particular (and all xattrs in
                                        general) are guaranteed to be
                                        correct only on a readable
                                        subvolume of afr. So, in
                                        essence, we shouldn't be
                                        winding down mkdir on
                                        non-readable subvols, as
                                        whatever decision the brick
                                        makes as part of the pre-op
                                        check is inherently flawed.

                                        regards,
                                        Raghavendra

                                --
                                Pranith

                        _______________________________________________
                        Gluster-devel mailing list
                        Gluster-devel@xxxxxxxxxxx
                        http://www.gluster.org/mailman/listinfo/gluster-devel




                    --
                    Raghavendra G





                --
                Pranith






--
Raghavendra G


