Re: 1402538 : Assertion failure during rebalance of symbolic links

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 12/14/2016 10:28 AM, Pranith Kumar Karampuri wrote:


On Wed, Dec 14, 2016 at 2:54 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx
<mailto:xhernandez@xxxxxxxxxx>> wrote:

    On 12/14/2016 10:17 AM, Pranith Kumar Karampuri wrote:



        On Wed, Dec 14, 2016 at 1:48 PM, Xavier Hernandez
        <xhernandez@xxxxxxxxxx <mailto:xhernandez@xxxxxxxxxx>
        <mailto:xhernandez@xxxxxxxxxx <mailto:xhernandez@xxxxxxxxxx>>>
        wrote:

            There's another issue with the patch that Ashish sent.

            The original problem is that a setattr on a symbolic link gets
            transformed to a regular file while the fop is being
        executed. Even
            if we apply the Ashish' patch to avoid the assert, the
        setattr fop
            will still succeed and incorrectly change the attributes of a
            gluster special file that shouldn't change.

            I think that's a bigger problem that needs to be addressed
        globally.

            I'm sure this is not an easy solution, but probably the best way
            would be to have distinct inodes for the gluster link files
        and the
            original file. This way most of these problems should be solved.


        Is there any reason why there is a difference in type of the file on
        hashed/cached subvols? We can have the same type of file on both dht
        subvolumes? That will prevent unlink of regular file and
        recreate with
        the actual type of the file?


    I think the problem is not only the type of the inode. There are
    more things involved. If we allow operations intended for regular
    files to succeed on the dht link file itself, the operation won't be
    visible and may affect future actions.

    How it's prevented that the setattr modifies an already created link
    file ? or at least, are these changes propagated to the real file
    later and the link is restored to the original state ? if so, how
    dht detects all this without any locks ? if it's able to detect
    that, why does it send the setattr request anyway ?


I think the assert messages are coming at the time of marking the file
as '---------T' file. DHT makes sure the actual fop happens on the
cached subvolume. But this linkto file will be present in hashed
subvolume indicating it is a linkto file (i.e. 'T' file and there will
be an extended attribute telling where the actual file is in an extended
attrbute). With EC in picture this marking of linkto file by doing a
setattr as '---------T' file is asserting.

If I understand correctly, the real file and the link file not only share the gfid but also share the inode structure itself (otherwise dht would send the new inode to setattr and that problem won't happen). This means that we have a single inode structure to represent a symbolic link and a regular file at the same time. This seems very bad to me.







            Xavi


            On 12/14/2016 09:02 AM, Xavier Hernandez wrote:

                On 12/14/2016 06:10 AM, Raghavendra Gowdappa wrote:



                    ----- Original Message -----

                        From: "Pranith Kumar Karampuri"
        <pkarampu@xxxxxxxxxx <mailto:pkarampu@xxxxxxxxxx>
                        <mailto:pkarampu@xxxxxxxxxx
        <mailto:pkarampu@xxxxxxxxxx>>>
                        To: "Ashish Pandey" <aspandey@xxxxxxxxxx
        <mailto:aspandey@xxxxxxxxxx>
                        <mailto:aspandey@xxxxxxxxxx
        <mailto:aspandey@xxxxxxxxxx>>>
                        Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx
        <mailto:gluster-devel@xxxxxxxxxxx>
                        <mailto:gluster-devel@xxxxxxxxxxx
        <mailto:gluster-devel@xxxxxxxxxxx>>>, "Shyam Ranganathan"
                        <srangana@xxxxxxxxxx
        <mailto:srangana@xxxxxxxxxx> <mailto:srangana@xxxxxxxxxx
        <mailto:srangana@xxxxxxxxxx>>>,
                        "Nithya Balachandran"
                        <nbalacha@xxxxxxxxxx
        <mailto:nbalacha@xxxxxxxxxx> <mailto:nbalacha@xxxxxxxxxx
        <mailto:nbalacha@xxxxxxxxxx>>>,
                        "Xavier Hernandez" <xhernandez@xxxxxxxxxx
        <mailto:xhernandez@xxxxxxxxxx>
                        <mailto:xhernandez@xxxxxxxxxx
        <mailto:xhernandez@xxxxxxxxxx>>>,
                        "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx
        <mailto:rgowdapp@xxxxxxxxxx>
                        <mailto:rgowdapp@xxxxxxxxxx
        <mailto:rgowdapp@xxxxxxxxxx>>>
                        Sent: Tuesday, December 13, 2016 9:29:46 PM
                        Subject: Re: 1402538 : Assertion failure during
                        rebalance of symbolic
                        links

                        On Tue, Dec 13, 2016 at 2:45 PM, Ashish Pandey
                        <aspandey@xxxxxxxxxx
        <mailto:aspandey@xxxxxxxxxx> <mailto:aspandey@xxxxxxxxxx
        <mailto:aspandey@xxxxxxxxxx>>>

                        wrote:

                            Hi All,

                            We have been seeing an issue where re balancing
                            symbolic links leads
                            to an
                            assertion failure in EC volume.

                            The root cause of this is that while migrating
                            symbolic links to
                            other sub
                            volume, it creates a link file (with attributes
                            .........T) .
                            This file is a regular file.
                            Now, during migration a setattr comes to
        this link
                            and because of
                            possible
                            race, posix_stat return stats of this "T" file.
                            In ec_manager_seattr, we receive callbacks
        and check
                            the type of
                            entry. If
                            it is a regular file we try to get size and
        if it is
                            not there, we
                            raise an
                            assert.
                            So, basically we are checking a size of the link
                            (which will not have
                            size) which has been returned as regular
        file and we
                            are ending up when
                            this condition
                            becomes TRUE.

                            Now, this looks like a problem with re
        balance and
                            difficult to fix at
                            this point (as per the discussion).
                            We have an alternative to fix it in EC but
        that will
                            be more like a
                            hack
                            than an actual fix. We should not modify EC
                            to deal with an individual issue which is in
        other
                            translator.


                    I am afraid, dht doesn't have a better way of
        handling this.
                    While DHT
                    maintains abstraction (of a symbolic link) to layers
        above,
                    the layers
                    below it cannot be shielded from seeing the details
        like a
                    linkto file
                    etc.


                That's ok, and I think it's the right thing to do. From
        the point of
                view of EC, it's irrelevant how the file is seen by
        upper layers. It
                only cares about the files below it.

                    If the concern really is that the file is changing
        its type
                    in a span
                    of single fop, we can probably explore the option of
        locking
                    (or other
                    synchronization mechanisms) to prevent migration taking
                    place, while a
                    fop is in progress.


                That's the real problem. Some operations receive an inode
                referencing a
                symbolic link on input but the iatt structures from the
        callback
                reference a regular file. It's even worse because it's an
                asynchronous
                race so some of the bricks may return a regular file and
        some
                may return
                a symbolic link. If there are more than redundancy bricks
                returning a
                different type, the most probably result will be an I/O
        error
                caused by
                inconsistent answers.

                Ashish wrote a patch to check the type of the inode at
        the input
                instead
                of relying on the answers. While this could avoid the
        assertion
                issued
                by ec, it doesn't solve the race, leaving room for the
        I/O errors I
                mentioned earlier.

                    But, I assume there will be performance penalties
        for that too.


                Yes. I don't see any other way to really solve this
        problem. A
                lock is
                needed.

                In ec we already have a problem that will need an
        additional lock on
                rmdir, unlink and rename to avoid some races. This
        change will
                also need
                support from locks xlator to avoid granting locks on deleted
                inodes. If
                dht is using one of these operations to replace the symbolic
                link by the
                gluster link file, I think this change could solve the I/O
                errors, but
                I'm not sure we could completely solve the problem.

                I'm not sure how dht does the transform from a symbolic
        link to a
                gluster link file, but if it involves more than one fop
        from the
                point
                of view of ec, there's nothing that ec can do to solve the
                problem. If
                another client accesses the file, ec can return any
        intermediate
                state.
                DHT should take some lock to do all operations
        atomically and avoid
                problems on other clients.

                I think that the mid-term approach to completely solve
        the problem
                without a performance impact should be to implement some
        kind of
                transaction mechanism that will reuse lock requests.
        This would
                allow,
                among other things, that multiple atomic operations could be
                performed
                by different xlators but sharing the locks instead of
        requiring each
                xlator to take an inodelk on its own.

                Xavi



                            Now the question is how to proceed with
        this? Any
                            suggestions?


                        Raghavendra/Nithya,
                                 Could one of you explain the
        difficulties in
                        fixing this
                        issue in
                        DHT so that Xavi will also be caught up with why we
                        should add this
                        change
                        in EC in the short term.



                            Details on this bug can be found here -

        https://bugzilla.redhat.com/show_bug.cgi?id=1402538
        <https://bugzilla.redhat.com/show_bug.cgi?id=1402538>

        <https://bugzilla.redhat.com/show_bug.cgi?id=1402538
        <https://bugzilla.redhat.com/show_bug.cgi?id=1402538>>

                            ----
                            Ashish






                        --
                        Pranith






        --
        Pranith





--
Pranith

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel



[Index of Archives]     [Gluster Users]     [Ceph Users]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Security]     [Bugtraq]     [Linux]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux