Re: 1402538 : Assertion failure during rebalance of symbolic links

Xavier Hernandez <xhernandez@xxxxxxxxxx> · Wed, 14 Dec 2016 10:24:11 +0100

On 12/14/2016 10:17 AM, Pranith Kumar Karampuri wrote:

On Wed, Dec 14, 2016 at 1:48 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx
<mailto:xhernandez@xxxxxxxxxx>> wrote:

    There's another issue with the patch that Ashish sent.

    The original problem is that a setattr on a symbolic link gets
    transformed to a regular file while the fop is being executed. Even
    if we apply the Ashish' patch to avoid the assert, the setattr fop
    will still succeed and incorrectly change the attributes of a
    gluster special file that shouldn't change.

    I think that's a bigger problem that needs to be addressed globally.

    I'm sure this is not an easy solution, but probably the best way
    would be to have distinct inodes for the gluster link files and the
    original file. This way most of these problems should be solved.

Is there any reason why there is a difference in type of the file on
hashed/cached subvols? We can have the same type of file on both dht
subvolumes? That will prevent unlink of regular file and recreate with
the actual type of the file?

I think the problem is not only the type of the inode. There are more 
things involved. If we allow operations intended for regular files to 
succeed on the dht link file itself, the operation won't be visible and 
may affect future actions.

How it's prevented that the setattr modifies an already created link 
file ? or at least, are these changes propagated to the real file later 
and the link is restored to the original state ? if so, how dht detects 
all this without any locks ? if it's able to detect that, why does it 
send the setattr request anyway ?

    Xavi

    On 12/14/2016 09:02 AM, Xavier Hernandez wrote:

        On 12/14/2016 06:10 AM, Raghavendra Gowdappa wrote:

            ----- Original Message -----

                From: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx
                <mailto:pkarampu@xxxxxxxxxx>>
                To: "Ashish Pandey" <aspandey@xxxxxxxxxx
                <mailto:aspandey@xxxxxxxxxx>>
                Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx
                <mailto:gluster-devel@xxxxxxxxxxx>>, "Shyam Ranganathan"
                <srangana@xxxxxxxxxx <mailto:srangana@xxxxxxxxxx>>,
                "Nithya Balachandran"
                <nbalacha@xxxxxxxxxx <mailto:nbalacha@xxxxxxxxxx>>,
                "Xavier Hernandez" <xhernandez@xxxxxxxxxx
                <mailto:xhernandez@xxxxxxxxxx>>,
                "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx
                <mailto:rgowdapp@xxxxxxxxxx>>
                Sent: Tuesday, December 13, 2016 9:29:46 PM
                Subject: Re: 1402538 : Assertion failure during
                rebalance of symbolic
                links

                On Tue, Dec 13, 2016 at 2:45 PM, Ashish Pandey
                <aspandey@xxxxxxxxxx <mailto:aspandey@xxxxxxxxxx>>
                wrote:

                    Hi All,

                    We have been seeing an issue where re balancing
                    symbolic links leads
                    to an
                    assertion failure in EC volume.

                    The root cause of this is that while migrating
                    symbolic links to
                    other sub
                    volume, it creates a link file (with attributes
                    .........T) .
                    This file is a regular file.
                    Now, during migration a setattr comes to this link
                    and because of
                    possible
                    race, posix_stat return stats of this "T" file.
                    In ec_manager_seattr, we receive callbacks and check
                    the type of
                    entry. If
                    it is a regular file we try to get size and if it is
                    not there, we
                    raise an
                    assert.
                    So, basically we are checking a size of the link
                    (which will not have
                    size) which has been returned as regular file and we
                    are ending up when
                    this condition
                    becomes TRUE.

                    Now, this looks like a problem with re balance and
                    difficult to fix at
                    this point (as per the discussion).
                    We have an alternative to fix it in EC but that will
                    be more like a
                    hack
                    than an actual fix. We should not modify EC
                    to deal with an individual issue which is in other
                    translator.

            I am afraid, dht doesn't have a better way of handling this.
            While DHT
            maintains abstraction (of a symbolic link) to layers above,
            the layers
            below it cannot be shielded from seeing the details like a
            linkto file
            etc.

        That's ok, and I think it's the right thing to do. From the point of
        view of EC, it's irrelevant how the file is seen by upper layers. It
        only cares about the files below it.

            If the concern really is that the file is changing its type
            in a span
            of single fop, we can probably explore the option of locking
            (or other
            synchronization mechanisms) to prevent migration taking
            place, while a
            fop is in progress.

        That's the real problem. Some operations receive an inode
        referencing a
        symbolic link on input but the iatt structures from the callback
        reference a regular file. It's even worse because it's an
        asynchronous
        race so some of the bricks may return a regular file and some
        may return
        a symbolic link. If there are more than redundancy bricks
        returning a
        different type, the most probably result will be an I/O error
        caused by
        inconsistent answers.

        Ashish wrote a patch to check the type of the inode at the input
        instead
        of relying on the answers. While this could avoid the assertion
        issued
        by ec, it doesn't solve the race, leaving room for the I/O errors I
        mentioned earlier.

            But, I assume there will be performance penalties for that too.

        Yes. I don't see any other way to really solve this problem. A
        lock is
        needed.

        In ec we already have a problem that will need an additional lock on
        rmdir, unlink and rename to avoid some races. This change will
        also need
        support from locks xlator to avoid granting locks on deleted
        inodes. If
        dht is using one of these operations to replace the symbolic
        link by the
        gluster link file, I think this change could solve the I/O
        errors, but
        I'm not sure we could completely solve the problem.

        I'm not sure how dht does the transform from a symbolic link to a
        gluster link file, but if it involves more than one fop from the
        point
        of view of ec, there's nothing that ec can do to solve the
        problem. If
        another client accesses the file, ec can return any intermediate
        state.
        DHT should take some lock to do all operations atomically and avoid
        problems on other clients.

        I think that the mid-term approach to completely solve the problem
        without a performance impact should be to implement some kind of
        transaction mechanism that will reuse lock requests. This would
        allow,
        among other things, that multiple atomic operations could be
        performed
        by different xlators but sharing the locks instead of requiring each
        xlator to take an inodelk on its own.

        Xavi

                    Now the question is how to proceed with this? Any
                    suggestions?

                Raghavendra/Nithya,
                         Could one of you explain the difficulties in
                fixing this
                issue in
                DHT so that Xavi will also be caught up with why we
                should add this
                change
                in EC in the short term.

                    Details on this bug can be found here -
                    https://bugzilla.redhat.com/show_bug.cgi?id=1402538
                    <https://bugzilla.redhat.com/show_bug.cgi?id=1402538>

                    ----
                    Ashish

                --
                Pranith

--
Pranith

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel