On 02/09/2016 10:27 AM, Raghavendra G wrote:
On Mon, Feb 8, 2016 at 4:31 PM, Soumya Koduri <skoduri@xxxxxxxxxx> wrote:
On 02/08/2016 09:13 AM, Shyam wrote:
On 02/06/2016 06:36 PM, Raghavendra Gowdappa wrote:
----- Original Message -----
From: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
To: "Sakshi Bansal" <sabansal@xxxxxxxxxx>, "Susant Palai" <spalai@xxxxxxxxxx>
Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>, "Nithya Balachandran" <nbalacha@xxxxxxxxxx>, "Shyamsundar Ranganathan" <srangana@xxxxxxxxxx>
Sent: Friday, February 5, 2016 4:32:40 PM
Subject: Re: Rebalance data migration and corruption
+gluster-devel
Hi Sakshi/Susant,
- There is a data corruption issue in the migration code. The rebalance
process:

1. Reads data from src.
2. Writes it (say w1) to dst.

However, 1 and 2 are not atomic, so another write (say w2) to the same
region can happen between 1 and 2. These two writes can reach dst in the
order (w2, w1), resulting in a subtle corruption. This issue is not
fixed yet and can cause subtle data corruptions. The fix is simple and
involves the rebalance process acquiring a mandatory lock to make 1 and
2 atomic.
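The race can be made concrete with a small in-memory model (purely
illustrative: two bytearrays stand in for the file on the src and dst
bricks, and the interleaving is written out by hand):

```python
# Hypothetical in-memory model of the rebalance race described above.
# src/dst stand in for the file region on the source and destination bricks.
src = bytearray(b"AAAA")
dst = bytearray(b"....")

# Step 1: rebalance reads the region from src.
w1 = bytes(src)                # rebalance's copy (old data)

# An application write w2 lands between steps 1 and 2. During migration
# the application's write goes to src, and its data is also expected to
# end up on dst.
src[0:4] = b"BBBB"
dst[0:4] = b"BBBB"             # w2 reaches dst first

# Step 2: rebalance writes its stale copy w1 to dst. w1 arrives after
# w2, silently overwriting the application's newer data.
dst[0:4] = w1

assert bytes(src) == b"BBBB"   # src holds the application's new data
assert bytes(dst) == b"AAAA"   # dst lost w2: subtle corruption
```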
We can make use of the compound fop framework to make sure we don't
suffer a significant performance hit. The following will be the sequence
of operations done by the rebalance process:

1. Issue a compound (mandatory lock, read) operation on src.
2. Write this data to dst.
3. Unlock the lock acquired in 1.

Please co-ordinate with Anuradha for the implementation of this compound
fop.
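A sketch of why the locked sequence closes the race, with an ordinary
mutex standing in for the per-region mandatory lock (names and structure
are illustrative, not Gluster APIs):

```python
import threading

# A mutex stands in for the mandatory lock; src/dst stand in for the
# file region on the source and destination bricks.
lock = threading.Lock()
src = bytearray(b"AAAA")
dst = bytearray(b"....")

def migrate_region():
    # 1. compound (mandatory lock, read) on src
    with lock:
        data = bytes(src)
        # 2. write the data to dst while still holding the lock,
        #    so no application write can slip between read and write
        dst[0:4] = data
    # 3. unlock (released on leaving the `with` block)

def app_write(buf):
    # Application writes conflict with the mandatory lock, so they are
    # ordered either entirely before or entirely after the migration
    # read+write pair.
    with lock:
        src[0:4] = buf
        dst[0:4] = buf

migrate_region()
app_write(b"BBBB")
assert bytes(dst) == bytes(src)   # dst can no longer hold stale data
```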
Following are the issues I see with this approach:

1. features/locks provides mandatory lock functionality only for
posix-locks (flock and fcntl based locks). So the mandatory locks will
be posix-locks, which will conflict with locks held by the application.
So, if an application has held an fcntl/flock, migration cannot proceed.
What if the file is opened with O_NONBLOCK? Can't the rebalance process
skip the file and continue if mandatory lock acquisition fails?
Similar functionality can be achieved by acquiring a non-blocking
inodelk, like SETLK (as opposed to SETLKW). However, whether the
rebalance process should block or not depends on the use case. In some
use-cases (like remove-brick) the rebalance process _has_ to migrate all
the files. Even for other scenarios, skipping too many files is not a
good idea, as it beats the purpose of running rebalance. So one of the
design goals is to migrate as many files as possible without making the
design too complex.
We can implement a "special" domain for mandatory internal locks. These
locks will behave similar to posix mandatory locks in that conflicting
fops (like write, read) are blocked/failed if they are done while a lock
is held.
So is the only difference between mandatory internal locks and posix
mandatory locks that internal locks shall not conflict with other
application locks (advisory/mandatory)?
Yes. Mandatory internal locks (aka Mandatory inodelk for this
discussion) will conflict only in their domain. They also conflict with
any fops that might change the file (primarily write here, but different
fops can be added based on requirement). So in a fop like writev we need
to check in two lists - external lock (posix lock) list _and_ mandatory
inodelk list.
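The two-list check in writev could look roughly like the sketch below.
All names and structures here are hypothetical stand-ins; the real
features/locks translator keeps its own C structures:

```python
# Sketch of the conflict check a writev would do against both lists:
# external posix locks and internal mandatory inodelks. Hypothetical
# names; not the actual features/locks data structures.

def overlaps(lock, offset, length):
    """Byte-range overlap test between a held lock and an incoming write."""
    return offset < lock["start"] + lock["len"] and lock["start"] < offset + length

def write_conflicts(inode, offset, length, client):
    # List 1: external posix locks. Only mandatory ones block a write,
    # and a client never conflicts with its own lock.
    for l in inode["posix_locks"]:
        if l["mandatory"] and l["owner"] != client and overlaps(l, offset, length):
            return True
    # List 2: mandatory inodelks held by internal processes (rebalance).
    for l in inode["mandatory_inodelks"]:
        if l["owner"] != client and overlaps(l, offset, length):
            return True
    return False

inode = {
    "posix_locks": [],
    "mandatory_inodelks": [{"start": 0, "len": 4096, "owner": "rebalance"}],
}
assert write_conflicts(inode, 100, 10, "client-1")       # blocked/failed
assert not write_conflicts(inode, 8192, 10, "client-1")  # outside the range
```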
The reason (if not clear) for using mandatory locks by rebalance process
is that clients need not be bothered with acquiring a lock (which will
unnecessarily degrade performance of I/O when there is no rebalance
going on). Thanks to Raghavendra Talur for suggesting this idea (though
in a different context of lock migration, but the use-cases are similar).
2. Data migration will be less efficient because of an extra unlock
(with compound lock + read) or an extra lock and unlock (for a
non-compound fop based implementation) for every read it does from src.
Can we use delegations here? The rebalance process can acquire a
mandatory-write-delegation (an exclusive lock with the functionality
that the delegation is recalled when a write operation happens). In that
case the rebalance process can do something like:

1. Acquire a read delegation for the entire file.
2. Migrate the entire file.
3. Remove/unlock/give-back the delegation it has acquired.
If a recall is issued from the brick (when a write happens from a
mount), it completes the current write to dst (or throws away the read
from src) to maintain atomicity. Before doing the next set of (read,
src) and (write, dst), it tries to reacquire the lock.
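The recall-aware migration loop sketched above could look like this
(illustrative only: a threading.Event stands in for the brick's recall
notification, and the delegation calls are hypothetical stand-ins):

```python
import threading

# Sketch of the delegation-driven migration loop. recall stands in for
# the brick's recall notification; the delegation functions are
# hypothetical placeholders for the real protocol calls.
recall = threading.Event()

def acquire_read_delegation():
    pass  # placeholder: would request a read delegation from the brick

def return_delegation():
    pass  # placeholder: would give the delegation back to the brick

def migrate_file(src, dst, chunk=4):
    acquire_read_delegation()          # 1. delegation on the whole file
    offset = 0
    while offset < len(src):
        if recall.is_set():
            # We are between transactions, so simply return the
            # delegation and reacquire it before the next
            # (read, src) + (write, dst) pair.
            return_delegation()
            recall.clear()
            acquire_read_delegation()
        data = src[offset:offset + chunk]   # (read, src)
        dst[offset:offset + chunk] = data   # (write, dst)
        offset += chunk
    return_delegation()                # 3. give back the delegation

src = bytearray(b"ABCDEFGH")
dst = bytearray(b"........")
recall.set()                           # a recall arrives mid-migration
migrate_file(src, dst)
assert bytes(dst) == b"ABCDEFGH"
```

Neither applications nor rebalance starve in this scheme: each recall
costs at most one in-flight transaction before the delegation is handed
back.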
With delegations this simplifies the normal path, when a file is
exclusively handled by rebalance. It also improves the case where a
client and rebalance are conflicting on a file, degrading to mandatory
locks taken by either party.
I would prefer we take the delegation route for such needs in the
future.
Right. But if there are simultaneous accesses to the same file from any
other client and the rebalance process, delegations shall not be
granted, or shall be revoked if granted, even though they are operating
at different offsets. So if you rely only on delegations, migration may
not proceed if an application has held a lock or is doing any I/O.
Does the brick process wait for the response of the delegation holder
(the rebalance process here) before it wipes out the delegation/locks?
If that's the case, the rebalance process can complete one transaction
of (read, src) and (write, dst) before responding to a delegation
recall. That way there is no starvation for either applications or the
rebalance process (though this makes both of them slower, but that
cannot be helped, I think).
Yes. The brick process should wait for a certain period before revoking
the delegations forcefully if they are not returned by the client. Also,
if required (as NFS servers do), we can choose to increase this timeout
value at run time if the client is diligently flushing the data.
Also, ideally the rebalance process has to take a write delegation, as
it would end up writing the data on the destination brick, which shall
affect READ I/Os (though of course we can have special checks/hacks for
internally generated fops).
No, read delegations (on src) are sufficient for our use case. All we
need is that if there is a write on src while the rebalance process
holds a delegation, that write is blocked till the rebalance process
returns the delegation. Write delegations are unnecessarily more
restrictive, as they conflict with application reads too, which we don't
need. For the sake of clarity: the client always writes to src first and
then to dst. Also, writes to src and dst are serialized. So it's
sufficient that we synchronize on src.
Okay. So unlike application clients, the rebalance process could choose
to selectively issue the lease call to only the src brick. If that is
the case, then a read-delegation shall suffice for this use-case.
Thanks,
Soumya
That said, having delegations shall definitely ensure correctness
with respect to exclusive file access.
Thanks,
Soumya
@Soumyak, can something like this be done with delegations?

@Pranith,
Afr does transactions for writing to its subvols. Can you suggest any
optimizations here so that the rebalance process can have a transaction
for (read, src) and (write, dst) with minimal performance overhead?
regards,
Raghavendra.
Comments?
regards,
Raghavendra.
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel
--
Raghavendra G