On 01/03/2015 12:37 AM, Shyam wrote:
On 12/17/2014 02:15 AM, Raghavendra G wrote:
On Wed, Dec 17, 2014 at 1:25 AM, Shyam <srangana@xxxxxxxxxx> wrote:
This mail intends to present the problem of lock migration across
subvolumes and to seek solutions/thoughts around the same; any
feedback/corrections are appreciated.
# Current state of file locks post file migration during rebalance
Currently, when a file is migrated during rebalance, its lock
information is not transferred over from the old subvol to the new
subvol on which the file now resides.
As further lock requests, post migration of the file, are sent to the
new subvol, potential lock conflicts go undetected until the locks
are migrated over.
The term locks above can refer to the POSIX locks acquired using the
FOP lk by consumers of the volume, or to the gluster internal
inode/dentry locks. For now we limit the discussion to the POSIX
locks supported by the FOP lk.
# Other areas in gluster that migrate locks
The current scheme for migrating locks in gluster on graph switches
triggers an fd migration process that moves the lock information
from the old fd to the new fd. This is driven by the topmost layers
of the gluster client stack (FUSE, gfapi).
This is done using (get/set)xattr calls with the attr name
"trusted.glusterfs.lockinfo": a getxattr on the old fd fetches the
required lockinfo key, and a setxattr on the new fd migrates the
locks from the old fd to the new fd (see the sketch below).
IOW, there is very little information transferred, as the locks are
migrated across fds on the same subvolume and not across subvolumes.
Additionally, locks that are in the blocked state do not seem to be
migrated (at least the function to do so in FUSE is empty
(fuse_handle_blocked_locks); need to run a test case to confirm),
nor responded to with an error.
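A rough sketch of this fd-to-fd handoff, using the syncop wrappers;
the exact signatures vary across releases, so treat this as an
approximation of the mechanism, not the actual implementation:

#include "syncop.h"  /* assumed include path within the gluster tree */

#define GF_XATTR_LOCKINFO_KEY "trusted.glusterfs.lockinfo"

static int
migrate_fd_locks (xlator_t *subvol, fd_t *oldfd, fd_t *newfd)
{
        dict_t *lockinfo = NULL;
        int     ret      = -1;

        /* Fetch the opaque lockinfo blob for the old fd ... */
        ret = syncop_fgetxattr (subvol, oldfd, &lockinfo,
                                GF_XATTR_LOCKINFO_KEY);
        if (ret < 0)
                goto out;

        /* ... and hand it to the new fd; the locks xlator re-keys
         * its lock state from the old fd to the new one. */
        ret = syncop_fsetxattr (subvol, newfd, lockinfo, 0);
out:
        if (lockinfo)
                dict_unref (lockinfo);
        return ret;
}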
# High level solution requirements when migrating locks across subvols
1) Block/deny new lock acquisitions on the new subvol, till locks
are migrated
- So that new locks with ranges overlapping the older ones are not
granted
- Potentially return EINTR on such requests?
2) Ensure all _acquired_ locks from all clients are migrated first
- So that if and when blocked lock requests are placed, these really
do block for the previous reasons and are not granted now
3) Migrate blocked locks after the acquired locks are migrated (in
any order?)
- OR, send back EINTR for the blocked locks
(When we have upcalls/delegations added as features, those would
have similar requirements for migration across subvolumes)
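A minimal sketch of the ordering these requirements imply; the state
names below are illustrative, not existing gluster code:

/* Phases of a lock migration for one file, in the order the
 * requirements above demand. */
enum lock_migration_phase {
        LM_FENCE_NEW_LOCKS = 0, /* (1) deny/block new acquisitions on
                                 * the new subvol (or return EINTR) */
        LM_MIGRATE_GRANTED,     /* (2) replay all _acquired_ locks
                                 * from all clients first */
        LM_MIGRATE_BLOCKED,     /* (3) then queue blocked locks, so
                                 * they block for the original
                                 * reasons */
        LM_DONE,                /* lift the fence; normal locking
                                 * resumes on the new subvol */
};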
# Potential processes that could migrate the locks and issues thereof
1) The rebalance process that migrates the file could also migrate
the locks, which would not involve any clients of the gluster volume
Issues:
- Lock information is fd specific; when migrating these locks, the
clients need not have detected that the file is migrated, and hence
may not have opened an fd against the new subvol, which when missing
makes this form of migration a little more interesting (see the
record sketch below)
- Lock information also has a client connection specific pointer
(client_t) that needs to be reassigned on the new subvol
- Other subvol specific information maintained in the lock that
needs to be migrated over will suffer the same limitations/solutions
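As an illustration, the state such a transfer would need to carry per
lock might look like the following. This record is hypothetical; the
field names echo the server-side posix lock (offset range, type,
owner) plus the client identity that must be re-bound on the new
subvol:

#include <stdint.h>
#include <sys/types.h>

struct migrated_lock_record {
        uint64_t fl_start;        /* locked range: start offset */
        uint64_t fl_end;          /* locked range: end offset */
        int32_t  fl_type;         /* F_RDLCK / F_WRLCK */
        pid_t    client_pid;      /* lock owner's pid, as seen by the
                                   * server */
        char     lk_owner[64];    /* opaque lock owner blob */
        char     client_uid[256]; /* connection id, to be re-mapped to
                                   * a client_t on the destination */
        uint8_t  blocked;         /* granted vs. still-blocked lock */
};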
The tricky thing here is that the rebalance process has no control
over when:
1. an fd will be opened on the dst-node, since clients open fds on
the dst-node on-demand, based on the I/O happening through them.
(Read ** below; the remaining thoughts/responses largely assume we
can identify clients across nodes)
We should _maybe_ have dangling fds opened on the dst-node, which can
be mapped to the incoming requests from the clients (whenever they
come). In case they never come, we still have the current problem
that the fd on the src-node is leaked (or held till a client request
comes its way).
A lock migration should migrate the fd and its associated
information, and leave it dangling till the client tries to establish
the same fd (i.e. via the DHT xlator on the client). Thoughts?
2. a client establishes a connection on the dst-node (the client
might've been cut off from the dst-node).
The first question is whether we have a _static_ client mapping. If
we do, then the client need not be connected when we migrate the
file, and we can leave a dangling fd at the destination. In case
clients do not have this, then we can deny file migration, as we are
unable to map out the client relation on the other end. Would that be
reasonable?
Also, on reconnects do clients get different identification information?
Yes, on reconnects clients connect with a different ID.
Unless we have a global mapping (i.e. a client can always be
identified using the same UUID irrespective of the brick we are
looking at), this seems like a difficult thing to achieve.
(**) Do we have any such mapping at present? Meaning, if a client is
connected to src and dst subvolumes, then would it have the same
UUID/connection information? Or, _any_ way to identify with certainty
that they are the same client?
In case the client is not connected to the dst-node, is there any way to
identify the client as being the same as the one connected to the
src-node, when it connects later to the dst-node?
Unless the server can parse the client_uid, IMO it may not be
straightforward. However, consider the below code snippet:
>>>>>>>
        /* When lock-heal is enabled:
         *   With multiple graphs possible in the same process, we
         *   need a field to bring the uniqueness. Graph-ID should be
         *   enough to get the job done.
         * When lock-heal is disabled, connection-id should always be
         * unique so that the server never gets to reuse the previous
         * connection resources, and it cleans up the resources on
         * every disconnect. Otherwise it may lead to stale resources,
         * i.e. leaked file descriptors, inode/entry locks.
         */
        if (!conf->lk_heal) {
                snprintf (counter_str, sizeof (counter_str),
                          "-%"PRIu64, conf->setvol_count);
                conf->setvol_count++;
        }

        ret = gf_asprintf (&process_uuid_xl, "%s-%s-%d%s",
                           this->ctx->process_uuid, this->name,
                           this->graph->id, counter_str);
        if (-1 == ret) {
                gf_log (this->name, GF_LOG_ERROR,
                        "asprintf failed while setting process_uuid");
                goto fail;
        }
<<<<<<<<<
Looks like there was an attempt to make it uniform in the lk_heal
case (which is disabled atm). We may need to enable it and check/fix
the issues, which may help us in migrating locks across bricks, as
you have proposed, along with self-heal of locks.
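To illustrate: per the snippet above, the connection id is composed
as "<process_uuid>-<name>-<graph_id>[-<setvol_count>]", and only the
trailing counter varies across reconnects when lk_heal is disabled.
A stable identity could then be derived by stripping that suffix.
This is a hypothetical sketch (names and id values are illustrative,
not existing gluster code):

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Truncate, in place, a trailing "-<digits>" suffix, if present. */
static void
strip_setvol_counter (char *uid)
{
        char *dash = strrchr (uid, '-');
        if (!dash || dash[1] == '\0')
                return;
        for (char *p = dash + 1; *p; p++)
                if (!isdigit ((unsigned char)*p))
                        return;
        *dash = '\0';
}

int
main (void)
{
        /* Two connection ids from the same client, before and after
         * a reconnect (illustrative values). */
        char first[]  = "node1-12345-gfapi-client-0-3";
        char second[] = "node1-12345-gfapi-client-0-4";

        strip_setvol_counter (first);
        strip_setvol_counter (second);

        printf ("same client: %s\n",
                strcmp (first, second) == 0 ? "yes" : "no");
        return 0;
}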
Thanks,
Soumya
Benefits:
- Can lock out/block newer lock requests effectively
- Need not _wait_ till all clients have registered that the file
is under migration and/or migrated their locks
2) The DHT xlator in each client could be held responsible for
migrating its locks to the new subvolume
Issues:
- Somehow need to let every client know that locks need to be
migrated (upcall infrastructure?)
- What if some client is not reachable at the given time?
- Have to wait till all clients replay the locks
Benefits:
- Hmmm... nothing really; if we could do it via the rebalance
process itself, the solution may be better.
# Overall thoughts
- We could/should return EINTR for blocked locks, both in the case of
a graph switch and in the case of a file migration; this would
relieve the design of that particular complexity, and EINTR is a
legal error to return from a flock/fcntl operation (see the sketch
further below)
- If we can extract and map out all relevant lock information across
subvolumes, then having rebalance do this work seems like a good
fit. Additionally this could serve as a good way to migrate upcall
requests and state as well
Adding a further note here: we could deny migration of a file in case
we are unable to map out all the relevant lock information. That way,
some files would not be migrated, due to the inability to migrate all
relevant information regarding the same.
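For reference, a minimal sketch of the client-side view: an
application taking a blocking lock over a gluster mount must already
handle EINTR from fcntl, so returning it on migration is well-formed
(the mount path below is illustrative):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main (void)
{
        int fd = open ("/mnt/gluster/vol/somefile", O_RDWR);
        if (fd < 0) {
                perror ("open");
                return 1;
        }

        struct flock fl = {
                .l_type   = F_WRLCK,
                .l_whence = SEEK_SET,
                .l_start  = 0,
                .l_len    = 0,   /* 0 == lock the whole file */
        };

        /* F_SETLKW blocks until the lock is granted; if the server
         * returns EINTR (e.g. on file migration, per the proposal),
         * the application simply retries or bails out. */
        while (fcntl (fd, F_SETLKW, &fl) == -1) {
                if (errno == EINTR) {
                        fprintf (stderr, "lock interrupted, retrying\n");
                        continue;
                }
                perror ("fcntl");
                close (fd);
                return 1;
        }

        printf ("lock acquired\n");
        close (fd);
        return 0;
}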
Thoughts?
Shyam
--
Raghavendra G
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel