Re: Issue with posix locks

On 4/1/19 2:23 PM, Xavi Hernandez wrote:
On Mon, Apr 1, 2019 at 10:15 AM Soumya Koduri <skoduri@xxxxxxxxxx> wrote:



    On 4/1/19 10:02 AM, Pranith Kumar Karampuri wrote:
     >
     >
     > On Sun, Mar 31, 2019 at 11:29 PM Soumya Koduri <skoduri@xxxxxxxxxx> wrote:
     >
     >
     >
     >     On 3/29/19 11:55 PM, Xavi Hernandez wrote:
     >      > Hi all,
     >      >
     >      > there is one potential problem with posix locks when used in a
     >      > replicated or dispersed volume.
     >      >
     >      > Some background:
     >      >
     >      > Posix locks allow any process to lock a region of a file
     >      > multiple times, but a single unlock on a given region will
     >      > release all previous locks. Locked regions can be different
     >      > for each lock request and they can overlap. The resulting
     >      > lock will cover the union of all locked regions. A single
     >      > unlock (the region doesn't necessarily need to match any of
     >      > the ranges used for locking) will create a "hole" in the
     >      > currently locked region, independently of how many times a
     >      > lock request covered that region.
     >      >
     >      > For this reason, the locks xlator simply combines the locked
     >      > regions that are requested, but it doesn't track each
     >      > individual lock range.
     >      >
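
To make the merge-and-hole semantics above concrete, here is a minimal standalone sketch. It is not GlusterFS code, only plain fcntl(2); the file path is just an example. The parent takes two overlapping write locks (the kernel keeps their union) and then punches a hole with a single unlock, and a forked child uses F_GETLK to observe which regions are still locked:

/* Hypothetical standalone demo, not GlusterFS code: shows the plain
 * fcntl(2) record-lock semantics described above. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static void lock_range(int fd, short type, off_t start, off_t len)
{
    struct flock fl = { .l_type = type, .l_whence = SEEK_SET,
                        .l_start = start, .l_len = len };
    if (fcntl(fd, F_SETLK, &fl) == -1) {
        perror("F_SETLK");
        exit(1);
    }
}

static void probe(int fd, off_t start, off_t len)
{
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                        .l_start = start, .l_len = len };
    /* F_GETLK reports whether this range conflicts with a lock held by
     * another process (here, the parent). */
    if (fcntl(fd, F_GETLK, &fl) == -1) {
        perror("F_GETLK");
        exit(1);
    }
    printf("bytes %ld-%ld: %s\n", (long)start, (long)(start + len - 1),
           fl.l_type == F_UNLCK ? "free" : "locked by parent");
}

int main(void)
{
    int fd = open("/tmp/lock-demo", O_RDWR | O_CREAT, 0600);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    lock_range(fd, F_WRLCK, 0, 10);  /* lock bytes 0-9               */
    lock_range(fd, F_WRLCK, 5, 15);  /* lock bytes 5-19, union: 0-19 */
    lock_range(fd, F_UNLCK, 8, 4);   /* one unlock of 8-11 punches a
                                        hole, no matter how many lock
                                        requests covered that region  */

    if (fork() == 0) {
        /* The child is a separate process, so the parent's locks show up
         * as conflicts. Expected: 0-7 locked, 8-11 free, 12-19 locked. */
        probe(fd, 0, 8);
        probe(fd, 8, 4);
        probe(fd, 12, 8);
        _exit(0);
    }
    wait(NULL);
    return 0;
}
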
     >      > Under normal circumstances this works fine. But there are
     >      > some cases where this behavior is not sufficient. For example,
     >      > suppose we have a replica 3 volume with quorum = 2. Given the
     >      > special nature of posix locks, AFR sends the lock request
     >      > sequentially to each one of the bricks, to avoid a situation
     >      > where conflicting lock requests from other clients could force
     >      > an unlock of an already locked region on the client that has
     >      > not yet got enough successful locks (i.e. quorum). An unlock
     >      > here would not only cancel the current lock request; it would
     >      > also cancel any previously acquired lock.
     >      >
     >
     >     I may not have fully understood, please correct me. AFAIU, the
     >     lk xlator merges locks only if both the lk-owner and the client
     >     opaque match.
     >
     >     In the case which you have mentioned above, considering that
     >     clientA acquired locks on the majority of the quorum (say nodeA
     >     and nodeB) and clientB on nodeC alone - clientB now has to
     >     unlock/cancel the lock it acquired on nodeC.
     >
     >     You are saying that it could pose a problem if there were already
     >     successful locks taken by clientB for the same region which would
     >     get unlocked by this particular unlock request.. right?
     >
     >     Assuming the previous locks acquired by clientB are shared
     >     (otherwise clientA wouldn't have been granted a lock for the same
     >     region on nodeA & nodeB), they would still hold true on nodeA &
     >     nodeB, as the unlock request was sent only to nodeC. Since the
     >     majority of the quorum nodes still hold the locks taken by
     >     clientB, this isn't a serious issue IMO.
     >
     >     I haven't looked into the heal part but would like to understand
     >     whether this is really an issue in normal scenarios as well.
     >
     >
     > This is how I understood the code. Consider the following case:
     > nodes A, B, C have locks with start and end offsets 5-15 from
     > mount-1 and lock-range 2-3 from mount-2.
     > Suppose mount-1 requests a nonblocking lock with lock-range 1-7 and,
     > in parallel, mount-2 issues an unlock of 2-3 as well.
     >
     > nodeA gets the unlock from mount-2 with range 2-3, then the lock
     > from mount-1 with range 1-7, so the lock is granted and merged to
     > give 1-15.
     > nodeB gets the lock from mount-1 with range 1-7 before the unlock
     > of 2-3, which leads to EAGAIN. That triggers unlocks of the granted
     > locks in mount-1, which ends up doing an unlock of 1-7 on nodeA,
     > leaving lock-range 8-15 instead of the original 5-15 on nodeA,
     > whereas nodeB and nodeC will have range 5-15.
     >
     > Let me know if my understanding is wrong.
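
To illustrate the range arithmetic in the example above, here is a small self-contained sketch. It is not the actual locks xlator code; the names and the simplified interval handling are mine. It models one lock owner's byte ranges as merged intervals and replays the nodeA sequence, ending with 8-15 on nodeA while nodeB and nodeC keep 5-15:

/* Simplified sketch, not the locks xlator: one owner's lock ranges kept
 * as merged intervals, as described earlier in the thread. mount-2's
 * 2-3 lock is omitted because ranges only merge per lk-owner/client. */
#include <stdio.h>

#define MAXR 16

typedef struct { long start; long end; } range_t;
typedef struct { range_t r[MAXR]; int n; } lockset_t;

/* Add a range and absorb anything it overlaps or touches
 * (the "union of all locked regions" behaviour). */
static void lock_merge(lockset_t *s, long start, long end)
{
    range_t out[MAXR];
    int n = 0;
    for (int i = 0; i < s->n; i++) {
        if (s->r[i].end + 1 < start || s->r[i].start > end + 1) {
            out[n++] = s->r[i];                 /* disjoint, keep as-is */
        } else {                                /* overlapping: absorb  */
            if (s->r[i].start < start) start = s->r[i].start;
            if (s->r[i].end > end) end = s->r[i].end;
        }
    }
    out[n++] = (range_t){ start, end };
    s->n = n;
    for (int i = 0; i < n; i++) s->r[i] = out[i];
}

/* Remove a range: a single unlock punches a hole regardless of how
 * many lock requests covered the region. */
static void unlock_punch(lockset_t *s, long start, long end)
{
    range_t out[MAXR];
    int n = 0;
    for (int i = 0; i < s->n; i++) {
        range_t cur = s->r[i];
        if (cur.end < start || cur.start > end) {
            out[n++] = cur;                     /* untouched            */
            continue;
        }
        if (cur.start < start)                  /* piece left of hole   */
            out[n++] = (range_t){ cur.start, start - 1 };
        if (cur.end > end)                      /* piece right of hole  */
            out[n++] = (range_t){ end + 1, cur.end };
    }
    s->n = n;
    for (int i = 0; i < n; i++) s->r[i] = out[i];
}

static void dump(const char *who, const lockset_t *s)
{
    printf("%s:", who);
    for (int i = 0; i < s->n; i++)
        printf(" %ld-%ld", s->r[i].start, s->r[i].end);
    printf("\n");
}

int main(void)
{
    /* mount-1 starts with 5-15 on every brick. */
    lockset_t nodeA = { .r = { { 5, 15 } }, .n = 1 };
    lockset_t nodeB = nodeA, nodeC = nodeA;

    /* nodeA: lock 1-7 is granted and merged, then the rollback
     * (triggered by EAGAIN on nodeB) unlocks 1-7 again. */
    lock_merge(&nodeA, 1, 7);                   /* nodeA now 1-15 */
    unlock_punch(&nodeA, 1, 7);                 /* nodeA now 8-15 */

    dump("nodeA", &nodeA);                      /* prints 8-15    */
    dump("nodeB", &nodeB);                      /* prints 5-15    */
    dump("nodeC", &nodeC);                      /* prints 5-15    */
    return 0;
}
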

    Both of us mentioned the same points. So in the example you gave,
    mount-1 lost its previous lock on nodeA, but the majority of the
    quorum (nodeB and nodeC) still have the previous lock (range: 5-15)
    intact. So this shouldn't ideally lead to any issues, as other
    conflicting locks are blocked or failed by the majority of the nodes
    (provided there are no brick dis/re-connects).


But brick disconnects will happen (upgrades, disk failures, server maintenance, ...). Anyway, even without brick disconnects, in the previous example we have nodeA with range 8-15, and nodes B and C with range 5-15. If another lock from mount-2 comes for range 5-7, it will succeed on nodeA, but it will block on nodeB. At this point, mount-1 could attempt a lock on the same range. It will block on nodeA (where mount-2 already holds 5-7), so we have a deadlock: each mount is waiting for a range the other one holds.

In general, having discrepancies between bricks is not good because sooner or later it will cause some bad inconsistency.


    Wrt brick disconnects/re-connects, if we can get general lock-healing
    support (not getting into implementation details atm), that should
    take care of correcting the lock range on nodeA as well, right?


The problem we have seen is that, to be able to correctly heal currently acquired locks on brick reconnect, there are cases where we need to release a lock that has already been granted (because the current owner doesn't have enough quorum and a just-recovered connection tries to claim/heal it). In that case we need to deal with locks that have already been merged, but without interfering with other existing locks that already have quorum.


Okay. Thanks for the detailed explanation. That clears my doubts.

-Soumya
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-devel



