On 4/1/19 2:23 PM, Xavi Hernandez wrote:
On Mon, Apr 1, 2019 at 10:15 AM Soumya Koduri <skoduri@xxxxxxxxxx> wrote:
On 4/1/19 10:02 AM, Pranith Kumar Karampuri wrote:
>
> On Sun, Mar 31, 2019 at 11:29 PM Soumya Koduri <skoduri@xxxxxxxxxx> wrote:
>
> On 3/29/19 11:55 PM, Xavi Hernandez wrote:
> > Hi all,
> >
> > there is one potential problem with posix locks when used in a
> > replicated or dispersed volume.
> >
> > Some background:
> >
> > Posix locks allow any process to lock a region of a file multiple
> > times, but a single unlock on a given region will release all
> > previous locks. Locked regions can be different for each lock
> > request and they can overlap. The resulting lock will cover the
> > union of all locked regions. A single unlock (the region doesn't
> > necessarily need to match any of the ranges used for locking) will
> > create a "hole" in the currently locked region, independently of
> > how many times a lock request covered that region.
> >
> > For this reason, the locks xlator simply combines the locked
> > regions that are requested, but it doesn't track each individual
> > lock range.
> >
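To make the semantics above concrete, here is a minimal standalone
sketch (an illustration only, not Gluster code; the file name and byte
ranges are arbitrary). On Linux, the two overlapping locks taken by the
parent merge into a single 1-15 region, and a single unlock of 2-3
punches a hole in it no matter how many lock calls covered those bytes.
The child observes this through F_GETLK, which only reports locks held
by other processes:

/* Sketch: POSIX byte-range locks merge per owner, and one unlock
 * punches a hole in the merged region. Not Gluster code. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static void lk(int fd, int type, off_t start, off_t len)
{
    struct flock fl = { .l_type = type, .l_whence = SEEK_SET,
                        .l_start = start, .l_len = len };
    if (fcntl(fd, F_SETLK, &fl) == -1) { perror("F_SETLK"); exit(1); }
}

static void probe(int fd, off_t start, off_t len)
{
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                        .l_start = start, .l_len = len };
    if (fcntl(fd, F_GETLK, &fl) == -1) { perror("F_GETLK"); exit(1); }
    if (fl.l_type == F_UNLCK)
        printf("  bytes %jd-%jd: unlocked\n", (intmax_t)start,
               (intmax_t)(start + len - 1));
    else
        printf("  bytes %jd-%jd: locked from %jd, length %jd\n",
               (intmax_t)start, (intmax_t)(start + len - 1),
               (intmax_t)fl.l_start, (intmax_t)fl.l_len);
}

int main(void)
{
    int fd = open("/tmp/posix-lock-demo", O_RDWR | O_CREAT, 0600);
    int step[2], done[2];
    char c;

    if (fd == -1 || pipe(step) == -1 || pipe(done) == -1) {
        perror("setup");
        return 1;
    }

    if (fork() == 0) {                 /* child: observer */
        read(step[0], &c, 1);          /* wait until both locks are taken */
        printf("after locking 5-15 and 1-7:\n");
        probe(fd, 0, 20);              /* one merged lock: 1-15 */
        write(done[1], "x", 1);
        read(step[0], &c, 1);          /* wait for the single unlock */
        printf("after a single unlock of 2-3:\n");
        probe(fd, 0, 2);               /* byte 1 is still locked */
        probe(fd, 2, 2);               /* 2-3 is now a hole */
        probe(fd, 4, 16);              /* 4-15 is still locked */
        _exit(0);
    }

    lk(fd, F_WRLCK, 5, 11);            /* lock bytes 5-15 */
    lk(fd, F_WRLCK, 1, 7);             /* lock bytes 1-7: merged into 1-15 */
    write(step[1], "x", 1);
    read(done[0], &c, 1);
    lk(fd, F_UNLCK, 2, 2);             /* one unlock of 2-3 removes it all */
    write(step[1], "x", 1);
    wait(NULL);
    return 0;
}

Before the unlock the child sees one conflicting lock covering 1-15;
afterwards 2-3 is free while 1 and 4-15 remain locked, i.e. the hole
does not care that 2-3 had been covered by two different lock calls.
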
> > Under normal circumstances this works fine. But there are some
> > cases where this behavior is not sufficient. For example, suppose
> > we have a replica 3 volume with quorum = 2. Given the special
> > nature of posix locks, AFR sends the lock request sequentially to
> > each one of the bricks, to avoid a situation where a conflicting
> > lock request from another client would require unlocking an
> > already locked region on the client that has not got enough
> > successful locks (i.e. quorum). An unlock here would not only
> > cancel the current lock request; it would also cancel any
> > previously acquired lock.
> >
>
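The quorum and rollback part can be pictured with an equally rough
sketch. Everything here is invented for illustration (mock bricks with
a single lock slot each, made-up names such as replica_lock); it is not
the actual AFR implementation. The point is only that a failed attempt
to reach quorum ends with unlocks of whatever was already granted, and
on a real brick such an unlock would also release any previously merged
lock on the same region:

/* Illustration only: "lock brick by brick, roll back if quorum is not
 * reached". Mock bricks, invented names, not the AFR code. */
#include <stdbool.h>
#include <stdio.h>

#define BRICKS 3
#define QUORUM 2

struct brick { bool locked; int owner; };   /* one lock slot per mock brick */

static struct brick bricks[BRICKS];

static bool brick_trylock(int b, int owner)
{
    if (bricks[b].locked && bricks[b].owner != owner)
        return false;                       /* EAGAIN on a real brick */
    bricks[b].locked = true;
    bricks[b].owner = owner;
    return true;
}

static void brick_unlock(int b, int owner)
{
    if (bricks[b].locked && bricks[b].owner == owner)
        bricks[b].locked = false;
}

static bool replica_lock(int owner)
{
    bool granted[BRICKS];
    int count = 0;

    for (int b = 0; b < BRICKS; b++) {      /* one brick at a time */
        granted[b] = brick_trylock(b, owner);
        if (granted[b])
            count++;
    }
    if (count >= QUORUM)
        return true;

    for (int b = 0; b < BRICKS; b++)        /* rollback the partial locks */
        if (granted[b])
            brick_unlock(b, owner);
    return false;
}

int main(void)
{
    /* Another client already holds bricks 1 and 2, so client 1 cannot
     * reach quorum and has to undo the lock it got on brick 0. */
    bricks[1] = (struct brick){ true, 99 };
    bricks[2] = (struct brick){ true, 99 };

    printf("client 1: %s\n", replica_lock(1) ? "granted" : "rolled back");
    printf("brick 0 still locked: %s\n", bricks[0].locked ? "yes" : "no");
    return 0;
}

It prints "rolled back" and shows brick 0 unlocked again; in the real
system that rollback unlock is where a previously merged range can be
lost.
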
> I may not have fully understood, please correct me. AFAIU, the lk
> xlator merges locks only if both the lk-owner and the client opaque
> match.
>
> In the case which you have mentioned above, considering clientA
> acquired locks on the majority of the quorum (say nodeA and nodeB)
> and clientB on nodeC alone, clientB now has to unlock/cancel the
> lock it acquired on nodeC.
>
> You are saying that it could pose a problem if there were already
> successful locks taken by clientB for the same region which would
> get unlocked by this particular unlock request, right?
>
> Assuming the previous locks acquired by clientB are shared
> (otherwise clientA wouldn't have been granted a lock for the same
> region on nodeA & nodeB), they would still hold true on nodeA &
> nodeB, as the unlock request was sent only to nodeC. Since the
> majority of quorum nodes still hold clientB's locks, this isn't a
> serious issue IMO.
>
> I haven't looked into the heal part, but I would like to understand
> whether this is really an issue in normal scenarios as well.
>
>
> This is how I understood the code. Consider the following case:
> nodes A, B and C have locks with start and end offsets 5-15 from
> mount-1 and lock-range 2-3 from mount-2.
> Suppose mount-1 requests a nonblocking lock with lock-range 1-7 and,
> in parallel, mount-2 issues an unlock of 2-3 as well.
>
> nodeA got the unlock from mount-2 with range 2-3 and then the lock
> from mount-1 with range 1-7, so the lock is granted and merged to
> give 1-15.
> nodeB got the lock from mount-1 with range 1-7 before the unlock of
> 2-3, which leads to EAGAIN. That triggers unlocks of the granted
> locks in mount-1, which ends up doing an unlock of 1-7 on nodeA,
> leaving lock-range 8-15 instead of the original 5-15 on nodeA,
> whereas nodeB and nodeC will have range 5-15.
>
> Let me know if my understanding is wrong.
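For what it's worth, the range arithmetic in that example can be
replayed with a toy model of the per-owner merged-region bookkeeping.
This is only an illustration (invented helpers lock_range/unlock_range,
inclusive ranges as in the example), not the locks xlator code:

/* Toy model: lock_range() unions a byte range into an owner's set,
 * unlock_range() subtracts one exactly once, no matter how many lock
 * calls covered it. Not the locks xlator. */
#include <stdio.h>

#define MAXR 8

struct range { long start, end; };
struct owner { struct range r[MAXR]; int n; };

/* Union [start, end] into the set, merging overlapping/adjacent ranges. */
static void lock_range(struct owner *o, long start, long end)
{
    struct range out[MAXR];
    int n = 0;

    for (int i = 0; i < o->n; i++) {
        if (o->r[i].end + 1 < start || end + 1 < o->r[i].start) {
            out[n++] = o->r[i];                  /* disjoint: keep */
        } else {                                 /* overlapping: absorb */
            if (o->r[i].start < start) start = o->r[i].start;
            if (o->r[i].end > end) end = o->r[i].end;
        }
    }
    out[n++] = (struct range){ start, end };
    for (int i = 0; i < n; i++) o->r[i] = out[i];
    o->n = n;
}

/* Subtract [start, end], possibly splitting a range ("punching a hole"). */
static void unlock_range(struct owner *o, long start, long end)
{
    struct range out[MAXR];
    int n = 0;

    for (int i = 0; i < o->n; i++) {
        struct range c = o->r[i];
        if (c.end < start || end < c.start) { out[n++] = c; continue; }
        if (c.start < start) out[n++] = (struct range){ c.start, start - 1 };
        if (c.end > end) out[n++] = (struct range){ end + 1, c.end };
    }
    for (int i = 0; i < n; i++) o->r[i] = out[i];
    o->n = n;
}

static void dump(const char *node, struct owner *o)
{
    printf("%s, mount-1:", node);
    for (int i = 0; i < o->n; i++)
        printf(" %ld-%ld", o->r[i].start, o->r[i].end);
    printf("\n");
}

int main(void)
{
    struct owner nodeA = { { { 5, 15 } }, 1 };   /* mount-1 holds 5-15 */
    struct owner nodeB = { { { 5, 15 } }, 1 };   /* on every node */

    /* nodeA: mount-2's unlock of 2-3 lands first, so mount-1's 1-7 is
     * granted and merges with 5-15 into 1-15. */
    lock_range(&nodeA, 1, 7);

    /* nodeB: mount-1's 1-7 arrives before the unlock of 2-3 and fails
     * with EAGAIN; nothing changes there. The failure makes the client
     * roll back the lock it did get, i.e. unlock 1-7 on nodeA only: */
    unlock_range(&nodeA, 1, 7);

    dump("nodeA", &nodeA);   /* 8-15: the original 5-7 part is gone   */
    dump("nodeB", &nodeB);   /* 5-15: nodeB (and nodeC) are untouched */
    return 0;
}

It prints 8-15 for nodeA and 5-15 for nodeB, i.e. exactly the
discrepancy described above.
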
Both of us mentioned the same points. So in the example you gave,
mount-1 lost its previous lock on nodeA, but the majority of the quorum
(nodeB and nodeC) still has the previous lock (range 5-15) intact.
Ideally this shouldn't lead to any issues, as other conflicting locks
are blocked or failed by the majority of the nodes (provided there are
no brick dis/re-connects).

But brick disconnects will happen (upgrades, disk failures, server
maintenance, ...). Anyway, even without brick disconnects, in the
previous example we have nodeA with range 8-15, and nodes B and C with
range 5-15. If another lock from mount-2 comes for range 5-7, it will
succeed on nodeA but block on nodeB. At this point, mount-1 could
attempt a lock on the same range. It will block on nodeA, so we have a
deadlock.

In general, having discrepancies between bricks is not good, because
sooner or later it will cause some bad inconsistency.

Wrt brick disconnects/re-connects, if we can get general lock-healing
support (not getting into implementation details atm), that should take
care of correcting the lock range on nodeA as well, right?

The problem we have seen is that, to be able to correctly heal
currently acquired locks on brick reconnect, there are cases where we
need to release a lock that has already been granted (because the
current owner doesn't have enough quorum and a just recovered
connection tries to claim/heal it). In this case we need to deal with
locks that have already been merged, but without interfering with other
existing locks that already have quorum.

Okay. Thanks for the detailed explanation. That clears my doubts.
-Soumya
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-devel