Re: Issue with posix locks

Xavi Hernandez <xhernandez@xxxxxxxxxx> · Mon, 1 Apr 2019 10:53:23 +0200

On Mon, Apr 1, 2019 at 10:15 AM Soumya Koduri <skoduri@xxxxxxxxxx> wrote:

On 4/1/19 10:02 AM, Pranith Kumar Karampuri wrote:

> 

> 

> On Sun, Mar 31, 2019 at 11:29 PM Soumya Koduri <skoduri@xxxxxxxxxx 

> <mailto:skoduri@xxxxxxxxxx>> wrote:

> 

> 

> 

>     On 3/29/19 11:55 PM, Xavi Hernandez wrote:

>      > Hi all,

>      >

>      > there is one potential problem with posix locks when used in a

>      > replicated or dispersed volume.

>      >

>      > Some background:

>      >

>      > Posix locks allow any process to lock a region of a file multiple

>     times,

>      > but a single unlock on a given region will release all previous

>     locks.

>      > Locked regions can be different for each lock request and they can

>      > overlap. The resulting lock will cover the union of all locked

>     regions.

>      > A single unlock (the region doesn't necessarily need to match any

>     of the

>      > ranges used for locking) will create a "hole" in the currently

>     locked

>      > region, independently of how many times a lock request covered

>     that region.

>      >

>      > For this reason, the locks xlator simply combines the locked regions

>      > that are requested, but it doesn't track each individual lock range.

>      >

>      > Under normal circumstances this works fine. But there are some cases

>      > where this behavior is not sufficient. For example, suppose we

>     have a

>      > replica 3 volume with quorum = 2. Given the special nature of posix

>      > locks, AFR sends the lock request sequentially to each one of the

>      > bricks, to avoid that conflicting lock requests from other

>     clients could

>      > require to unlock an already locked region on the client that has

>     not

>      > got enough successful locks (i.e. quorum). An unlock here not

>     only would

>      > cancel the current lock request. It would also cancel any previously

>      > acquired lock.

>      >

> 

>     I may not have fully understood, please correct me. AFAIU, lk xlator

>     merges locks only if both the lk-owner and the client opaque matches.

> 

>     In the case which you have mentioned above, considering clientA

>     acquired

>     locks on majority of quorum (say nodeA and nodeB) and clientB on nodeC

>     alone- clientB now has to unlock/cancel the lock it acquired on nodeC.

> 

>     You are saying the it could pose a problem if there were already

>     successful locks taken by clientB for the same region which would get

>     unlocked by this particular unlock request..right?

> 

>     Assuming the previous locks acquired by clientB are shared (otherwise

>     clientA wouldn't have got granted lock for the same region on nodeA &

>     nodeB), they would still hold true on nodeA & nodeB  as the unlock

>     request was sent to only nodeC. Since the majority of quorum nodes

>     still

>     hold the locks by clientB, this isn't serious issue IMO.

> 

>     I haven't looked into heal part but would like to understand if this is

>     really an issue in normal scenarios as well.

> 

> 

> This is how I understood the code. Consider the following case:

> Nodes A, B, C have locks with start and end offsets: 5-15 from mount-1 

> and lock-range 2-3 from mount-2.

> If mount-1 requests nonblocking lock with lock-range 1-7 and in parallel 

> lets say mount-2 issued unlock of 2-3 as well.

> 

> nodeA got unlock from mount-2 with range 2-3 then lock from mount-1 with 

> range 1-7, so the lock is granted and merged to give 1-15

> nodeB got lock from mount-1 with range 1-7 before unlock of 2-3 which 

> leads to EAGAIN which will trigger unlocks on granted lock in mount-1 

> which will end up doing unlock of 1-7 on nodeA leading to lock-range 

> 8-15 instead of the original 5-15 on nodeA. Whereas nodeB and nodeC will 

> have range 5-15.

> 

> Let me know if my understanding is wrong.

Both of us mentioned the same points. So in the example you gave , 

mount-1 lost its previous lock on nodeA but majority of the quorum 

(nodeB and nodeC) still have the previous lock  (range: 5-15) intact. So 

this shouldn't ideally lead to any issues as other conflicting locks are 

blocked or failed by majority of the nodes (provided there are no brick 

dis/re-connects).

But brick disconnects will happen (upgrades, disk failures, server maintenance, ...). Anyway, even without brick disconnects, in the previous example we have nodeA with range 8-15, and nodes B and C with range 5-15. If another lock from mount-2 comes for range 5-7, it will succeed on nodeA, but it will block on nodeB. At this point, mount-1 could attempt a lock on same range. It will block on nodeA, so we have a deadlock.

In general, having discrepancies between bricks is not good because sooner or later it will cause some bad inconsistency.

Wrt to brick disconnects/re-connects, if we can get in general lock 

healing (not getting into implementation details atm) support, that 

should take care of correcting lock range on nodeA as well right?

The problem we have seen is that to be able to correctly heal currently acquired locks on brick reconnect, there are cases where we need to release a lock that has already been granted (because the current owner doesn't have enough quorum and a just recovered connection tries to claim/heal it). In this case we need to deal with locks that have already been merged, but without interfering with other existing locks that already have quorum.

That said I am not suggesting that we should stick to existing behavior, 

just trying to get clarification to check if we can avoid any 

overhead/side-effects with maintaining multiple locks.

Right now is the only way we have found to provide a correct solution both for some cases of concurrent lock/unlock requests, and lock healing.

Regards,

Xavi

Thanks,

Soumya

> 

> 

>     Thanks,

>     Soumya

> 

>      > However, when something goes wrong (a brick dies during a lock

>     request,

>      > or there's a network partition or some other weird situation), it

>     could

>      > happen that even using sequential locking, only one brick

>     succeeds the

>      > lock request. In this case, AFR should cancel the previous lock

>     (and it

>      > does), but this also cancels any previously acquired lock on that

>      > region, which is not good.

>      >

>      > A similar thing can happen if we try to recover (heal) posix

>     locks that

>      > were active after a brick has been disconnected (for any reason) and

>      > then reconnected.

>      >

>      > To fix all these situations we need to change the way posix locks

>     are

>      > managed by locks xlator. One possibility would be to embed the lock

>      > request inside an inode transaction using inodelk. Since inodelks

>     do not

>      > suffer this problem, the follwing posix lock could be sent safely.

>      > However this implies an additional network request, which could

>     cause

>      > some performance impact. Eager-locking could minimize the impact

>     in some

>      > cases. However this approach won't work for lock recovery after a

>      > disconnect.

>      >

>      > Another possibility is to send a special partial posix lock request

>      > which won't be immediately merged with already existing locks once

>      > granted. An additional confirmation request of the partial posix

>     lock

>      > will be required to fully grant the current lock and merge it

>     with the

>      > existing ones. This requires a new network request, which will add

>      > latency, and makes everything more complex since there would be more

>      > combinations of states in which something could fail.

>      >

>      > So I think one possible solution would be the following:

>      >

>      > 1. Keep each posix lock as an independent object in locks xlator.

>     This

>      > will make it possible to "invalidate" any already granted lock

>     without

>      > affecting already established locks.

>      >

>      > 2. Additionally, we'll keep a sorted list of non-overlapping

>     segments of

>      > locked regions. And we'll count, for each region, how many locks are

>      > referencing it. One lock can reference multiple segments, and each

>      > segment can be referenced by multiple locks.

>      >

>      > 3. An additional lock request that overlaps with an existing

>     segment,

>      > can cause this segment to be split to satisfy the non-overlapping

>     property.

>      >

>      > 4. When an unlock request is received, all segments intersecting

>     with

>      > the region are eliminated (it may require some segment splits on the

>      > edges), and the unlocked region is subtracted from each lock

>     associated

>      > to the segment. If a lock gets an empty region, it's removed.

>      >

>      > 5. We'll create a special "remove lock" request that doesn't

>     unlock a

>      > region but removes an already granted lock. This will decrease the

>      > number of references to each of the segments this lock was

>     covering. If

>      > some segment reaches 0, it's removed. Otherwise it remains there.

>     This

>      > special request will only be used internally to cancel already

>     acquired

>      > locks that cannot be fully granted due to quorum issues or any other

>      > problem.

>      >

>      > In some weird cases, the list of segments can be huge (many locks

>      > overlapping only on a single byte, so each segment represents

>     only one

>      > byte). We can try to find some smarter structure that minimizes this

>      > problem or limit the number of segments (for example returning

>     ENOLCK

>      > when there are too many).

>      >

>      > What do you think ?

>      >

>      > Xavi

>      >

>      > _______________________________________________

>      > Gluster-devel mailing list

>      > Gluster-devel@xxxxxxxxxxx <mailto:Gluster-devel@xxxxxxxxxxx>

>      > https://lists.gluster.org/mailman/listinfo/gluster-devel

>      >

> 

> 

> 

> -- 

> Pranith

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-devel