On 3/29/19 11:55 PM, Xavi Hernandez wrote:
Hi all,
there is one potential problem with posix locks when used in a
replicated or dispersed volume.
Some background:
Posix locks allow any process to lock a region of a file multiple times,
but a single unlock on a given region will release all previous locks.
Locked regions can be different for each lock request and they can
overlap. The resulting lock will cover the union of all locked regions.
A single unlock (the region doesn't necessarily need to match any of the
ranges used for locking) will create a "hole" in the currently locked
region, regardless of how many times lock requests covered that region.
For this reason, the locks xlator simply combines the locked regions
that are requested, but it doesn't track each individual lock range.
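To make these semantics concrete, here's a small stand-alone example
using plain fcntl() (nothing Gluster-specific here):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/testfile", O_RDWR | O_CREAT, 0600);
    struct flock fl = { 0 };

    fl.l_type   = F_WRLCK;
    fl.l_whence = SEEK_SET;

    /* First lock: bytes [0, 100). */
    fl.l_start = 0;
    fl.l_len   = 100;
    fcntl(fd, F_SETLK, &fl);

    /* Second lock: bytes [50, 150). It overlaps the first one, so
       the kernel merges both into a single locked region [0, 150). */
    fl.l_start = 50;
    fl.l_len   = 100;
    fcntl(fd, F_SETLK, &fl);

    /* A single unlock of [0, 150) releases everything, even though
       two lock requests were granted. Unlocking only [40, 60) would
       instead punch a hole, leaving [0, 40) and [60, 150) locked. */
    fl.l_type  = F_UNLCK;
    fl.l_start = 0;
    fl.l_len   = 150;
    fcntl(fd, F_SETLK, &fl);

    close(fd);
    return 0;
}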
Under normal circumstances this works fine. But there are some cases
where this behavior is not sufficient. For example, suppose we have a
replica 3 volume with quorum = 2. Given the special nature of posix
locks, AFR sends the lock request sequentially to each one of the
bricks, to avoid a situation where conflicting lock requests from other
clients force the client that has not yet got enough successful locks
(i.e. quorum) to unlock an already locked region. An unlock here would
not only cancel the current lock request; it would also cancel any
previously acquired lock.
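Just to illustrate the idea (this is not the real AFR code; the types
and the brick_lock()/brick_unlock() helpers are made up), sequential
locking with rollback looks conceptually like this:

#include <sys/types.h>

/* Conceptual sketch only. brick_t, region_t, brick_lock() and
   brick_unlock() are hypothetical stand-ins for the real AFR fops. */
typedef struct brick brick_t;
typedef struct { off_t start; off_t len; } region_t;

int brick_lock(brick_t *b, region_t r);
int brick_unlock(brick_t *b, region_t r);

int lock_with_quorum(brick_t *bricks, int nbricks, region_t r, int quorum)
{
    int granted = 0;

    /* Locks are sent one brick at a time, not in parallel. */
    for (int i = 0; i < nbricks; i++) {
        if (brick_lock(&bricks[i], r) == 0)
            granted++;
    }

    if (granted >= quorum)
        return 0;

    /* Quorum not reached: roll back. The problem is that this unlock
       also releases any lock this client had previously acquired on
       an overlapping region, because the locks xlator merged them. */
    for (int i = 0; i < nbricks; i++)
        brick_unlock(&bricks[i], r);

    return -1;
}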
I may not have fully understood, so please correct me. AFAIU, the lk
xlator merges locks only if both the lk-owner and the client opaque
match.
In the case you mentioned above, considering clientA acquired locks on
a majority of the quorum (say nodeA and nodeB) and clientB on nodeC
alone, clientB now has to unlock/cancel the lock it acquired on nodeC.
You are saying that it could pose a problem if there were already
successful locks taken by clientB for the same region, which would get
unlocked by this particular unlock request... right?
Assuming the previous locks acquired by clientB are shared (otherwise
clientA wouldn't have been granted a lock for the same region on nodeA &
nodeB), they would still hold on nodeA & nodeB, as the unlock request
was sent only to nodeC. Since the majority of quorum nodes still hold
clientB's locks, this isn't a serious issue IMO.
I haven't looked into the heal part, but I would like to understand
whether this is really an issue in normal scenarios as well.
Thanks,
Soumya
However, when something goes wrong (a brick dies during a lock request,
or there's a network partition or some other weird situation), it could
happen that, even using sequential locking, the lock request succeeds on
only one brick. In this case, AFR should cancel the previous lock (and it
does), but this also cancels any previously acquired lock on that
region, which is not good.
A similar thing can happen if we try to recover (heal) posix locks that
were active on a brick that has been disconnected (for any reason) and
then reconnected.
To fix all these situations we need to change the way posix locks are
managed by the locks xlator. One possibility would be to embed the lock
request inside an inode transaction using inodelk. Since inodelks do not
suffer from this problem, the following posix lock could be sent safely.
However this implies an additional network request, which could cause
some performance impact. Eager-locking could minimize the impact in some
cases. However this approach won't work for lock recovery after a
disconnect.
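Just for clarity, the idea would be something like this (all helper
names here are made up; this is not the real inodelk API):

#include <sys/types.h>

/* Sketch with hypothetical helpers: the posix lock is wrapped in an
   inodelk "transaction". */
typedef struct inode inode_t;
typedef struct { off_t start; off_t len; } region_t;

int take_inodelk(inode_t *inode);
int release_inodelk(inode_t *inode);
int send_posix_lock(inode_t *inode, region_t r);

int lock_via_inodelk(inode_t *inode, region_t r)
{
    int ret;

    /* The inodelk serializes all clients locking this inode, so no
       conflicting posix lock request can interleave with ours. */
    if (take_inodelk(inode) != 0)
        return -1;

    ret = send_posix_lock(inode, r);

    /* These extra round-trips are the performance cost mentioned
       above. */
    release_inodelk(inode);

    return ret;
}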
Another possibility is to send a special partial posix lock request
which won't be immediately merged with already existing locks once
granted. An additional confirmation request of the partial posix lock
will be required to fully grant the current lock and merge it with the
existing ones. This requires a new network request, which will add
latency, and makes everything more complex since there would be more
combinations of states in which something could fail.
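As a rough sketch (again, every name here is invented), each lock would
carry an extra state and a second round-trip to confirm it:

#include <sys/types.h>

/* Sketch only; all names are hypothetical. */
typedef enum {
    LK_PARTIAL,   /* granted on this brick, not merged yet */
    LK_GRANTED    /* confirmed by the client and merged    */
} lk_state_t;

typedef struct {
    off_t      start;
    off_t      len;
    lk_state_t state;
} posix_lock_t;

/* Phase 1: grant the lock but keep it separate from existing ones. */
void lk_grant_partial(posix_lock_t *lk)
{
    lk->state = LK_PARTIAL;
}

/* Phase 2 (the extra network round-trip): confirm and merge. Any
   failure between the two phases creates one more combination of
   states to handle, which is what makes this approach complex. */
void lk_confirm(posix_lock_t *lk)
{
    lk->state = LK_GRANTED;
    /* ... merge with the already existing locks here ... */
}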
So I think one possible solution would be the following:
1. Keep each posix lock as an independent object in locks xlator. This
will make it possible to "invalidate" any already granted lock without
affecting already established locks.
2. Additionally, we'll keep a sorted list of non-overlapping segments of
locked regions, and we'll count, for each segment, how many locks are
referencing it. One lock can reference multiple segments, and each
segment can be referenced by multiple locks (see the sketch after this
list).
3. An additional lock request that overlaps with an existing segment can
cause this segment to be split to satisfy the non-overlapping property.
4. When an unlock request is received, all segments intersecting with
the region are eliminated (this may require some segment splits at the
edges), and the unlocked region is subtracted from each lock associated
with those segments. If a lock's region becomes empty, the lock is
removed.
5. We'll create a special "remove lock" request that doesn't unlock a
region but removes an already granted lock. This will decrease the
number of references to each of the segments this lock was covering. If
some segment's reference count reaches 0, it's removed; otherwise it
remains. This
special request will only be used internally to cancel already acquired
locks that cannot be fully granted due to quorum issues or any other
problem.
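To make points 1-5 more concrete, here's a rough sketch of the
structures and the split operation (nothing of this exists yet; all
names are invented):

#include <stdlib.h>
#include <sys/types.h>

/* Point 1: each granted posix lock is kept as an independent object. */
typedef struct lock {
    off_t        start;
    off_t        len;        /* the originally requested range */
    struct lock *next;
} lock_t;

/* Point 2: sorted, non-overlapping segments of the locked region.
   'refs' counts how many lock_t objects cover this segment. */
typedef struct segment {
    off_t           start;
    off_t           end;
    int             refs;
    struct segment *next;    /* kept sorted by start offset */
} segment_t;

/* Point 3: split a segment at 'off' so that a new lock can overlap
   only part of it. Both halves inherit the current reference count. */
static segment_t *segment_split(segment_t *seg, off_t off)
{
    segment_t *right = malloc(sizeof(*right));

    if (right == NULL)
        return NULL;

    right->start = off;
    right->end   = seg->end;
    right->refs  = seg->refs;
    right->next  = seg->next;

    seg->end  = off;
    seg->next = right;

    return right;
}

/* Point 4 would subtract the unlocked range from every lock that
   references the intersecting segments. Point 5, the "remove lock"
   request, would only decrement 'refs' on the segments covered by one
   lock, deleting the segments that reach 0, so other locks on the same
   region survive. */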
In some degenerate cases, the list of segments can become huge (many
locks overlapping by only a single byte, so that each segment represents
a single byte). We can try to find some smarter structure that minimizes
this problem, or limit the number of segments (for example, returning
ENOLCK when there are too many).
What do you think ?
Xavi
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-devel