Re: [PATCH 2/2] NFSv4: Fix dropped lock for racing OPEN and delegation return

Benjamin Coddington <bcodding@xxxxxxxxxx> · Thu, 29 Jun 2023 15:01:05 -0400

On 29 Jun 2023, at 14:33, Trond Myklebust wrote:

> On Tue, 2023-06-27 at 14:31 -0400, Benjamin Coddington wrote:
>> Commmit f5ea16137a3f ("NFSv4: Retry LOCK on OLD_STATEID during
>> delegation
>> return") attempted to solve this problem by using nfs4's generic
>> async error
>> handling, but introduced a regression where v4.0 lock recovery would
>> hang.
>> The additional complexity introduced by overloading that error
>> handling is
>> not necessary for this case.
>>
>> The problem as originally explained in the above commit is:
>>
>>     There's a small window where a LOCK sent during a delegation
>> return can
>>     race with another OPEN on client, but the open stateid has not
>> yet been
>>     updated.  In this case, the client doesn't handle the OLD_STATEID
>> error
>>     from the server and will lose this lock, emitting:
>>     "NFS: nfs4_handle_delegation_recall_error: unhandled error -
>> 10024".
>>
>> We want a fix that is much more focused to the original problem.  Fix
>> this
>> issue by returning -EAGAIN from the
>> nfs4_handle_delegation_recall_error() on
>> OLD_STATEID, and use that as a signal for the delegation return code
>> to
>> retry the LOCK operation.  We should at this point be able to send
>> along
>> the updated stateid.
>>
>> Signed-off-by: Benjamin Coddington <bcodding@xxxxxxxxxx>
>> ---
>>  fs/nfs/delegation.c | 4 +++-
>>  fs/nfs/nfs4proc.c   | 1 +
>>  2 files changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
>> index cf7365581031..23aeb02319a5 100644
>> --- a/fs/nfs/delegation.c
>> +++ b/fs/nfs/delegation.c
>> @@ -160,7 +160,9 @@ static int nfs_delegation_claim_locks(struct
>> nfs4_state *state, const nfs4_state
>>                 if (nfs_file_open_context(fl->fl_file)->state !=
>> state)
>>                         continue;
>>                 spin_unlock(&flctx->flc_lock);
>> -               status = nfs4_lock_delegation_recall(fl, state,
>> stateid);
>> +               do {
>> +                       status = nfs4_lock_delegation_recall(fl,
>> state, stateid);
>> +               } while (status == -EAGAIN);
>>                 if (status < 0)
>>                         goto out;
>>                 spin_lock(&flctx->flc_lock);
>> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
>> index 6bb14f6cfbc0..399db73a57f4 100644
>> --- a/fs/nfs/nfs4proc.c
>> +++ b/fs/nfs/nfs4proc.c
>> @@ -2262,6 +2262,7 @@ static int
>> nfs4_handle_delegation_recall_error(struct nfs_server *server, struct
>>                 case -NFS4ERR_BAD_HIGH_SLOT:
>>                 case -NFS4ERR_CONN_NOT_BOUND_TO_SESSION:
>>                 case -NFS4ERR_DEADSESSION:
>> +               case -NFS4ERR_OLD_STATEID:
>>                         return -EAGAIN;
>
> Hmm... Rather than issuing a blanket EAGAIN, we really should be
> looking at using either nfs4_refresh_lock_old_stateid() or
> nfs4_refresh_open_old_stateid(), depending on whether the stateid that
> saw the NFS4ERR_OLD_STATEID was a lock stateid or an open stateid.

I figured if there was client race that would trigger the OLD_STATEID on
open, we'd have heard from the "unhandled error" printk by now, so I rushed
this out..  But I also don't like the overloading for open error handling
here.  I'll work it up as you suggest.

The revert can go without a fix (IMO).  The fix is worse than original bug
which was really hard to hit.  I couldn't reproduce it without artificial
delays.

Ben