On 29 Jun 2023, at 14:33, Trond Myklebust wrote: > On Tue, 2023-06-27 at 14:31 -0400, Benjamin Coddington wrote: >> Commmit f5ea16137a3f ("NFSv4: Retry LOCK on OLD_STATEID during >> delegation >> return") attempted to solve this problem by using nfs4's generic >> async error >> handling, but introduced a regression where v4.0 lock recovery would >> hang. >> The additional complexity introduced by overloading that error >> handling is >> not necessary for this case. >> >> The problem as originally explained in the above commit is: >> >> There's a small window where a LOCK sent during a delegation >> return can >> race with another OPEN on client, but the open stateid has not >> yet been >> updated. In this case, the client doesn't handle the OLD_STATEID >> error >> from the server and will lose this lock, emitting: >> "NFS: nfs4_handle_delegation_recall_error: unhandled error - >> 10024". >> >> We want a fix that is much more focused to the original problem. Fix >> this >> issue by returning -EAGAIN from the >> nfs4_handle_delegation_recall_error() on >> OLD_STATEID, and use that as a signal for the delegation return code >> to >> retry the LOCK operation. We should at this point be able to send >> along >> the updated stateid. >> >> Signed-off-by: Benjamin Coddington <bcodding@xxxxxxxxxx> >> --- >> fs/nfs/delegation.c | 4 +++- >> fs/nfs/nfs4proc.c | 1 + >> 2 files changed, 4 insertions(+), 1 deletion(-) >> >> diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c >> index cf7365581031..23aeb02319a5 100644 >> --- a/fs/nfs/delegation.c >> +++ b/fs/nfs/delegation.c >> @@ -160,7 +160,9 @@ static int nfs_delegation_claim_locks(struct >> nfs4_state *state, const nfs4_state >> if (nfs_file_open_context(fl->fl_file)->state != >> state) >> continue; >> spin_unlock(&flctx->flc_lock); >> - status = nfs4_lock_delegation_recall(fl, state, >> stateid); >> + do { >> + status = nfs4_lock_delegation_recall(fl, >> state, stateid); >> + } while (status == -EAGAIN); >> if (status < 0) >> goto out; >> spin_lock(&flctx->flc_lock); >> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c >> index 6bb14f6cfbc0..399db73a57f4 100644 >> --- a/fs/nfs/nfs4proc.c >> +++ b/fs/nfs/nfs4proc.c >> @@ -2262,6 +2262,7 @@ static int >> nfs4_handle_delegation_recall_error(struct nfs_server *server, struct >> case -NFS4ERR_BAD_HIGH_SLOT: >> case -NFS4ERR_CONN_NOT_BOUND_TO_SESSION: >> case -NFS4ERR_DEADSESSION: >> + case -NFS4ERR_OLD_STATEID: >> return -EAGAIN; > > Hmm... Rather than issuing a blanket EAGAIN, we really should be > looking at using either nfs4_refresh_lock_old_stateid() or > nfs4_refresh_open_old_stateid(), depending on whether the stateid that > saw the NFS4ERR_OLD_STATEID was a lock stateid or an open stateid. I figured if there was client race that would trigger the OLD_STATEID on open, we'd have heard from the "unhandled error" printk by now, so I rushed this out.. But I also don't like the overloading for open error handling here. I'll work it up as you suggest. The revert can go without a fix (IMO). The fix is worse than original bug which was really hard to hit. I couldn't reproduce it without artificial delays. Ben