Re: [PATCH 2/8] nfsd: fix CB_SEQUENCE error handling of NFS4ERR_{BADSLOT,BADSESSION,SEQ_MISORDERED}

Chuck Lever <chuck.lever@xxxxxxxxxx> · Fri, 24 Jan 2025 10:31:55 -0500

On 1/24/25 9:46 AM, Jeff Layton wrote:
On Fri, 2025-01-24 at 09:32 -0500, Chuck Lever wrote:
On 1/23/25 3:25 PM, Jeff Layton wrote:
The current error handling has some problems:

BADSLOT and BADSESSION: don't release the slot before retrying the call

SEQ_MISORDERED: does some sketchy resetting of the seqid? I can't find any
recommendation about doing that in the spec, and it seems wrong.

Random thought: You might use the Linux NFS client's forechannel session
implementation as a code reference.


Handle all three errors the same way: release the slot, but then handle
it just like we would as if we hadn't gotten a reply; mark the session
as faulty, and retry the call.

Some questions:

Why does it matter whether NFSD keeps the slot if both sides plan to
destroy the session?


It may not be required, but there is no reason to hold onto the slot in
these cases.

In the BADSLOT case, if the slot is released, then another session
consumer on the NFS server can use it and will encounter the same error.
Best to keep it in the penalty box, IMO.

If there are other slots, they are likely still usable. An
implementation can choose to continue using the session rather than
scuttling it immediately. In the past, with a single backchannel slot,
NFSD had no choice but to replace the session. But now it can be more
conservative.


Also, at this point, only nfsd has declared that it needs
a new session (see below).

If the client's backchannel service has returned BADSESSION, then the
client already knows this session is unusable.


Also, AFAICT marking CB_FAULT does not destroy the session, it simply
tries to recreate backchannel's rpc_clnt. Perhaps NFSD's callback code
should actively destroy the session and let the client drive a fresh
CREATE_SESSION to recover?


Marking it with a fault just sets the cl_cb_state to NFSD4_CB_FAULT.
Then, on the next SEQUENCE call, that makes nfsd set
SEQ4_STATUS_BACKCHANNEL_FAULT, which should make the client recreate
the session. Obviously, there is some delay involved there since we
might have to wait for the client to do a lease renewal before this
happens.


Fixes: 7ba6cad6c88f ("nfsd: New helper nfsd4_cb_sequence_done() for processing more cb errors")
Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>
---
   fs/nfsd/nfs4callback.c | 27 +++++++++++----------------
   1 file changed, 11 insertions(+), 16 deletions(-)

diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
index e12205ef16ca932ffbcc86d67b0817aec2436c89..bfc9de1fcb67b4f05ed2f7a28038cd8290809c17 100644
--- a/fs/nfsd/nfs4callback.c
+++ b/fs/nfsd/nfs4callback.c
@@ -1371,17 +1371,24 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback
   		nfsd4_mark_cb_fault(cb->cb_clp);
   		ret = false;
   		break;
+	case -NFS4ERR_BADSESSION:
+	case -NFS4ERR_BADSLOT:
+	case -NFS4ERR_SEQ_MISORDERED:
+		/*
+		 * These errors indicate that something has gone wrong
+		 * with the server and client's synchronization. Release
+		 * the slot, but handle it as if we hadn't gotten a reply.
+		 */
+		nfsd41_cb_release_slot(cb);
+		fallthrough;
   	case 1:
   		/*
   		 * cb_seq_status remains 1 if an RPC Reply was never
   		 * received. NFSD can't know if the client processed
   		 * the CB_SEQUENCE operation. Ask the client to send a
-		 * DESTROY_SESSION to recover.
+		 * DESTROY_SESSION to recover, but keep the slot.
   		 */
-		fallthrough;
-	case -NFS4ERR_BADSESSION:
   		nfsd4_mark_cb_fault(cb->cb_clp);
-		ret = false;
   		goto need_restart;
   	case -NFS4ERR_DELAY:
   		cb->cb_seq_status = 1;
@@ -1390,14 +1397,6 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback
   
   		rpc_delay(task, 2 * HZ);
   		return false;
-	case -NFS4ERR_BADSLOT:
-		goto retry_nowait;
-	case -NFS4ERR_SEQ_MISORDERED:
-		if (session->se_cb_seq_nr[cb->cb_held_slot] != 1) {
-			session->se_cb_seq_nr[cb->cb_held_slot] = 1;
-			goto retry_nowait;
-		}
-		break;
   	default:
   		nfsd4_mark_cb_fault(cb->cb_clp);
   	}
@@ -1405,10 +1404,6 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback
   	nfsd41_cb_release_slot(cb);
   out:
   	return ret;
-retry_nowait:
-	if (rpc_restart_call_prepare(task))
-		ret = false;
-	goto out;
   need_restart:
   	if (!test_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags)) {
   		trace_nfsd_cb_restart(clp, cb);






--
Chuck Lever