On Fri, 2025-01-24 at 09:32 -0500, Chuck Lever wrote: > On 1/23/25 3:25 PM, Jeff Layton wrote: > > The current error handling has some problems: > > > > BADSLOT and BADSESSION: don't release the slot before retrying the call > > > > SEQ_MISORDERED: does some sketchy resetting of the seqid? I can't find any > > recommendation about doing that in the spec, and it seems wrong. > > Random thought: You might use the Linux NFS client's forechannel session > implementation as a code reference. > > > > Handle all three errors the same way: release the slot, but then handle > > it just like we would as if we hadn't gotten a reply; mark the session > > as faulty, and retry the call. > > Some questions: > > Why does it matter whether NFSD keeps the slot if both sides plan to > destroy the session? > It may not be required, but there is no reason to hold onto the slot in these cases. Also, at this point, only nfsd has declared that it needs a new session (see below). > Also, AFAICT marking CB_FAULT does not destroy the session, it simply > tries to recreate backchannel's rpc_clnt. Perhaps NFSD's callback code > should actively destroy the session and let the client drive a fresh > CREATE_SESSION to recover? > Marking it with a fault just sets the cl_cb_state to NFSD4_CB_FAULT. Then, on the next SEQUENCE call, that makes nfsd set SEQ4_STATUS_BACKCHANNEL_FAULT, which should make the client recreate the session. Obviously, there is some delay involved there since we might have to wait for the client to do a lease renewal before this happens. > > > Fixes: 7ba6cad6c88f ("nfsd: New helper nfsd4_cb_sequence_done() for processing more cb errors") > > Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx> > > --- > > fs/nfsd/nfs4callback.c | 27 +++++++++++---------------- > > 1 file changed, 11 insertions(+), 16 deletions(-) > > > > diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c > > index e12205ef16ca932ffbcc86d67b0817aec2436c89..bfc9de1fcb67b4f05ed2f7a28038cd8290809c17 100644 > > --- a/fs/nfsd/nfs4callback.c > > +++ b/fs/nfsd/nfs4callback.c > > @@ -1371,17 +1371,24 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback > > nfsd4_mark_cb_fault(cb->cb_clp); > > ret = false; > > break; > > + case -NFS4ERR_BADSESSION: > > + case -NFS4ERR_BADSLOT: > > + case -NFS4ERR_SEQ_MISORDERED: > > + /* > > + * These errors indicate that something has gone wrong > > + * with the server and client's synchronization. Release > > + * the slot, but handle it as if we hadn't gotten a reply. > > + */ > > + nfsd41_cb_release_slot(cb); > > + fallthrough; > > case 1: > > /* > > * cb_seq_status remains 1 if an RPC Reply was never > > * received. NFSD can't know if the client processed > > * the CB_SEQUENCE operation. Ask the client to send a > > - * DESTROY_SESSION to recover. > > + * DESTROY_SESSION to recover, but keep the slot. > > */ > > - fallthrough; > > - case -NFS4ERR_BADSESSION: > > nfsd4_mark_cb_fault(cb->cb_clp); > > - ret = false; > > goto need_restart; > > case -NFS4ERR_DELAY: > > cb->cb_seq_status = 1; > > @@ -1390,14 +1397,6 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback > > > > rpc_delay(task, 2 * HZ); > > return false; > > - case -NFS4ERR_BADSLOT: > > - goto retry_nowait; > > - case -NFS4ERR_SEQ_MISORDERED: > > - if (session->se_cb_seq_nr[cb->cb_held_slot] != 1) { > > - session->se_cb_seq_nr[cb->cb_held_slot] = 1; > > - goto retry_nowait; > > - } > > - break; > > default: > > nfsd4_mark_cb_fault(cb->cb_clp); > > } > > @@ -1405,10 +1404,6 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback > > nfsd41_cb_release_slot(cb); > > out: > > return ret; > > -retry_nowait: > > - if (rpc_restart_call_prepare(task)) > > - ret = false; > > - goto out; > > need_restart: > > if (!test_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags)) { > > trace_nfsd_cb_restart(clp, cb); > > > > -- Jeff Layton <jlayton@xxxxxxxxxx>