On Fri, 2025-01-24 at 10:31 -0500, Chuck Lever wrote: > On 1/24/25 9:46 AM, Jeff Layton wrote: > > On Fri, 2025-01-24 at 09:32 -0500, Chuck Lever wrote: > > > On 1/23/25 3:25 PM, Jeff Layton wrote: > > > > The current error handling has some problems: > > > > > > > > BADSLOT and BADSESSION: don't release the slot before retrying the call > > > > > > > > SEQ_MISORDERED: does some sketchy resetting of the seqid? I can't find any > > > > recommendation about doing that in the spec, and it seems wrong. > > > > > > Random thought: You might use the Linux NFS client's forechannel session > > > implementation as a code reference. > > > > > > > > > > Handle all three errors the same way: release the slot, but then handle > > > > it just like we would as if we hadn't gotten a reply; mark the session > > > > as faulty, and retry the call. > > > > > > Some questions: > > > > > > Why does it matter whether NFSD keeps the slot if both sides plan to > > > destroy the session? > > > > > > > It may not be required, but there is no reason to hold onto the slot in > > these cases. > > In the BADSLOT case, if the slot is released, then another session > consumer on the NFS server can use it and will encounter the same error. > Best to keep it in the penalty box, IMO. > > If there are other slots, they are likely still usable. An > implementation can choose to continue using the session rather than > scuttling it immediately. In the past, with a single backchannel slot, > NFSD had no choice but to replace the session. But now it can be more > conservative. > It may still be usable, but if we got BADSLOT, then it's still indicative of a faulty session. Best to mark it as such, IMO. It's worth noting that these things should really _never_ happen, so being somewhat heavy-handed in this situation seems reasonable. > > > Also, at this point, only nfsd has declared that it needs > > a new session (see below). > > If the client's backchannel service has returned BADSESSION, then the > client already knows this session is unusable. > Fair point. I still think it's reasonable to mark it CB_FAULT in that case. That status is also displayed via the client_info file, so that's helpful info. > > > > Also, AFAICT marking CB_FAULT does not destroy the session, it simply > > > tries to recreate backchannel's rpc_clnt. Perhaps NFSD's callback code > > > should actively destroy the session and let the client drive a fresh > > > CREATE_SESSION to recover? > > > > > > > Marking it with a fault just sets the cl_cb_state to NFSD4_CB_FAULT. > > Then, on the next SEQUENCE call, that makes nfsd set > > SEQ4_STATUS_BACKCHANNEL_FAULT, which should make the client recreate > > the session. Obviously, there is some delay involved there since we > > might have to wait for the client to do a lease renewal before this > > happens. > > > > > > > > > Fixes: 7ba6cad6c88f ("nfsd: New helper nfsd4_cb_sequence_done() for processing more cb errors") > > > > Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx> > > > > --- > > > > fs/nfsd/nfs4callback.c | 27 +++++++++++---------------- > > > > 1 file changed, 11 insertions(+), 16 deletions(-) > > > > > > > > diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c > > > > index e12205ef16ca932ffbcc86d67b0817aec2436c89..bfc9de1fcb67b4f05ed2f7a28038cd8290809c17 100644 > > > > --- a/fs/nfsd/nfs4callback.c > > > > +++ b/fs/nfsd/nfs4callback.c > > > > @@ -1371,17 +1371,24 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback > > > > nfsd4_mark_cb_fault(cb->cb_clp); > > > > ret = false; > > > > break; > > > > + case -NFS4ERR_BADSESSION: > > > > + case -NFS4ERR_BADSLOT: > > > > + case -NFS4ERR_SEQ_MISORDERED: > > > > + /* > > > > + * These errors indicate that something has gone wrong > > > > + * with the server and client's synchronization. Release > > > > + * the slot, but handle it as if we hadn't gotten a reply. > > > > + */ > > > > + nfsd41_cb_release_slot(cb); > > > > + fallthrough; > > > > case 1: > > > > /* > > > > * cb_seq_status remains 1 if an RPC Reply was never > > > > * received. NFSD can't know if the client processed > > > > * the CB_SEQUENCE operation. Ask the client to send a > > > > - * DESTROY_SESSION to recover. > > > > + * DESTROY_SESSION to recover, but keep the slot. > > > > */ > > > > - fallthrough; > > > > - case -NFS4ERR_BADSESSION: > > > > nfsd4_mark_cb_fault(cb->cb_clp); > > > > - ret = false; > > > > goto need_restart; > > > > case -NFS4ERR_DELAY: > > > > cb->cb_seq_status = 1; > > > > @@ -1390,14 +1397,6 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback > > > > > > > > rpc_delay(task, 2 * HZ); > > > > return false; > > > > - case -NFS4ERR_BADSLOT: > > > > - goto retry_nowait; > > > > - case -NFS4ERR_SEQ_MISORDERED: > > > > - if (session->se_cb_seq_nr[cb->cb_held_slot] != 1) { > > > > - session->se_cb_seq_nr[cb->cb_held_slot] = 1; > > > > - goto retry_nowait; > > > > - } > > > > - break; > > > > default: > > > > nfsd4_mark_cb_fault(cb->cb_clp); > > > > } > > > > @@ -1405,10 +1404,6 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback > > > > nfsd41_cb_release_slot(cb); > > > > out: > > > > return ret; > > > > -retry_nowait: > > > > - if (rpc_restart_call_prepare(task)) > > > > - ret = false; > > > > - goto out; > > > > need_restart: > > > > if (!test_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags)) { > > > > trace_nfsd_cb_restart(clp, cb); > > > > > > > > > > > > > > -- Jeff Layton <jlayton@xxxxxxxxxx>