On Sat, 2025-01-25 at 11:24 -0500, Chuck Lever wrote: > On 1/23/25 3:25 PM, Jeff Layton wrote: > > This is problematic, since the RPC might have been entirely successful. > > There is no point in restarting a v4.1+ callback just because > > RPC_SIGNALLED is true. The v4.1+ error handling has other mechanisms for > > detecting when it should retransmit the RPC. > > > > Fixes: 7ba6cad6c88f ("nfsd: New helper nfsd4_cb_sequence_done() for processing more cb errors") > > Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx> > > --- > > fs/nfsd/nfs4callback.c | 3 --- > > 1 file changed, 3 deletions(-) > > > > diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c > > index 50e468bdb8d4838b5217346dcc2bd0fec1765c1a..e12205ef16ca932ffbcc86d67b0817aec2436c89 100644 > > --- a/fs/nfsd/nfs4callback.c > > +++ b/fs/nfsd/nfs4callback.c > > @@ -1403,9 +1403,6 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback > > } > > trace_nfsd_cb_free_slot(task, cb); > > nfsd41_cb_release_slot(cb); > > - > > - if (RPC_SIGNALLED(task)) > > - goto need_restart; > > out: > > return ret; > > retry_nowait: > > > > I too am skeptical about this logic, but I don't entirely understand it > yet. More importantly, though, I don't recall seeing (mis)behavior that > I can directly attribute to it, so I can't yet confirm or deny your > assertion that "This is problematic". > I haven't seen behavior that I can directly attribute to this either, but we have seen a number of strange panics and weird behaviors in the callback code over the years that may be related. At this point, I think you're correct that we will probably need to do more than just small, incremental changes here. > Before making a code change here, let's gather a little evidence of a > real problem. For instance, we might want to replace this logic with > something better rather than wholesale removing it. > > You might start by enabling aggressive disconnect injection to see how > backchannel recovery works (or that it doesn't work!). I'm trying this > on my kdevops NFSD while running fstests: > > cd /sys/kernel/debug/fail_sunrpc/ > echo Y > ignore-cache-wait > echo Y > ignore-client-disconnect > echo 24847 > interval > echo 97 > times > echo 100 > probability > > Unfortunately, I've found an even bigger problem in the callback code. It accesses the clp->cl_cb_session pointer when processing the call and reply, but that pointer doesn't imply a reference and nothing else ensures that the nfsd4_session object will stick around while this happens. I think a callback can race with a DESTROY_SESSION and cause a UAF. I started working on patches to fix this up, but it's a bit complex and will take some time. Please don't apply any of these until I get a better picture of what will need to be changed. Stay tuned! -- Jeff Layton <jlayton@xxxxxxxxxx>