On Fri, 2024-04-05 at 13:00 -0400, Jeff Layton wrote: > Currently the CB_RECALL_ANY job takes a cl_rpc_users reference to the > client. While a callback job is technically an RPC, this has the effect > of preventing the client from being unhashed. > > If nfsd decides to send a CB_RECALL_ANY just as the client reboots, we > can end up in a situation where the callback can't complete on the (now > dead) callback channel, but the new client also can't connect because > the old client can't be unhashed. > > The job is only holding a reference to the client so it can clear a flag > in the client after it completes. This patch attempts to alleviate this > by having the job instead hold a reference to the cl_nfsdfs.cl_ref. > > Typically we only take that sort of reference when dealing with the > nfsdfs info files, but it should work appropriately here too. > This should probably also have: Fixes: 44df6f439a17 NFSD: add delegation reaper to react to low memory condition > Reported-by: Vladimir Benes <vbenes@xxxxxxxxxx> > Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx> > --- > This patch seems to break the livelock in my testing. It's not the > prettiest fix, but it's narrowly targeted and should be appropriate for > 6.9-rc. Longer term, I think we need to rework how the nfs4_clients > refcounts are managed, but that's a larger project. > > Many thanks to Vladimir for all his help with tracking this down! > --- > fs/nfsd/nfs4state.c | 11 +++++------ > 1 file changed, 5 insertions(+), 6 deletions(-) > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c > index 5fcd93f7cb8c..4311d608a297 100644 > --- a/fs/nfsd/nfs4state.c > +++ b/fs/nfsd/nfs4state.c > @@ -3042,12 +3042,9 @@ static void > nfsd4_cb_recall_any_release(struct nfsd4_callback *cb) > { > struct nfs4_client *clp = cb->cb_clp; > - struct nfsd_net *nn = net_generic(clp->net, nfsd_net_id); > > - spin_lock(&nn->client_lock); > clear_bit(NFSD4_CLIENT_CB_RECALL_ANY, &clp->cl_flags); > - put_client_renew_locked(clp); > - spin_unlock(&nn->client_lock); > + drop_client(clp); > } > > static int > @@ -6613,10 +6610,12 @@ deleg_reaper(struct nfsd_net *nn) > clp->cl_ra_time < 5)) { > continue; > } > - list_add(&clp->cl_ra_cblist, &cblist); > > /* release in nfsd4_cb_recall_any_release */ > - atomic_inc(&clp->cl_rpc_users); > + if (!kref_get_unless_zero(&clp->cl_nfsdfs.cl_ref)) > + continue; > + > + list_add(&clp->cl_ra_cblist, &cblist); > set_bit(NFSD4_CLIENT_CB_RECALL_ANY, &clp->cl_flags); > clp->cl_ra_time = ktime_get_boottime_seconds(); > } > > --- > base-commit: 05258a0a69b3c5d2c003f818702c0a52b6fea861 > change-id: 20240405-rhel-31513-028ab6f14252 > > Best regards, -- Jeff Layton <jlayton@xxxxxxxxxx>