Currently the CB_RECALL_ANY job takes a cl_rpc_users reference to the client. While a callback job is technically an RPC, this has the effect of preventing the client from being unhashed. If nfsd decides to send a CB_RECALL_ANY just as the client reboots, we can end up in a situation where the callback can't complete on the (now dead) callback channel, but the new client also can't connect because the old client can't be unhashed. The job is only holding a reference to the client so it can clear a flag in the client after it completes. This patch attempts to alleviate this by having the job instead hold a reference to the cl_nfsdfs.cl_ref. Typically we only take that sort of reference when dealing with the nfsdfs info files, but it should work appropriately here too. Reported-by: Vladimir Benes <vbenes@xxxxxxxxxx> Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx> --- This patch seems to break the livelock in my testing. It's not the prettiest fix, but it's narrowly targeted and should be appropriate for 6.9-rc. Longer term, I think we need to rework how the nfs4_clients refcounts are managed, but that's a larger project. Many thanks to Vladimir for all his help with tracking this down! --- fs/nfsd/nfs4state.c | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c index 5fcd93f7cb8c..4311d608a297 100644 --- a/fs/nfsd/nfs4state.c +++ b/fs/nfsd/nfs4state.c @@ -3042,12 +3042,9 @@ static void nfsd4_cb_recall_any_release(struct nfsd4_callback *cb) { struct nfs4_client *clp = cb->cb_clp; - struct nfsd_net *nn = net_generic(clp->net, nfsd_net_id); - spin_lock(&nn->client_lock); clear_bit(NFSD4_CLIENT_CB_RECALL_ANY, &clp->cl_flags); - put_client_renew_locked(clp); - spin_unlock(&nn->client_lock); + drop_client(clp); } static int @@ -6613,10 +6610,12 @@ deleg_reaper(struct nfsd_net *nn) clp->cl_ra_time < 5)) { continue; } - list_add(&clp->cl_ra_cblist, &cblist); /* release in nfsd4_cb_recall_any_release */ - atomic_inc(&clp->cl_rpc_users); + if (!kref_get_unless_zero(&clp->cl_nfsdfs.cl_ref)) + continue; + + list_add(&clp->cl_ra_cblist, &cblist); set_bit(NFSD4_CLIENT_CB_RECALL_ANY, &clp->cl_flags); clp->cl_ra_time = ktime_get_boottime_seconds(); } --- base-commit: 05258a0a69b3c5d2c003f818702c0a52b6fea861 change-id: 20240405-rhel-31513-028ab6f14252 Best regards, -- Jeff Layton <jlayton@xxxxxxxxxx>