Re: NFSD threads hang when destroying a session or client ID

Chuck Lever <chuck.lever@xxxxxxxxxx> · Thu, 23 Jan 2025 09:22:30 -0500

On 1/23/25 8:50 AM, Jeff Layton via Bugspray Bot wrote:
Jeff Layton writes via Kernel.org Bugzilla:

There is another scenario that could explain a hang here. From nfsd4_cb_sequence_done():

------------------8<---------------------
         case -NFS4ERR_BADSLOT:
                 goto retry_nowait;
         case -NFS4ERR_SEQ_MISORDERED:
                 if (session->se_cb_seq_nr[cb->cb_held_slot] != 1) {
                         session->se_cb_seq_nr[cb->cb_held_slot] = 1;
                         goto retry_nowait;
                 }
                 break;
         default:
                 nfsd4_mark_cb_fault(cb->cb_clp);
         }
         trace_nfsd_cb_free_slot(task, cb);
         nfsd41_cb_release_slot(cb);

         if (RPC_SIGNALLED(task))
                 goto need_restart;
out:
         return ret;
retry_nowait:
         if (rpc_restart_call_prepare(task))
                 ret = false;
         goto out;
------------------8<---------------------

Since it doesn't check RPC_SIGNALLED in the v4.1+ case until very late in the function, it's possible to get a BADSLOT or SEQ_MISORDERED error that causes the callback client to immediately resubmit the rpc_task to the RPC engine without resubmitting to the callback workqueue.

I think that we should assume the result is suspect whenever RPC_SIGNALLED returns true, halt further processing of the CB_SEQUENCE response, and restart the callback.

When cb->cb_seq_status is set to any value other than 1, that means the
client replied successfully. RPC_SIGNALLED has nothing to do with
whether a reply is suspect; it means only that the rpc_clnt has been
asked to terminate.

The potential loop you noticed is concerning, but I haven't seen
evidence in the "echo t > sysrq-trigger" output that there is a running
RPC such as you described here.
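
For reference, the ordering question raised in the quoted excerpt can be
modelled in user space. This is only a sketch, not kernel code and not a
proposed patch: every model_* name, the enum values, and the signalled
flag (standing in for RPC_SIGNALLED(task)) are hypothetical, and the
sketch takes no position on whether checking the signal first is the
right fix.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the CB_SEQUENCE status values; these are
 * NOT the real NFS4ERR_* numbers. */
enum model_status { MODEL_BADSLOT, MODEL_SEQ_MISORDERED, MODEL_OTHER };

/* What the completion handler decides to do with the task. */
enum model_action { MODEL_DONE, MODEL_RETRY_NOWAIT, MODEL_NEED_RESTART };

struct model_cb {
	unsigned int seq_nr;	/* models se_cb_seq_nr[cb_held_slot] */
	bool signalled;		/* models RPC_SIGNALLED(task) */
};

/* Ordering as in the quoted excerpt: the status is dispatched first and
 * the signal is examined only if no case jumped to retry_nowait. */
static enum model_action
model_done_late_check(struct model_cb *cb, enum model_status st)
{
	switch (st) {
	case MODEL_BADSLOT:
		return MODEL_RETRY_NOWAIT;
	case MODEL_SEQ_MISORDERED:
		if (cb->seq_nr != 1) {
			cb->seq_nr = 1;
			return MODEL_RETRY_NOWAIT;
		}
		break;
	default:
		break;
	}
	if (cb->signalled)
		return MODEL_NEED_RESTART;
	return MODEL_DONE;
}

/* Alternative ordering from the quoted suggestion: look at the signal
 * before interpreting the status at all. */
static enum model_action
model_done_early_check(struct model_cb *cb, enum model_status st)
{
	if (cb->signalled)
		return MODEL_NEED_RESTART;
	return model_done_late_check(cb, st);
}
```

With signalled set and a BADSLOT status, the late-check variant returns
MODEL_RETRY_NOWAIT while the early-check variant returns
MODEL_NEED_RESTART. The model says nothing about whether such a task is
actually present in the sysrq output.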

--
Chuck Lever