Re: nfsd4 laundromat_main hung tasks

Chuck Lever <chuck.lever@xxxxxxxxxx> · Mon, 13 Jan 2025 17:12:12 -0500

On 1/12/25 7:42 AM, Rik Theys wrote:
On Fri, Jan 10, 2025 at 11:07 PM Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:

On 1/10/25 3:51 PM, Rik Theys wrote:
Are there any debugging commands we can run once the issue happens
that can help to determine the cause of this issue?

Once the issue happens, the precipitating bug has already done its
damage, so at that point it is too late.

I've studied the code and bug reports a bit. I see one intriguing
mention in comment #5:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071562#5

/proc/130/stack:
[<0>] rpc_shutdown_client+0xf2/0x150 [sunrpc]
[<0>] nfsd4_process_cb_update+0x4c/0x270 [nfsd]
[<0>] nfsd4_run_cb_work+0x9f/0x150 [nfsd]
[<0>] process_one_work+0x1c7/0x380
[<0>] worker_thread+0x4d/0x380
[<0>] kthread+0xda/0x100
[<0>] ret_from_fork+0x22/0x30

This tells me that the active item on the callback_wq is waiting for the
backchannel RPC client to shut down. This is probably the proximal cause
of the callback workqueue stall.

rpc_shutdown_client() is waiting for the client's cl_tasks list to become
empty. Typically this is a short wait. But here, there are one or more
RPC requests that are not completing.
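
To check whether a hung server shows the same symptom as the stack above, something like the following could be used. This is a minimal sketch, not a tested diagnostic: the helper name tasks_blocked_in is hypothetical, the optional second argument exists only so the function can be pointed at a directory other than /proc, and reading /proc/<pid>/stack typically requires root.

```shell
# Sketch: print the PIDs whose kernel stack mentions a given symbol.
# tasks_blocked_in is a hypothetical helper, not part of any NFS tooling.
# $1 = symbol to look for, $2 = proc root (defaults to /proc).
tasks_blocked_in() {
    sym=$1
    root=${2:-/proc}
    for f in "$root"/[0-9]*/stack; do
        # Tasks can exit while we iterate, so skip unreadable entries.
        [ -r "$f" ] || continue
        if grep -q -- "$sym" "$f" 2>/dev/null; then
            pid=${f%/stack}
            printf '%s\n' "${pid##*/}"
        fi
    done
}
```

Running "tasks_blocked_in rpc_shutdown_client" as root would list any tasks currently waiting there.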

Please issue these two commands on your server once it gets into the
hung state:

# rpcdebug -m rpc -c
# echo t > /proc/sysrq-trigger

Then gift-wrap the server's system journal and send it to me. I need to
see only the output from these two commands, so if you want to
anonymize the journal and truncate it to just the day of the failure,
I think that should be fine.
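
If it helps, the capture could be scripted roughly as follows. This is only a sketch under assumptions that are not from this thread: it assumes the kernel log is available through journalctl, that it runs as root, and the function names, the two-second pause, and the default output path are all made up for illustration. Setting DRY_RUN=1 prints the commands instead of running them.

```shell
# Sketch: run the two commands and save the kernel log they produce.
# run and capture_nfsd_hang are hypothetical helper names.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"      # dry run: only show what would be executed
    else
        sh -c "$*"
    fi
}

capture_nfsd_hang() {
    out=${1:-/tmp/nfsd-hang.log}                # assumed output path
    start=$(date '+%Y-%m-%d %H:%M:%S')          # timestamp for --since
    run "rpcdebug -m rpc -c"
    run "echo t > /proc/sysrq-trigger"
    run "sleep 2"                               # assumed pause so the journal catches up
    run "journalctl -k --since '$start' > '$out'"
}
```

Calling "capture_nfsd_hang /root/nfsd-hang.log" on the hung server would leave just the kernel messages logged after the start timestamp in that file.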

--
Chuck Lever