On 1/14/25 10:30 AM, Rik Theys wrote:
Hi,
On Tue, Jan 14, 2025 at 3:51 PM Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
On 1/14/25 3:23 AM, Rik Theys wrote:
Hi,
On Mon, Jan 13, 2025 at 11:12 PM Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
On 1/12/25 7:42 AM, Rik Theys wrote:
On Fri, Jan 10, 2025 at 11:07 PM Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
On 1/10/25 3:51 PM, Rik Theys wrote:
Are there any debugging commands we can run once the issue happens
that can help to determine the cause of this issue?
Once the issue happens, the precipitating bug has already done its
damage, so at that point it is too late.
I've studied the code and bug reports a bit. I see one intriguing
mention in comment #5:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071562#5
/proc/130/stack:
[<0>] rpc_shutdown_client+0xf2/0x150 [sunrpc]
[<0>] nfsd4_process_cb_update+0x4c/0x270 [nfsd]
[<0>] nfsd4_run_cb_work+0x9f/0x150 [nfsd]
[<0>] process_one_work+0x1c7/0x380
[<0>] worker_thread+0x4d/0x380
[<0>] kthread+0xda/0x100
[<0>] ret_from_fork+0x22/0x30
This tells me that the active item on the callback_wq is waiting for the
backchannel RPC client to shut down. This is probably the proximal cause
of the callback workqueue stall.
rpc_shutdown_client() is waiting for the client's cl_tasks to become
empty. Typically this is a short wait. But here, there's one or more RPC
requests that are not completing.
Please issue these two commands on your server once it gets into the
hung state:
# rpcdebug -m rpc -c
# echo t > /proc/sysrq-trigger
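For reference, the capture step could look like this (a sketch assuming root on the server; as I understand it, any write to /proc/sys/sunrpc/rpc_debug also dumps the list of pending RPC tasks to the kernel log, so the first command is useful even with no debug flags set):

```shell
# Sketch of the debug capture (assumes root on the NFS server).
rpcdebug -m rpc -c              # clears rpc debug flags; the write also dumps pending RPC tasks
echo t > /proc/sysrq-trigger    # dumps every task's stack to the kernel log
dmesg | tail -n 200             # inspect the freshly dumped state
```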
There were no rpcdebug entries configured, so I don't think the first
command did much.
You can find the output from the second command in attach.
I don't see any output for "echo t > /proc/sysrq-trigger" here. What I
do see is a large number of OOM-killer notices. So, your server is out
of memory. But I think this is due to a memory leak bug, probably this
one:
I'm confused. Where do you see the OOM-killer notices in the log I provided?
Never mind: my editor opened an old file at the same time. The window
with your dump was on another screen.
Looking at your journal now.
The first lines of the file are from after increasing hung_task_warnings
and waiting a few minutes. This triggered the hung-task warning for the
nfsd4 laundromat_main workqueue:
Workqueue: nfsd4 laundromat_main [nfsd]
Jan 14 09:06:45 kwak.esat.kuleuven.be kernel: Call Trace:
Then I executed the commands you provided. Aren't the lines after the
Jan 14 09:07:00 kwak.esat.kuleuven.be kernel: sysrq: Show State
line the output you're looking for?
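To isolate that portion of the journal, everything from the "Show State" marker onward can be extracted like this (a minimal sketch; the sample lines below are hypothetical stand-ins for the real journal):

```shell
# Locate the SysRq task-state dump in a saved journal excerpt.
# The sample lines are hypothetical stand-ins for real journal output.
cat > /tmp/journal-excerpt.txt <<'EOF'
Jan 14 09:06:45 kwak.esat.kuleuven.be kernel: Call Trace:
Jan 14 09:07:00 kwak.esat.kuleuven.be kernel: sysrq: Show State
Jan 14 09:07:00 kwak.esat.kuleuven.be kernel:  task:nfsd state:D pid:1234
EOF
# Print everything from the "Show State" marker to the end of the file:
sed -n '/sysrq: Show State/,$p' /tmp/journal-excerpt.txt
```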
Regards,
Rik
https://bugzilla.kernel.org/show_bug.cgi?id=219535
So that explains the source of the frequent deleg_reaper() calls on your
system; it's the shrinker. (Note that these calls are not the actual
problem. The real bug is somewhere in the callback code, but frequent
callbacks make it easy to hit.)
Please try again, but wait until you see "INFO: task nfsd:XXXX blocked
for more than 120 seconds." in the journal before issuing the rpcdebug
and "echo t" commands.
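The retry could be scripted roughly like this (a sketch assuming root on the server; the grep pattern matches the kernel's default hung-task wording, and the journal follow may wait indefinitely until the warning appears):

```shell
# Wait for the first hung-task warning, then capture debug state.
journalctl -kf --no-pager | grep -m1 "blocked for more than 120 seconds"
rpcdebug -m rpc -c            # the write to rpc_debug dumps pending RPC tasks
echo t > /proc/sysrq-trigger  # dump every task's stack to the kernel log
```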
Regards,
Rik
Then gift-wrap the server's system journal and send it to me. I need to
see only the output from these two commands, so if you want to
anonymize the journal and truncate it to just the day of the failure,
I think that should be fine.
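Truncating the journal to the failure day might look like this (a sketch; the dates are placeholders for the actual day of the hang):

```shell
# Export only the failure day's kernel messages for sharing.
# Placeholder dates: substitute the actual day of the failure.
journalctl -k --since "2025-01-14" --until "2025-01-15" --no-pager > nfsd-hang-journal.txt
```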
--
Chuck Lever