On 1/14/25 11:10 AM, Chuck Lever wrote:
On 1/14/25 10:30 AM, Rik Theys wrote:
Hi,
On Tue, Jan 14, 2025 at 3:51 PM Chuck Lever <chuck.lever@xxxxxxxxxx>
wrote:
On 1/14/25 3:23 AM, Rik Theys wrote:
Hi,
On Mon, Jan 13, 2025 at 11:12 PM Chuck Lever
<chuck.lever@xxxxxxxxxx> wrote:
On 1/12/25 7:42 AM, Rik Theys wrote:
On Fri, Jan 10, 2025 at 11:07 PM Chuck Lever
<chuck.lever@xxxxxxxxxx> wrote:
On 1/10/25 3:51 PM, Rik Theys wrote:
Are there any debugging commands we can run once the issue happens
that can help to determine the cause of this issue?
Once the issue happens, the precipitating bug has already done its
damage, so at that point it is too late.
I've studied the code and bug reports a bit. I see one intriguing
mention in comment #5:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071562#5
/proc/130/stack:
[<0>] rpc_shutdown_client+0xf2/0x150 [sunrpc]
[<0>] nfsd4_process_cb_update+0x4c/0x270 [nfsd]
[<0>] nfsd4_run_cb_work+0x9f/0x150 [nfsd]
[<0>] process_one_work+0x1c7/0x380
[<0>] worker_thread+0x4d/0x380
[<0>] kthread+0xda/0x100
[<0>] ret_from_fork+0x22/0x30
This tells me that the active item on the callback_wq is waiting for
the backchannel RPC client to shut down. This is probably the proximal
cause of the callback workqueue stall.
rpc_shutdown_client() is waiting for the client's cl_tasks list to
become empty. Typically this is a short wait. But here, one or more
RPC requests are not completing.
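For reference, the shutdown loop looks roughly like this (paraphrased
from net/sunrpc/clnt.c; the exact code varies by kernel version):

void rpc_shutdown_client(struct rpc_clnt *clnt)
{
        might_sleep();
        /*
         * Keep killing the client's tasks until cl_tasks drains.
         * If one of those tasks never terminates, and so never
         * leaves cl_tasks, this loop never exits; that is what
         * the stack trace above is showing.
         */
        while (!list_empty(&clnt->cl_tasks)) {
                rpc_killall_tasks(clnt);
                wait_event_timeout(destroy_wait,
                                   list_empty(&clnt->cl_tasks), 1*HZ);
        }
        rpc_release_client(clnt);
}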
Please issue these two commands on your server once it gets into the
hung state:
# rpcdebug -m rpc -c
# echo t > /proc/sysrq-trigger
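For clarity: "echo t" asks the kernel to dump every task's stack to
the kernel log. rpcdebug manipulates the kernel's RPC debug flags;
-s sets flags, -c clears them, and with neither option it just prints
the current flags:

# rpcdebug -m rpc          <- show the current RPC debug flags
# rpcdebug -m rpc -s all   <- set all RPC debug flags (very noisy)
# rpcdebug -m rpc -c       <- clear the RPC debug flags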
There were no rpcdebug entries configured, so I don't think the first
command did much.
The laundromat failure mode is not blocked in rpc_shutdown_client, so
there aren't any outstanding callback RPCs to observe.
The DESTROY_SESSION failure mode is blocking on the flush_workqueue call
in nfsd4_shutdown_callback(), while this failure mode appears to have
passed that call and blocked on the wait for in-flight RPCs to go to
zero (as Jeff analyzed a few days ago).
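For anyone following along, that shutdown path looks roughly like this
(paraphrased from fs/nfsd/nfs4callback.c; helper names differ a bit
between kernel versions):

void nfsd4_shutdown_callback(struct nfs4_client *clp)
{
        set_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags);
        /* Queue a no-op callback so the worker notices CB_KILL: */
        nfsd4_run_cb(&clp->cl_cb_null);
        /* The DESTROY_SESSION failure mode blocks here, behind a
         * stuck work item such as nfsd4_process_cb_update: */
        flush_workqueue(callback_wq);
        /* This failure mode gets past the flush and blocks here,
         * waiting for the count of in-flight callback RPCs
         * (clp->cl_cb_inflight) to drop to zero: */
        nfsd41_cb_inflight_wait_complete(clp);
}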
Next time it happens, please collect the rpcdebug and "echo t" output as
well as the trace-cmd output.
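If trace-cmd is no longer running, something along these lines should
capture the relevant events again (the event list here is only a
suggestion):

# trace-cmd record -e nfsd -e sunrpc
  (reproduce the hang, then stop the recording with Ctrl-C)
# trace-cmd report > trace.txt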
You can find the output from the second command in the attachment.
I don't see any output for "echo t > /proc/sysrq-trigger" here. What I
do see is a large number of OOM-killer notices. So, your server is out
of memory. But I think this is due to a memory leak bug, probably this
one:
I'm confused. Where do you see the OOM-killer notices in the log I
provided?
Never mind: my editor opened an old file at the same time. The window
with your dump was on another screen.
Looking at your journal now.
The first lines of the file are from after I increased
hung_task_warnings and waited a few minutes. This triggered the hung
task warning for the nfsd4 laundromat_main workqueue:
Workqueue: nfsd4 laundromat_main [nfsd]
Jan 14 09:06:45 kwak.esat.kuleuven.be kernel: Call Trace:
Then I executed the commands you provided. Are the lines after the
"Jan 14 09:07:00 kwak.esat.kuleuven.be kernel: sysrq: Show State"
line not the output you're looking for?
Regards,
Rik
https://bugzilla.kernel.org/show_bug.cgi?id=219535
So that explains the source of the frequent deleg_reaper() calls on your
system; it's the shrinker. (Note these calls are not the actual problem.
The real bug is somewhere in the callback code, but the frequent
callbacks make it easy to hit.)
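Roughly what happens, paraphrasing fs/nfsd/nfs4state.c (function and
field names are from recent kernels): under memory pressure the
shrinker schedules a work item, and that work item generates the
callback traffic:

static void
nfsd4_state_shrinker_worker(struct work_struct *work)
{
        struct nfsd_net *nn = container_of(work, struct nfsd_net,
                                           nfsd_shrinker_work);

        courtesy_client_reaper(nn);
        /* Each call here sends CB_RECALL_ANY callbacks to clients
         * holding delegations, hence the frequent callbacks: */
        deleg_reaper(nn);
}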
Please try again, but wait until you see "INFO: task nfsd:XXXX blocked
for more than 120 seconds." in the journal before issuing the rpcdebug
and "echo t" commands.
Regards,
Rik
Then gift-wrap the server's system journal and send it to me. I need
to see only the output from these two commands, so if you want to
anonymize the journal and truncate it to just the day of the failure,
I think that should be fine.
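Something like this should grab just that day's kernel messages (fill
in the dates; the anonymizing you'd have to do by hand):

# journalctl -k --since <day of failure> --until <day after> > journal.txt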
--
Chuck Lever