Hi,

On Tue, Jan 14, 2025 at 3:51 PM Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>
> On 1/14/25 3:23 AM, Rik Theys wrote:
> > Hi,
> >
> > On Mon, Jan 13, 2025 at 11:12 PM Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
> >>
> >> On 1/12/25 7:42 AM, Rik Theys wrote:
> >>> On Fri, Jan 10, 2025 at 11:07 PM Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
> >>>>
> >>>> On 1/10/25 3:51 PM, Rik Theys wrote:
> >>>>> Are there any debugging commands we can run once the issue happens
> >>>>> that can help to determine the cause of this issue?
> >>>>
> >>>> Once the issue happens, the precipitating bug has already done its
> >>>> damage, so at that point it is too late.
> >>
> >> I've studied the code and bug reports a bit. I see one intriguing
> >> mention in comment #5:
> >>
> >> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071562#5
> >>
> >> /proc/130/stack:
> >> [<0>] rpc_shutdown_client+0xf2/0x150 [sunrpc]
> >> [<0>] nfsd4_process_cb_update+0x4c/0x270 [nfsd]
> >> [<0>] nfsd4_run_cb_work+0x9f/0x150 [nfsd]
> >> [<0>] process_one_work+0x1c7/0x380
> >> [<0>] worker_thread+0x4d/0x380
> >> [<0>] kthread+0xda/0x100
> >> [<0>] ret_from_fork+0x22/0x30
> >>
> >> This tells me that the active item on the callback_wq is waiting for the
> >> backchannel RPC client to shut down. This is probably the proximal cause
> >> of the callback workqueue stall.
> >>
> >> rpc_shutdown_client() is waiting for the client's cl_tasks to become
> >> empty. Typically this is a short wait. But here, there's one or more RPC
> >> requests that are not completing.
> >>
> >> Please issue these two commands on your server once it gets into the
> >> hung state:
> >>
> >> # rpcdebug -m rpc -c
> >> # echo t > /proc/sysrq-trigger
> >
> > There were no rpcdebug entries configured, so I don't think the first
> > command did much.
> >
> > You can find the output from the second command attached.
>
> I don't see any output for "echo t > /proc/sysrq-trigger" here. What I
> do see is a large number of OOM-killer notices. So, your server is out
> of memory. But I think this is due to a memory leak bug, probably this
> one:

I'm confused. Where do you see the OOM-killer notices in the log I
provided?

The first lines of the file are from after I increased
hung_task_warnings and waited a few minutes. This triggered the hung
task warning on the nfsd4 laundromat_main workqueue:

  Workqueue: nfsd4 laundromat_main [nfsd]
  Jan 14 09:06:45 kwak.esat.kuleuven.be kernel: Call Trace:

Then I executed the commands you provided. Are the lines after the

  Jan 14 09:07:00 kwak.esat.kuleuven.be kernel: sysrq: Show State

line not the output you're looking for?

Regards,
Rik

>
> https://bugzilla.kernel.org/show_bug.cgi?id=219535
>
> So that explains the source of the frequent deleg_reaper() calls on your
> system; it's the shrinker. (Note these calls are not the actual problem.
> The real bug is somewhere in the callback code, but frequent callbacks
> are making it easy to hit the callback bug.)
>
> Please try again, but wait until you see "INFO: task nfsd:XXXX blocked
> for more than 120 seconds." in the journal before issuing the rpcdebug
> and "echo t" commands.
>
> > Regards,
> > Rik
> >
> >> Then gift-wrap the server's system journal and send it to me. I need to
> >> see only the output from these two commands, so if you want to
> >> anonymize the journal and truncate it to just the day of the failure,
> >> I think that should be fine.
> >>
> >> --
> >> Chuck Lever
>
> --
> Chuck Lever

--
Rik
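
P.S. To understand the stack above, I looked at rpc_shutdown_client()
in net/sunrpc/clnt.c. If I'm reading it correctly, it loops until the
client's task list drains. This is a paraphrased sketch from a recent
mainline tree, so the exact code in the kernel we run may differ:

  void rpc_shutdown_client(struct rpc_clnt *clnt)
  {
          /* Keep signalling the remaining tasks until they all exit. */
          while (!list_empty(&clnt->cl_tasks)) {
                  rpc_killall_tasks(clnt);
                  /* Re-check cl_tasks at least once per second. */
                  wait_event_timeout(destroy_wait,
                          list_empty(&clnt->cl_tasks), 1*HZ);
          }
          rpc_release_client(clnt);
  }

So if even one backchannel task never leaves cl_tasks, the callback_wq
worker spins here indefinitely, which would match the stack in the
Debian report.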
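
P.P.S. For the next attempt, I plan to capture everything along these
lines. The sysctl and journalctl invocations are my own guesses at the
cleanest way to do what you asked, so correct me if there's a better
approach:

  # sysctl -w kernel.hung_task_warnings=-1    # -1 = never suppress repeat warnings
  # rpcdebug -m rpc -c                        # clear sunrpc debug flags, as requested
  # echo t > /proc/sysrq-trigger              # dump all task states to the kernel log
  # journalctl -k --since today > hang.log    # kernel messages, truncated to today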