Re: nfsd4 laundromat_main hung tasks

Hi Chuck,

On Tue, Jan 14, 2025 at 8:02 PM Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>
> On 1/14/25 11:10 AM, Chuck Lever wrote:
> > On 1/14/25 10:30 AM, Rik Theys wrote:
> >> Hi,
> >>
> >> On Tue, Jan 14, 2025 at 3:51 PM Chuck Lever <chuck.lever@xxxxxxxxxx>
> >> wrote:
> >>>
> >>> On 1/14/25 3:23 AM, Rik Theys wrote:
> >>>> Hi,
> >>>>
> >>>> On Mon, Jan 13, 2025 at 11:12 PM Chuck Lever
> >>>> <chuck.lever@xxxxxxxxxx> wrote:
> >>>>>
> >>>>> On 1/12/25 7:42 AM, Rik Theys wrote:
> >>>>>> On Fri, Jan 10, 2025 at 11:07 PM Chuck Lever
> >>>>>> <chuck.lever@xxxxxxxxxx> wrote:
> >>>>>>>
> >>>>>>> On 1/10/25 3:51 PM, Rik Theys wrote:
> >>>>>>>> Are there any debugging commands we can run once the issue happens
> >>>>>>>> that can help to determine the cause of this issue?
> >>>>>>>
> >>>>>>> Once the issue happens, the precipitating bug has already done its
> >>>>>>> damage, so at that point it is too late.
> >>>>>
> >>>>> I've studied the code and bug reports a bit. I see one intriguing
> >>>>> mention in comment #5:
> >>>>>
> >>>>> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071562#5
> >>>>>
> >>>>> /proc/130/stack:
> >>>>> [<0>] rpc_shutdown_client+0xf2/0x150 [sunrpc]
> >>>>> [<0>] nfsd4_process_cb_update+0x4c/0x270 [nfsd]
> >>>>> [<0>] nfsd4_run_cb_work+0x9f/0x150 [nfsd]
> >>>>> [<0>] process_one_work+0x1c7/0x380
> >>>>> [<0>] worker_thread+0x4d/0x380
> >>>>> [<0>] kthread+0xda/0x100
> >>>>> [<0>] ret_from_fork+0x22/0x30
> >>>>>
> >>>>> This tells me that the active item on the callback_wq is waiting
> >>>>> for the
> >>>>> backchannel RPC client to shut down. This is probably the proximal
> >>>>> cause
> >>>>> of the callback workqueue stall.
> >>>>>
> >>>>> rpc_shutdown_client() is waiting for the client's cl_tasks to become
> >>>>> empty. Typically this is a short wait. But here, there's one or
> >>>>> more RPC
> >>>>> requests that are not completing.
> >>>>>
> >>>>> Please issue these two commands on your server once it gets into the
> >>>>> hung state:
> >>>>>
> >>>>> # rpcdebug -m rpc -c
> >>>>> # echo t > /proc/sysrq-trigger
> >>>>
> >>>> There were no rpcdebug entries configured, so I don't think the first
> >>>> command did much.
>
> The laundromat failure mode is not blocked in rpc_shutdown_client, so
> there aren't any outstanding callback RPCs to observe.
>
> The DESTROY_SESSION failure mode is blocking on the flush_workqueue call
> in nfsd4_shutdown_callback(), while this failure mode appears to have
> passed that call and blocked on the wait for in-flight RPCs to go to
> zero (as Jeff analyzed a few days ago).

If I look at the trace, nfs4_laundromat calls
nfs4_process_client_reaplist, which calls __destroy_client at some
point.
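
For reference, the path I mean looks roughly like this (my own sketch,
paraphrased from fs/nfsd/nfs4state.c as I read it; the exact code of
course differs between kernel versions):

    /* the laundromat work item that shows up in the hung-task trace */
    static void laundromat_main(struct work_struct *laundry)
    {
            /* ... */
            nfs4_laundromat(nn);    /* builds a reaplist of expired clients
                                     * and hands it to
                                     * nfs4_process_client_reaplist() */
            /* ... requeue the laundromat ... */
    }

    static void nfs4_process_client_reaplist(struct list_head *reaplist)
    {
            struct nfs4_client *clp, *next;

            list_for_each_entry_safe(clp, next, reaplist, cl_lru) {
                    list_del_init(&clp->cl_lru);
                    expire_client(clp);     /* ends up in __destroy_client() */
            }
    }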

When I look at the __destroy_client function in nfs4state.c, I see that
it takes spin_lock(&state_lock) and spin_unlock(&state_lock) around some
of its work, but the lock does not appear to be re-acquired by the time
nfsd4_shutdown_callback() is called. According to the comment above
nfsd4_shutdown_callback() in nfs4callback.c, that function must be
called under the state lock. Is it possible that it is being called
without the state lock held, or is that comment no longer relevant?
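
To make the question concrete, the structure I'm referring to is roughly
this (again only my paraphrase of __destroy_client(), with most of the
teardown elided):

    static void __destroy_client(struct nfs4_client *clp)
    {
            LIST_HEAD(reaplist);

            spin_lock(&state_lock);
            /* move the client's delegations onto the local reaplist */
            spin_unlock(&state_lock);

            /* revoke/put the reaplist entries, release openowners, ... */

            nfsd4_shutdown_callback(clp);   /* state_lock is not held here,
                                             * although the comment above this
                                             * function in nfs4callback.c says
                                             * it must be called under the
                                             * state lock */

            /* destroy sessions, put the final client reference, ... */
    }

If my reading is correct, then either that comment is stale, or the
laundromat/__destroy_client path is missing the locking that the
callback shutdown expects.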

Another thing I've noticed (though I'm not sure it's relevant here) is
that there's a client in /proc/fs/nfsd/clients whose states file crashes
nfsdclnts, because the states entry has no "superblock" field:

# cat 8536/{info,states}
clientid: 0x6d0596d0675df2b3
address: "10.87.29.32:864"
status: courtesy
seconds from last renew: 2807740
name: "Linux NFSv4.2 betelgeuse.esat.kuleuven.be"
minor version: 2
Implementation domain: "kernel.org"
Implementation name: "Linux 4.18.0-553.32.1.el8_10.x86_64 #1 SMP Wed
Dec 11 16:33:48 UTC 2024 x86_64"
Implementation time: [0, 0]
callback state: UNKNOWN
callback address: 10.87.29.32:0
admin-revoked states: 0
- 0x00000001b3f25d67d096056d19facf00: { type: deleg, access: w }

This is one of the clients that has multiple entries in the
/proc/fs/nfsd/clients directory, but of all the clients with duplicate
entries, this is the only one whose "broken" entry has been in the
"courtesy" state for a long time now. It's also the only "broken" client
that still has an entry in its states file; the others are all in the
"unconfirmed" state and their states files are empty.

Regards,
Rik





>
> Next time it happens, please collect the rpcdebug and "echo t" output as
> well as the trace-cmd output.
>
>
> >>>> You can find the output from the second command in attach.
> >>>
> >>> I don't see any output for "echo t > /proc/sysrq-trigger" here. What I
> >>> do see is a large number of OOM-killer notices. So, your server is out
> >>> of memory. But I think this is due to a memory leak bug, probably this
> >>> one:
> >>
> >> I'm confused. Where do you see the OOM-killer notices in the log I
> >> provided?
> >
> > Never mind: my editor opened an old file at the same time. The window
> > with your dump was on another screen.
> >
> > Looking at your journal now.
> >
> >
> >> The first lines of the file are from after increasing the
> >> hung_task_warnings setting and waiting a few minutes. This triggered the
> >> hung task warning on the nfsd4 laundromat_main workqueue:
> >>
> >> Workqueue: nfsd4 laundromat_main [nfsd]
> >> Jan 14 09:06:45 kwak.esat.kuleuven.be kernel: Call Trace:
> >>
> >> Then I executed the commands you provided. Are the lines after the
> >>
> >> Jan 14 09:07:00 kwak.esat.kuleuven.be kernel: sysrq: Show State
> >>
> >> line not the output you're looking for?
> >>
> >> Regards,
> >> Rik
> >>
> >>>
> >>>      https://bugzilla.kernel.org/show_bug.cgi?id=219535
> >>>
> >>> So that explains the source of the frequent deleg_reaper() calls on your
> >>> system; it's the shrinker. (Note these calls are not the actual problem.
> >>> The real bug is somewhere in the callback code, but frequent callbacks
> >>> are making it easy to hit the callback bug).
> >>>
> >>> Please try again, but wait until you see "INFO: task nfsd:XXXX blocked
> >>> for more than 120 seconds." in the journal before issuing the rpcdebug
> >>> and "echo t" commands.
> >>>
> >>>
> >>>> Regards,
> >>>> Rik
> >>>>
> >>>>>
> >>>>> Then gift-wrap the server's system journal and send it to me. I
> >>>>> need to
> >>>>> see only the output from these two commands, so if you want to
> >>>>> anonymize the journal and truncate it to just the day of the failure,
> >>>>> I think that should be fine.
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Chuck Lever
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >>> --
> >>> Chuck Lever
> >>
> >>
> >>
> >
> >
>
>
> --
> Chuck Lever



-- 

Rik




