nfsd4 laundromat_main hung tasks

Rik Theys <rik.theys@xxxxxxxxx> · Fri, 10 Jan 2025 20:49:57 +0100

Hi,

Our Rocky 9 NFS server running the upstream 6.11.11 kernel is starting
to log the following hung task messages:

INFO: task kworker/u194:11:1677933 blocked for more than 215285 seconds.
      Tainted: G        W   E      6.11.11-1.el9.esat.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u194:11 state:D stack:0     pid:1677933 tgid:1677933
ppid:2      flags:0x00004000
Workqueue: nfsd4 laundromat_main [nfsd]
Call Trace:
 <TASK>
 __schedule+0x21c/0x5d0
 ? preempt_count_add+0x47/0xa0
 schedule+0x26/0xa0
 nfsd4_shutdown_callback+0xea/0x120 [nfsd]
 ? __pfx_var_wake_function+0x10/0x10
 __destroy_client+0x1f0/0x290 [nfsd]
 nfs4_process_client_reaplist+0xa1/0x110 [nfsd]
 nfs4_laundromat+0x126/0x7a0 [nfsd]
 ? _raw_spin_unlock_irqrestore+0x23/0x40
 laundromat_main+0x16/0x40 [nfsd]
 process_one_work+0x179/0x390
 worker_thread+0x239/0x340
 ? __pfx_worker_thread+0x10/0x10
 kthread+0xdb/0x110
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x2d/0x50
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>

If I read this correctly, it seems to be blocked on a callback
operation during shutdown of a client connection?

Is this a known issue that may be fixed in the 6.12.x kernel? Could
the following commit be relevant?

8dd91e8d31febf4d9cca3ae1bb4771d33ae7ee5a    nfsd: fix race between
laundromat and free_stateid

If I increase the hung_task_warnings sysctl and wait a few minutes,
the hung task message appears again, so the issue is still present on
the system. How can I debug which client is causing this issue?

Is there any other information I can provide?

Could this be related to the following thread:
https://lore.kernel.org/linux-nfs/Z2vNQ6HXfG_LqBQc@xxxxxxxxxxx/T/#u ?

I don't know if this is relevant but I've noticed that some clients
have multiple entries in the /proc/fs/nfsd/clients directory, so I
assume these clients are not cleaned up correctly?

For example:

clientid: 0x6d077c99675df2b3
address: "10.87.29.32:864"
status: confirmed
seconds from last renew: 0
name: "Linux NFSv4.2 betelgeuse.esat.kuleuven.be"
minor version: 2
Implementation domain: "kernel.org"
Implementation name: "Linux 4.18.0-553.32.1.el8_10.x86_64 #1 SMP Wed
Dec 11 16:33:48 UTC 2024 x86_64"
Implementation time: [0, 0]
callback state: UP
callback address: 10.87.29.32:0
admin-revoked states: 0
***
clientid: 0x6d0596d0675df2b3
address: "10.87.29.32:864"
status: courtesy
seconds from last renew: 2288446
name: "Linux NFSv4.2 betelgeuse.esat.kuleuven.be"
minor version: 2
Implementation domain: "kernel.org"
Implementation name: "Linux 4.18.0-553.32.1.el8_10.x86_64 #1 SMP Wed
Dec 11 16:33:48 UTC 2024 x86_64"
Implementation time: [0, 0]
callback state: UP
callback address: 10.87.29.32:0
admin-revoked states: 0

The first one has status confirmed and the second one "courtesy" with
a high "seconds from last renew". The address and port matches for
both client entries and the callback state is both UP.

For another client, there's a different output:

clientid: 0x6d078a79675df2b3
address: "10.33.130.34:864"
status: unconfirmed
seconds from last renew: 158910
name: "Linux NFSv4.2 bujarski.esat.kuleuven.be"
minor version: 2
Implementation domain: "kernel.org"
Implementation name: "Linux 5.14.0-503.19.1.el9_5.x86_64 #1 SMP
PREEMPT_DYNAMIC Thu Dec 19 12:55:03 UTC 2024 x86_64"
Implementation time: [0, 0]
callback state: UNKNOWN
callback address: (einval)
admin-revoked states: 0
***
clientid: 0x6d078a7a675df2b3
address: "10.33.130.34:864"
status: confirmed
seconds from last renew: 2
name: "Linux NFSv4.2 bujarski.esat.kuleuven.be"
minor version: 2
Implementation domain: "kernel.org"
Implementation name: "Linux 5.14.0-503.19.1.el9_5.x86_64 #1 SMP
PREEMPT_DYNAMIC Thu Dec 19 12:55:03 UTC 2024 x86_64"
Implementation time: [0, 0]
callback state: UP
callback address: 10.33.130.34:0
admin-revoked states: 0

Here the first status is unconfirmed and the callback state is UNKNOWN.

The clients are Rocky 8, Rocky 9 and Fedora 41 clients.

Regards,

Rik