Re: nfsd in D state, another case

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Miha,

I don't have any solution for this problem based on the provided data.
However next time when this problem occurs, please do the following on
the NFS server to gather some diagnostic info so we can investigate:

1. dump the workqueues:

#  echo t > /proc/sysrq-trigger

Output is written to /var/log/messages.

2. gather nfsd and rpc task info:

# (for tt in $(grep -l 'nfs' /proc/*/stack); do echo "${tt}:"; cat ${tt}; echo; done) > /tmp/nfs_threads.txt

# cat /sys/kernel/debug/sunrpc/rpc_clnt/*/tasks > /tmp/rpc_tasks.txt


You can put /var/log/messages, nfs_threads.txt and rpc_tasks.txt at some
public place where we can get to, or you can email them directly to me.

Thanks,
-Dai

On 7/3/23 2:05 AM, Miha Vrhovnik wrote:
Hey everybody,

we are experiencing a similar issue to Dr. Herzog [1]. the details and
our configuration is a bit different than his, but I'd assume that the
same bug is behind this.

It's a single server per region, with NFS is on local storage formatted as ext4.
The servers don't have a swap and have plenty of free memory. 16G-20G
used (without caches) out of 64G.
The amount of data stored is rather small about 1G, with about 100.000
files over 17.000 directories

We've upgraded the servers from Ubuntu 20.04 to Ubuntu 22.04 about a
month ago and we had four outages since than.Our clients are still on
Ubuntu 20.04 if this helps
Our setup is rather small.But the impacts aprox. 10.000 users per region.

And the packages in Ubuntu 22.04 are a bit older than the ones in
Debian 12 (bookworm )

apt list --installed '*nfs*'
Listing... Done
libnfsidmap1/jammy-updates,now 1:2.6.1-1ubuntu1.2 amd64 [installed,automatic]
nfs-common/jammy-updates,now 1:2.6.1-1ubuntu1.2 amd64 [installed]
nfs-kernel-server/jammy-updates,now 1:2.6.1-1ubuntu1.2 amd64 [installed]

and the kernel
Linux redacted 5.19.0-1026-gcp #28~22.04.1-Ubuntu SMP Tue Jun 6
07:24:26 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Stack-trace is a different from the one provided by Dr. Herzog. But
the same between servers.
so only full server restart helps. And once this happened almost
immediately after the server was rebooted.

This is the first time that it happened that day.

Jun 26 11:34:07 redacted kernel: [326687.684554] receive_cb_reply: Got
unrecognized reply: calldir 0x1 xpt_bc_xprt 000000000dddae84 xid
bb78d292
Jun 26 11:34:07 redacted kernel: [326687.684634] receive_cb_reply: Got
unrecognized reply: calldir 0x1 xpt_bc_xprt 000000000dddae84 xid
ba78d292
Jun 26 11:41:38 redacted kernel: [327139.472177] receive_cb_reply: Got
unrecognized reply: calldir 0x1 xpt_bc_xprt 000000004df70d45 xid
3bfdc2ea
Jun 26 11:44:30 redacted kernel: [327310.678539] INFO: task nfsd:848
blocked for more than 120 seconds.
Jun 26 11:44:30 redacted kernel: [327310.685086]       Not tainted
5.19.0-1026-gcp #28~22.04.1-Ubuntu
Jun 26 11:44:30 redacted kernel: [327310.691358] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 26 11:44:30 redacted kernel: [327310.699442] task:nfsd
state:D stack:    0 pid:  848 ppid:     2 flags:0x00004000
Jun 26 11:44:30 redacted kernel: [327310.699451] Call Trace:
Jun 26 11:44:30 redacted kernel: [327310.699456]  <TASK>
Jun 26 11:44:30 redacted kernel: [327310.699464]  __schedule+0x248/0x5d0
Jun 26 11:44:30 redacted kernel: [327310.699478]  schedule+0x58/0x100
Jun 26 11:44:30 redacted kernel: [327310.699482]  schedule_timeout+0x10d/0x140
Jun 26 11:44:30 redacted kernel: [327310.699489]  __wait_for_common+0x99/0x1b0
Jun 26 11:44:30 redacted kernel: [327310.699493]  ? usleep_range_state+0xa0/0xa0
Jun 26 11:44:30 redacted kernel: [327310.699497]  wait_for_completion+0x24/0x40
Jun 26 11:44:30 redacted kernel: [327310.699501]  __flush_workqueue+0x140/0x3e0
Jun 26 11:44:30 redacted kernel: [327310.699512]
nfsd4_probe_callback_sync+0x1a/0x30 [nfsd]
Jun 26 11:44:30 redacted kernel: [327310.699566]
nfsd4_destroy_session+0x184/0x230 [nfsd]
Jun 26 11:44:30 redacted kernel: [327310.699597]
nfsd4_proc_compound+0x42b/0x770 [nfsd]
Jun 26 11:44:30 redacted kernel: [327310.699629]
nfsd_dispatch+0x174/0x270 [nfsd]
Jun 26 11:44:30 redacted kernel: [327310.699670]
svc_process_common+0x2a5/0x610 [sunrpc]
Jun 26 11:44:30 redacted kernel: [327310.699760]  ?
svc_handle_xprt+0x166/0x350 [sunrpc]
Jun 26 11:44:30 redacted kernel: [327310.699807]  ? nfsd_svc+0x1a0/0x1a0 [nfsd]
Jun 26 11:44:30 redacted kernel: [327310.699835]  ?
nfsd_shutdown_threads+0xa0/0xa0 [nfsd]
Jun 26 11:44:30 redacted kernel: [327310.699863]
svc_process+0xba/0x110 [sunrpc]
Jun 26 11:44:30 redacted kernel: [327310.699906]  nfsd+0xd1/0x1a0 [nfsd]
Jun 26 11:44:30 redacted kernel: [327310.699932]  kthread+0xce/0xf0
Jun 26 11:44:30 redacted kernel: [327310.699937]  ?
kthread_complete_and_exit+0x20/0x20
Jun 26 11:44:30 redacted kernel: [327310.699941]  ret_from_fork+0x1f/0x30
Jun 26 11:44:30 redacted kernel: [327310.699948]  </TASK>



This is about 10 minutes after reboot:

Jun 26 12:45:13 redacted kernel: [  373.274344] receive_cb_reply: Got
unrecognized reply: calldir 0x1 xpt_bc_xprt 0000000016b8f89c xid
b4752784
Jun 26 12:49:06 redacted kernel: [  605.568257] INFO: task nfsd:887
blocked for more than 120 seconds.
Jun 26 12:49:06 redacted kernel: [  605.574821]       Not tainted
5.19.0-1026-gcp #28~22.04.1-Ubuntu
Jun 26 12:49:06 redacted kernel: [  605.581432] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 26 12:49:06 redacted kernel: [  605.589914] task:nfsd
state:D stack:    0 pid:  887 ppid:     2 flags:0x00004000
Jun 26 12:49:06 redacted kernel: [  605.589926] Call Trace:
Jun 26 12:49:06 redacted kernel: [  605.589931]  <TASK>
Jun 26 12:49:06 redacted kernel: [  605.589937]  __schedule+0x248/0x5d0
Jun 26 12:49:06 redacted kernel: [  605.590112]  schedule+0x58/0x100
Jun 26 12:49:06 redacted kernel: [  605.590117]  schedule_timeout+0x10d/0x140
Jun 26 12:49:06 redacted kernel: [  605.590125]  __wait_for_common+0x99/0x1b0
Jun 26 12:49:06 redacted kernel: [  605.590129]  ? usleep_range_state+0xa0/0xa0
Jun 26 12:49:06 redacted kernel: [  605.590134]  wait_for_completion+0x24/0x40
Jun 26 12:49:06 redacted kernel: [  605.590138]  __flush_workqueue+0x140/0x3e0
Jun 26 12:49:06 redacted kernel: [  605.590147]
nfsd4_probe_callback_sync+0x1a/0x30 [nfsd]
Jun 26 12:49:06 redacted kernel: [  605.590199]
nfsd4_destroy_session+0x184/0x230 [nfsd]
Jun 26 12:49:06 redacted kernel: [  605.590235]
nfsd4_proc_compound+0x42b/0x770 [nfsd]
Jun 26 12:49:06 redacted kernel: [  605.590270]
nfsd_dispatch+0x174/0x270 [nfsd]
Jun 26 12:49:06 redacted kernel: [  605.590303]
svc_process_common+0x2a5/0x610 [sunrpc]
Jun 26 12:49:06 redacted kernel: [  605.590392]  ?
svc_handle_xprt+0x166/0x350 [sunrpc]
Jun 26 12:49:06 redacted kernel: [  605.590447]  ? nfsd_svc+0x1a0/0x1a0 [nfsd]
Jun 26 12:49:06 redacted kernel: [  605.590479]  ?
nfsd_shutdown_threads+0xa0/0xa0 [nfsd]
Jun 26 12:49:06 redacted kernel: [  605.590510]  svc_process+0xba/0x110 [sunrpc]
Jun 26 12:49:06 redacted kernel: [  605.590561]  nfsd+0xd1/0x1a0 [nfsd]
Jun 26 12:49:06 redacted kernel: [  605.590598]  kthread+0xce/0xf0
Jun 26 12:49:06 redacted kernel: [  605.590603]  ?
kthread_complete_and_exit+0x20/0x20
Jun 26 12:49:06 redacted kernel: [  605.590608]  ret_from_fork+0x1f/0x30
Jun 26 12:49:06 redacted kernel: [  605.590616]  </TASK>

when things started to go really bad we gt another stacktrace:
Jun 26 12:53:07 redacted kernel: [  847.232530] INFO: task
kworker/u8:3:5078 blocked for more than 120 seconds.
Jun 26 12:53:07 redacted kernel: [  847.239937]       Not tainted
5.19.0-1026-gcp #28~22.04.1-Ubuntu
Jun 26 12:53:08 redacted kernel: [  847.246163] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 26 12:53:08 redacted kernel: [  847.254308] task:kworker/u8:3
state:D stack:    0 pid: 5078 ppid:     2 flags:0x00004000
Jun 26 12:53:08 redacted kernel: [  847.254328] Workqueue: nfsd4
laundromat_main [nfsd]
Jun 26 12:53:08 redacted kernel: [  847.254438] Call Trace:
Jun 26 12:53:08 redacted kernel: [  847.254443]  <TASK>
Jun 26 12:53:08 redacted kernel: [  847.254449]  __schedule+0x248/0x5d0
Jun 26 12:53:08 redacted kernel: [  847.254460]  ? available_idle_cpu+0x66/0xa0
Jun 26 12:53:08 redacted kernel: [  847.254467]  schedule+0x58/0x100
Jun 26 12:53:08 redacted kernel: [  847.254472]  schedule_timeout+0x10d/0x140
Jun 26 12:53:08 redacted kernel: [  847.254478]  __wait_for_common+0x99/0x1b0
Jun 26 12:53:08 redacted kernel: [  847.254482]  ? usleep_range_state+0xa0/0xa0
Jun 26 12:53:08 redacted kernel: [  847.254485]  wait_for_completion+0x24/0x40
Jun 26 12:53:08 redacted kernel: [  847.254489]  __flush_workqueue+0x140/0x3e0
Jun 26 12:53:08 redacted kernel: [  847.254494]
nfsd4_shutdown_callback+0x4d/0x130 [nfsd]
Jun 26 12:53:08 redacted kernel: [  847.254525]  ?
nfsd4_return_all_client_layouts+0xc9/0x160 [nfsd]
Jun 26 12:53:08 redacted kernel: [  847.254556]  ?
nfsd4_shutdown_copy+0x70/0xb0 [nfsd]
Jun 26 12:53:08 redacted kernel: [  847.254600]
__destroy_client+0x1a1/0x210 [nfsd]
Jun 26 12:53:08 redacted kernel: [  847.254630]  expire_client+0x57/0x70 [nfsd]
Jun 26 12:53:08 redacted kernel: [  847.254659]
nfs4_laundromat+0x26e/0x900 [nfsd]
Jun 26 12:53:08 redacted kernel: [  847.254690]  ?
finish_task_switch.isra.0+0x82/0x290
Jun 26 12:53:08 redacted kernel: [  847.254697]
laundromat_main+0x19/0x50 [nfsd]
Jun 26 12:53:08 redacted kernel: [  847.254726]  process_one_work+0x222/0x3c0
Jun 26 12:53:08 redacted kernel: [  847.254731]  worker_thread+0x50/0x3f0
Jun 26 12:53:08 redacted kernel: [  847.254734]  ? process_one_work+0x3c0/0x3c0
Jun 26 12:53:08 redacted kernel: [  847.254736]  kthread+0xce/0xf0
Jun 26 12:53:08 redacted kernel: [  847.254742]  ?
kthread_complete_and_exit+0x20/0x20
Jun 26 12:53:08 redacted kernel: [  847.254746]  ret_from_fork+0x1f/0x30
Jun 26 12:53:08 redacted kernel: [  847.254754]  </TASK>


1 - https://marc.info/?l=linux-nfs&m=168258759703478&w=2

Regards,
Miha



[Index of Archives]     [Linux Filesystem Development]     [Linux USB Development]     [Linux Media Development]     [Video for Linux]     [Linux NILFS]     [Linux Audio Users]     [Yosemite Info]     [Linux SCSI]

  Powered by Linux