Hello again,

Three weeks ago we reported on nfsd D state-induced freezes on our bookworm-upgraded file servers [1]. The general consensus at the time seems to have been that the real issue was deeper in our storage stack, so we headed over to linux-block but were never able to pinpoint the issue.

We just had another instance where all 64 of our nfsd processes were stuck in D state. This time the stack traces look different and we have some more hints in our logs, and this time we're pretty sure it's nfsd and not general block IO.

All 64 nfsds have similar stack traces:

14 processes:
[<0>] __flush_workqueue+0x152/0x420
[<0>] nfsd4_shutdown_callback+0x49/0x130 [nfsd]
[<0>] __destroy_client+0x1f3/0x290 [nfsd]
[<0>] nfsd4_exchange_id+0x752/0x760 [nfsd]
[<0>] nfsd4_proc_compound+0x352/0x660 [nfsd]
[<0>] nfsd_dispatch+0x167/0x280 [nfsd]
[<0>] svc_process_common+0x286/0x5e0 [sunrpc]
[<0>] svc_process+0xad/0x100 [sunrpc]
[<0>] nfsd+0xd5/0x190 [nfsd]
[<0>] kthread+0xe6/0x110
[<0>] ret_from_fork+0x1f/0x30

9 processes:
[<0>] __flush_workqueue+0x152/0x420
[<0>] nfsd4_shutdown_callback+0x49/0x130 [nfsd]
[<0>] __destroy_client+0x1f3/0x290 [nfsd]
[<0>] nfsd4_exchange_id+0x358/0x760 [nfsd]
[<0>] nfsd4_proc_compound+0x352/0x660 [nfsd]
[<0>] nfsd_dispatch+0x167/0x280 [nfsd]
[<0>] svc_process_common+0x286/0x5e0 [sunrpc]
[<0>] svc_process+0xad/0x100 [sunrpc]
[<0>] nfsd+0xd5/0x190 [nfsd]
[<0>] kthread+0xe6/0x110
[<0>] ret_from_fork+0x1f/0x30

41 processes:
[<0>] __flush_workqueue+0x152/0x420
[<0>] nfsd4_destroy_session+0x1b6/0x250 [nfsd]
[<0>] nfsd4_proc_compound+0x352/0x660 [nfsd]
[<0>] nfsd_dispatch+0x167/0x280 [nfsd]
[<0>] svc_process_common+0x286/0x5e0 [sunrpc]
[<0>] svc_process+0xad/0x100 [sunrpc]
[<0>] nfsd+0xd5/0x190 [nfsd]
[<0>] kthread+0xe6/0x110
[<0>] ret_from_fork+0x1f/0x30

20 minutes prior to the first frozen nfsds, we saw messages similar to

receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 00000000fcdd40ac xid 182df75c

It seems these messages come from receive_cb_reply [2], and it looks like xprt_lookup_rqst cannot find the RPC request belonging to a certain transaction. We see these messages with different values for xpt_bc_xprt, which, we think, correspond to the different NFS clients.

All this is on production file servers running Debian bookworm with iSCSI block devices and XFS file systems.

Does anyone have any suggestions on how to debug this further? Unfortunately we have yet to find a way to trigger it deliberately; for the time being it happens whenever it happens...

thanks and best regards,
-Christian

[1] https://www.spinics.net/lists/linux-nfs/msg96048.html
[2] https://elixir.bootlin.com/linux/v6.1.20/source/net/sunrpc/svcsock.c#L902

--
Dr. Christian Herzog <herzog@xxxxxxxxxxxx>  support: +41 44 633 26 68
Head, IT Services Group, HPT H 8              voice: +41 44 633 39 50
Department of Physics, ETH Zurich
8093 Zurich, Switzerland                  http://isg.phys.ethz.ch/
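
For anyone who wants to collect comparable data on their own servers: below is a minimal sketch of how per-trace counts like the ones above could be gathered. It is not the script used for this report. It assumes a Linux procfs, root privileges (needed to read /proc/<pid>/stack) and that the nfsd kernel threads carry the comm name "nfsd"; the function names are made up for illustration.

```python
import os
from collections import Counter


def read_proc(pid, name):
    """Return the contents of /proc/<pid>/<name>, or None if unreadable."""
    try:
        with open(f"/proc/{pid}/{name}") as fh:
            return fh.read()
    except OSError:
        return None


def dstate_nfsd_stacks():
    """Yield (pid, kernel stack) for every nfsd thread in D state."""
    try:
        pids = [p for p in os.listdir("/proc") if p.isdigit()]
    except OSError:
        return  # no procfs available on this system
    for pid in pids:
        stat = read_proc(pid, "stat")
        if stat is None:
            continue  # thread exited between listdir() and open()
        # Format is "pid (comm) state ..."; split at the last ')' because
        # comm itself may contain spaces or parentheses.
        comm = stat[stat.index("(") + 1:stat.rindex(")")]
        state = stat[stat.rindex(")") + 1:].split()[0]
        if comm == "nfsd" and state == "D":
            stack = read_proc(pid, "stack")  # None unless run as root
            if stack:
                yield int(pid), stack


def group_stacks(stacks):
    """Count identical traces.

    Offsets are kept on purpose, so traces that differ only in the call
    site (e.g. nfsd4_exchange_id+0x752 vs. +0x358) stay in separate groups.
    """
    normalized = (
        "\n".join(line.strip() for line in s.splitlines() if line.strip())
        for s in stacks
    )
    return Counter(normalized)


if __name__ == "__main__":
    groups = group_stacks(stack for _, stack in dstate_nfsd_stacks())
    for trace, count in groups.most_common():
        print(f"{count} processes:\n{trace}\n")
```

Run as root while the threads are stuck, it prints one block per distinct trace, largest group first. On a system without procfs, or without root, it simply prints nothing.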