Dear all,

we just had another freeze on one of our bookworm file servers. The scenario is a bit different, but the root cause might be just the same.

So what happened:

- the server had been happily serving NFS + SMB for two weeks
- today I noticed a left-over rsync process from a recent backup run that didn't do any IO and was in D state
- I killed this rsync process, but since it was in D, it never died
- after a few minutes I noticed an nfsd in D state too (but just one). I watched it for a bit and then decided to try "service nfs-kernel-server restart" to see if nfs was involved again. I guess it was...
- from then on, all sorts of processes entered eternal D: several smbd, autofs, the rsync and one nfsd
- however: at all times, the underlying file systems seemed perfectly fine. We could write to every single one of them and gdu the hundred-TiB ones without a problem
- my impression is that at least this time, nfsd was just one of the victims of a deeper problem
- we took all the forensics suggested last time by Kuai and Bob.
I don't really understand them, but here are the facts:

- memory on the machine is completely uncritical, < 20% used
- the rqos/wbt/inflight of all block devices are 0 (remember: those are iSCSI LUNs)
- all the hctx* values seem unsuspicious to me, but what do I know
- the stack traces of the D processes don't show any rq_qos_wait this time

here's the D rsync trace:

[<0>] iterate_dir+0x52/0x1c0
[<0>] __x64_sys_getdents64+0x84/0x120
[<0>] do_syscall_64+0x58/0xc0
[<0>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

and the D nfsd:

[<0>] vfs_rename+0x266/0xd70
[<0>] nfsd_rename+0x327/0x470 [nfsd]
[<0>] nfsd4_rename+0x53/0x110 [nfsd]
[<0>] nfsd4_proc_compound+0x352/0x660 [nfsd]
[<0>] nfsd_dispatch+0x167/0x280 [nfsd]
[<0>] svc_process_common+0x286/0x5e0 [sunrpc]
[<0>] svc_process+0xad/0x100 [sunrpc]
[<0>] nfsd+0xd5/0x190 [nfsd]
[<0>] kthread+0xe6/0x110
[<0>] ret_from_fork+0x1f/0x30

all the forensics are contained in https://people.phys.ethz.ch/~daduke/freeze.tgz

we would be extremely grateful for any hints on how we can debug (or even solve) this. We're really at a loss here...

thanks and kind regards,
-Christian

-- 
Dr. Christian Herzog <herzog@xxxxxxxxxxxx>  support: +41 44 633 26 68
Head, IT Services Group, HPT H 8              voice: +41 44 633 39 50
Department of Physics, ETH Zurich
8093 Zurich, Switzerland                     http://isg.phys.ethz.ch/
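
P.S. For anyone who wants to collect the same kind of traces on their own machine: stacks in the "[<0>] ..." format shown above can be read from /proc/<pid>/stack for every task whose state in /proc/<pid>/stat is D. Below is a minimal Python sketch of that idea. It is not the exact procedure used for the tarball above; the function names (parse_stat, d_state_tasks) are made up for illustration, and it assumes root privileges and a kernel that exposes /proc/<pid>/stack. It only walks the top-level /proc entries, not the per-thread entries under /proc/<pid>/task.

```python
#!/usr/bin/env python3
"""List tasks in uninterruptible sleep (state D) with their kernel stacks."""
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    lparen = stat_text.index("(")
    rparen = stat_text.rindex(")")
    pid = int(stat_text[:lparen])
    comm = stat_text[lparen + 1:rparen]
    state = stat_text[rparen + 1:].split()[0]
    return pid, comm, state


def d_state_tasks(proc_root="/proc"):
    """Yield (pid, comm, stack) for every process currently in state D."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return  # no procfs here (e.g. not running on Linux)
    for entry in sorted(entries):
        if not entry.isdigit():
            continue  # skip non-PID entries such as /proc/meminfo
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited while we were looking, or odd content
        if state != "D":
            continue
        try:
            with open(os.path.join(base, "stack")) as f:
                stack = f.read()
        except OSError as exc:
            # Assumption: reading the stack needs root and kernel support.
            stack = f"(stack unavailable: {exc})\n"
        yield pid, comm, stack


if __name__ == "__main__":
    for pid, comm, stack in d_state_tasks():
        print(f"{pid} {comm}")
        print(stack)
```

Run as root while the hang is in progress and redirect the output to a file, so the traces survive a reboot.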