Re: file server freezes with all nfsds stuck in D state after upgrade to Debian bookworm

Dear all,

we just had another freeze on one of our bookworm file servers. The scenario
is a bit different, but the root cause might be just the same. So what
happened:

- the server had been happily serving NFS + SMB for two weeks
- today I noticed a left-over rsync process from a recent backup run that
  didn't do any IO and was in D state
- I killed this rsync process, but since it was in D, it never died
- after a few minutes I noticed an nfsd in D state too (but just one). I
  watched it for a bit and then decided to try "service nfs-kernel-server
  restart" to see whether NFS was involved again. I guess it was...
- from then on, all sorts of processes entered eternal D: several smbd,
  autofs, the rsync and one nfsd
- however: at all times, the underlying file systems seemed perfectly fine. We
  could write to every single one of them and gdu the hundred-TiB ones without
  a problem
- my impression is that at least this time, nfsd was just one of the victims
  of a deeper problem
- we took all the forensics suggested last time by Kuai and Bob. I don't
  really understand them, but here are the facts:
  - memory on the machine is not a concern, < 20% used
  - the rqos/wbt/inflight of all block devices are 0 (remember: those are
    iSCSI LUNs)
  - all the hctx* values look unremarkable to me, but what do I know
  - the stack traces of the D processes don't show any rq_qos_wait this time
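
For anyone who wants to repeat the inflight check on their own machine, here
is a minimal sketch of how the rqos/wbt/inflight values could be gathered for
all block devices at once. It assumes debugfs is mounted at /sys/kernel/debug,
that the script runs as root, and that each file contains lines of the form
"N: inflight M"; the function names wbt_inflight and busy_devices are made up
for this illustration. If the file format differs on your kernel, the list of
counters simply comes back empty.

```python
import glob
import os
import re


def wbt_inflight(debug_root="/sys/kernel/debug/block"):
    """Map each block device name to the list of its wbt inflight counters.

    debug_root is an assumption: the usual debugfs location for block devices.
    """
    result = {}
    pattern = os.path.join(debug_root, "*", "rqos", "wbt", "inflight")
    for path in sorted(glob.glob(pattern)):
        # First path component below debug_root is the device name (e.g. sda).
        dev = os.path.relpath(path, debug_root).split(os.sep)[0]
        try:
            with open(path) as fh:
                text = fh.read()
        except OSError:
            continue  # device vanished or insufficient permissions
        # Assumed line format: "0: inflight 0"; unknown formats yield [].
        result[dev] = [int(n) for n in re.findall(r"inflight\s+(-?\d+)", text)]
    return result


def busy_devices(inflight):
    """Keep only the devices that have at least one non-zero counter."""
    return {dev: counts for dev, counts in inflight.items() if any(counts)}


if __name__ == "__main__":
    for dev, counts in wbt_inflight().items():
        print(dev, counts)
```

An empty result from busy_devices(wbt_inflight()) would correspond to what we
saw here: every counter at 0.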

here's the D rsync trace:

[<0>] iterate_dir+0x52/0x1c0
[<0>] __x64_sys_getdents64+0x84/0x120
[<0>] do_syscall_64+0x58/0xc0
[<0>] entry_SYSCALL_64_after_hwframe+0x63/0xcd


and the D nfsd:

[<0>] vfs_rename+0x266/0xd70
[<0>] nfsd_rename+0x327/0x470 [nfsd]
[<0>] nfsd4_rename+0x53/0x110 [nfsd]
[<0>] nfsd4_proc_compound+0x352/0x660 [nfsd]
[<0>] nfsd_dispatch+0x167/0x280 [nfsd]
[<0>] svc_process_common+0x286/0x5e0 [sunrpc]
[<0>] svc_process+0xad/0x100 [sunrpc]
[<0>] nfsd+0xd5/0x190 [nfsd]
[<0>] kthread+0xe6/0x110
[<0>] ret_from_fork+0x1f/0x30
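
Traces like the two above can be collected for every D-state process in one
go. The following is a minimal sketch, not the exact procedure we used: it
assumes a Linux /proc, root privileges, and a kernel that exposes
/proc/<pid>/stack. The function names d_state_pids and kernel_stack are
hypothetical. It only looks at top-level /proc/<pid> entries, so threads of a
multi-threaded process listed under /proc/<pid>/task/ are not covered.

```python
import os


def d_state_pids(proc_root="/proc"):
    """Return a sorted list of (pid, comm) for processes in state D."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return []  # no procfs available at this path
    found = []
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                data = fh.read()
        except OSError:
            continue  # process exited while we were scanning
        # comm sits in parentheses and may itself contain spaces or ')',
        # so split on the first '(' and the last ')'.
        lpar, rpar = data.find("("), data.rfind(")")
        if lpar < 0 or rpar < lpar:
            continue
        fields = data[rpar + 1:].split()
        if fields and fields[0] == "D":
            found.append((int(entry), data[lpar + 1:rpar]))
    return sorted(found)


def kernel_stack(pid, proc_root="/proc"):
    """Return the kernel stack of pid as text, or a marker if unreadable."""
    try:
        with open(os.path.join(proc_root, str(pid), "stack")) as fh:
            return fh.read()
    except OSError as exc:
        return f"<unreadable: {exc}>\n"


if __name__ == "__main__":
    for pid, comm in d_state_pids():
        print(f"== {pid} {comm}")
        print(kernel_stack(pid), end="")
```

Running this a few times a minute apart shows whether the same processes stay
stuck at the same frames.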

all the forensics are contained in
https://people.phys.ethz.ch/~daduke/freeze.tgz

we would be extremely grateful for any hints on how we can debug (or even
solve) this. We're really at a loss here...


thanks and kind regards,
-Christian


-- 
Dr. Christian Herzog <herzog@xxxxxxxxxxxx>  support: +41 44 633 26 68
Head, IT Services Group, HPT H 8              voice: +41 44 633 39 50
Department of Physics, ETH Zurich           
8093 Zurich, Switzerland                     http://isg.phys.ethz.ch/


