file server freezes with all nfsds stuck in D state after upgrade to Debian bookworm

Dear all,

for our researchers we are running file servers in the hundreds-of-TiB to
low-PiB range that export via NFS and SMB. Storage is iSCSI-over-InfiniBand
LUNs LVM'ed into individual XFS file systems. With Ubuntu 18.04 nearing EOL,
we prepared an upgrade to Debian bookworm, and tests went well. About a week
after one of the upgrades, we ran into the first occurrence of our problem:
all of a sudden, all nfsds enter the D state and do not recover. However, the
underlying file systems seem fine and can still be read and written to. The
only way out appears to be rebooting the server. The only clues are the
frozen nfsds and stack traces like

[<0>] rq_qos_wait+0xbc/0x130
[<0>] wbt_wait+0xa2/0x110
[<0>] __rq_qos_throttle+0x20/0x40
[<0>] blk_mq_submit_bio+0x2d3/0x580
[<0>] submit_bio_noacct_nocheck+0xf7/0x2c0
[<0>] iomap_submit_ioend+0x4b/0x80
[<0>] iomap_do_writepage+0x4b4/0x820
[<0>] write_cache_pages+0x180/0x4c0
[<0>] iomap_writepages+0x1c/0x40
[<0>] xfs_vm_writepages+0x79/0xb0 [xfs]
[<0>] do_writepages+0xbd/0x1c0
[<0>] filemap_fdatawrite_wbc+0x5f/0x80
[<0>] __filemap_fdatawrite_range+0x58/0x80
[<0>] file_write_and_wait_range+0x41/0x90
[<0>] xfs_file_fsync+0x5a/0x2a0 [xfs]
[<0>] nfsd_commit+0x93/0x190 [nfsd]
[<0>] nfsd4_commit+0x5e/0x90 [nfsd]
[<0>] nfsd4_proc_compound+0x352/0x660 [nfsd]
[<0>] nfsd_dispatch+0x167/0x280 [nfsd]
[<0>] svc_process_common+0x286/0x5e0 [sunrpc]
[<0>] svc_process+0xad/0x100 [sunrpc]
[<0>] nfsd+0xd5/0x190 [nfsd]
[<0>] kthread+0xe6/0x110
[<0>] ret_from_fork+0x1f/0x30

(we've also seen the same with nfsd3). The freeze is very sporadic and we
have no idea what triggers it; it has now happened 4 times on one server and
once on a second. Needless to say, these are production systems, so we have a
window of a few minutes for debugging before people start yelling. We've
thrown everything we could at our test setup, but so far we haven't been able
to reproduce it there.
Any pointers would be highly appreciated.
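
One way to use such a short window is to snapshot the state before the
reboot. The following is only a minimal sketch under stated assumptions, not
a tested diagnostic: the function names and the PROC_ROOT/SYS_ROOT overrides
are hypothetical, reading /proc/<pid>/stack needs root, and wbt_lat_usec only
exists for block devices that have writeback throttling available. It lists
the nfsd threads in D state, dumps their kernel stacks, and prints the
writeback-throttling setting and in-flight I/O counts per block device, since
the trace above ends in wbt_wait.

```shell
#!/bin/bash
# Sketch: snapshot nfsd threads stuck in uninterruptible sleep (state D).
# PROC_ROOT and SYS_ROOT are hypothetical overrides so the functions can be
# pointed at a copied tree; on a live server leave them unset.
PROC_ROOT="${PROC_ROOT:-/proc}"
SYS_ROOT="${SYS_ROOT:-/sys}"

# Print the PID of every task named "nfsd" whose state is D.
list_blocked_nfsd() {
    local dir name state
    for dir in "$PROC_ROOT"/[0-9]*; do
        [ -r "$dir/status" ] || continue
        name=$(awk '$1 == "Name:" {print $2; exit}' "$dir/status")
        state=$(awk '$1 == "State:" {print $2; exit}' "$dir/status")
        if [ "$name" = "nfsd" ] && [ "$state" = "D" ]; then
            echo "${dir##*/}"
        fi
    done
}

# Dump the kernel stack of each blocked nfsd thread.
dump_blocked_nfsd() {
    local pid
    for pid in $(list_blocked_nfsd); do
        echo "=== nfsd pid $pid ==="
        cat "$PROC_ROOT/$pid/stack" 2>/dev/null || echo "(stack not readable)"
    done
}

# Print the writeback-throttling latency target and in-flight I/O per device.
dump_wbt() {
    local q dev rd wr
    for q in "$SYS_ROOT"/block/*/queue/wbt_lat_usec; do
        [ -r "$q" ] || continue
        dev=${q#"$SYS_ROOT"/block/}
        dev=${dev%%/*}
        rd="?"
        wr="?"
        if [ -r "$SYS_ROOT/block/$dev/inflight" ]; then
            read -r rd wr < "$SYS_ROOT/block/$dev/inflight"
        fi
        echo "$dev wbt_lat_usec=$(cat "$q") inflight_read=$rd inflight_write=$wr"
    done
}

# Only produce output when asked to, so the file can also be sourced.
if [ "${1:-}" = "--dump" ]; then
    dump_blocked_nfsd
    dump_wbt
fi
```

Saved under a name of your choosing and run as root with --dump, redirected
to a file, this takes a few seconds and leaves something to compare across
incidents.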


thanks and best regards,
-Christian



cat /etc/os-release 
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"

uname -vr
6.1.0-7-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.20-1 (2023-03-19)

apt list --installed '*nfs*'
libnfsidmap1/testing,now 1:2.6.2-4 amd64 [installed,automatic]
nfs-common/testing,now 1:2.6.2-4 amd64 [installed]
nfs-kernel-server/testing,now 1:2.6.2-4 amd64 [installed]

nfsconf -d
[exportd]
 debug = all
[exportfs]
 debug = all
[general]
 pipefs-directory = /run/rpc_pipefs
[lockd]
 port = 32769
 udp-port = 32769
[mountd]
 debug = all
 manage-gids = True
 port = 892
[nfsd]
 debug = all
 port = 2049
 threads = 48
[nfsdcld]
 debug = all
[nfsdcltrack]
 debug = all
[sm-notify]
 debug = all
 outgoing-port = 846
[statd]
 debug = all
 outgoing-port = 2020
 port = 662



-- 
Dr. Christian Herzog <herzog@xxxxxxxxxxxx>  support: +41 44 633 26 68
Head, IT Services Group, HPT H 8              voice: +41 44 633 39 50
Department of Physics, ETH Zurich           
8093 Zurich, Switzerland                     http://isg.phys.ethz.ch/


