On Wed, Jul 10, 2024 at 01:46:01PM +0200, Paul Menzel wrote:
> Dear Linux folks,
>
>
> Exporting directories over NFS on a Dell PowerEdge R420 with Linux 5.15.86,
> users noticed intermittent hangs. For example,
>
> df /project/something # on an NFS client
>
> on a different system timed out.
>
> @grele:~$ more /proc/mdstat
> Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4]
> [multipath]
> md3 : active raid6 sdr[0] sdp[11] sdx[10] sdt[9] sdo[8] sdw[7] sds[6]
> sdm[5] sdu[4] sdq[3] sdn[2] sdv[1]
> 156257474560 blocks super 1.2 level 6, 1024k chunk, algorithm 2
                                         ^^^^^^^^^^^^

There's the likely source of your problem - 1MB raid chunk size for a
10+2 RAID6 array. That's a stripe width of 10MB, and that will only
work well with really large sequential streaming IO.

However, the general NFS server IO pattern over time will be small
semi-random IO patterns scattered all over the place. IOWs, the average
NFS server IO pattern is about the worst case IO pattern for a
massively wide RAID6 stripe....

> In that time, we noticed all 64 NFSD processes being in uninterruptible
> sleep and the I/O requests currently in process increasing for the RAID6
> device *md0*
>
> /sys/devices/virtual/block/md0/inflight : 10 921
>
> but with no disk activity according to iostat. There was only “little NFS
> activity” going on as far as we saw. This alternated for around half an
> hour, and then we decreased the NFS processes from 64 to 8.

That has nothing to do with the number of NFSD tasks, I think. They are
all stuck waiting for page cache IO, journal write IO or the inode
locks of the inodes that are blocked on this IO.

The most informative nfsd stack trace is this one:

# # /proc/1414/task/1414: nfsd :
# cat /proc/1414/task/1414/stack
[<0>] submit_bio_wait+0x5b/0x90
[<0>] iomap_read_page_sync+0xaf/0xe0
[<0>] iomap_write_begin+0x3d1/0x5e0
[<0>] iomap_file_buffered_write+0x125/0x250
[<0>] xfs_file_buffered_write+0xc6/0x2d0
[<0>] do_iter_readv_writev+0x14f/0x1b0
[<0>] do_iter_write+0x7b/0x1d0
[<0>] nfsd_vfs_write+0x2f3/0x640 [nfsd]
[<0>] nfsd4_write+0x116/0x210 [nfsd]
[<0>] nfsd4_proc_compound+0x2d1/0x640 [nfsd]
[<0>] nfsd_dispatch+0x150/0x250 [nfsd]
[<0>] svc_process_common+0x440/0x6d0 [sunrpc]
[<0>] svc_process+0xb7/0xf0 [sunrpc]
[<0>] nfsd+0xe8/0x140 [nfsd]
[<0>] kthread+0x124/0x150
[<0>] ret_from_fork+0x1f/0x30

That's a buffered write into the page cache blocking on a read IO to
fill the page, because the NFS client is doing sub-page or unaligned
IO. So there are slow, IO-latency-dependent small RMW write operations
happening at the page cache level. IOWs, you've got an NFS client side
application performing suboptimal unaligned IO patterns, causing the
incoming writes to take the slow path through the page cache (i.e. a
RMW cycle).

When these get flushed from the cache by the NFS commit operation:

# # /proc/1413/task/1413: nfsd :
# cat /proc/1413/task/1413/stack
[<0>] wait_on_page_bit_common+0xfa/0x3b0
[<0>] wait_on_page_writeback+0x2a/0x80
[<0>] __filemap_fdatawait_range+0x81/0xf0
[<0>] file_write_and_wait_range+0xdf/0x100
[<0>] xfs_file_fsync+0x63/0x250
[<0>] nfsd_commit+0xd8/0x180 [nfsd]
[<0>] nfsd4_proc_compound+0x2d1/0x640 [nfsd]
[<0>] nfsd_dispatch+0x150/0x250 [nfsd]
[<0>] svc_process_common+0x440/0x6d0 [sunrpc]
[<0>] svc_process+0xb7/0xf0 [sunrpc]
[<0>] nfsd+0xe8/0x140 [nfsd]
[<0>] kthread+0x124/0x150
[<0>] ret_from_fork+0x1f/0x30

writeback then stalls waiting for the underlying MD device to flush the
small IO to the RAID6 storage.
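For a rough sense of what that flush costs at the MD level, here's a
back-of-the-envelope sketch. It is illustrative only - it assumes a 4kB
block per member device and ignores the batching the stripe cache can
do when dirty blocks happen to be adjacent:

#!/usr/bin/env python3
# Illustrative numbers only: the disk IO needed to persist one small
# write on a 10+2 RAID6 array with a 1MB chunk. The real md stripe
# cache can batch adjacent dirty blocks, so this is the unamortised
# worst case.

KiB = 1024
CHUNK = 1024 * KiB        # 1MB chunk size, from the mdstat output above
DATA_DISKS = 10           # 10+2 RAID6
BLOCK = 4 * KiB           # one dirty page from a sub-page NFS write

print(f"full stripe width: {DATA_DISKS * CHUNK // (1024 * KiB)} MB")

# Two classic ways to update P/Q for a partial-stripe write; which one
# gets used varies, but neither is cheap:
#
# 1. read-modify-write: read old data + old P + old Q,
#    then write new data + new P + new Q
rmw_io = (3 + 3) * BLOCK
# 2. reconstruct-write: read the 9 untouched data blocks,
#    then write new data + new P + new Q
rcw_io = ((DATA_DISKS - 1) + 3) * BLOCK

for name, total in (("read-modify-write", rmw_io),
                    ("reconstruct-write", rcw_io)):
    print(f"{name}: {total // KiB}kB of disk IO ({total // BLOCK} IOs) "
          f"to make a single {BLOCK // KiB}kB write durable")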
Flushing these small IOs causes massive write amplification at the
RAID6 level, as it requires a RMW of the RAID6 stripe to recalculate
the parity blocks with the new changed data in them, and that then gets
forced to disk because the NFSD is asking for the data being written to
be durable. This is basically worst case behaviour for small write IO,
both in terms of latency and write amplification.

> We captured some more data from sysfs [1]. md0 is the busy device:

# # /proc/855/task/855: md0_raid6 :
# cat /proc/855/task/855/stack
[<0>] blk_mq_get_tag+0x11d/0x2c0
[<0>] __blk_mq_alloc_request+0xe1/0x120
[<0>] blk_mq_submit_bio+0x141/0x530
[<0>] submit_bio_noacct+0x26b/0x2b0
[<0>] ops_run_io+0x7e2/0xcf0
[<0>] handle_stripe+0xacb/0x2100
[<0>] handle_active_stripes.constprop.0+0x3d9/0x580
[<0>] raid5d+0x338/0x5a0
[<0>] md_thread+0xa8/0x160
[<0>] kthread+0x124/0x150
[<0>] ret_from_fork+0x1f/0x30

This is the background stripe cache flushing getting stuck waiting for
tag space on the underlying block devices. This happens because they
have run out of IO queue depth (i.e. they are at capacity and
overloaded).

The raid6-md0 slab indicates:

raid6-md0  125564 125564  4640  1  2 : tunables  8  4  0 : slabdata 125564 125564  0

There are 125k stripe head objects active, which means a *lot* of
partial stripe writes have been done and are still active in memory.
The "inflight" IOs indicate that it's bottlenecked on writeback of
dirty cached stripes.

Oh, there's a CIFS server on this machine, too, and it's having the
same issues:

# # /proc/7624/task/12539: smbd : /usr/local/samba4/sbin/smbd --configfile=/etc/samba4/smb.conf-grele --daemon
# cat /proc/7624/task/12539/stack
[<0>] md_bitmap_startwrite+0x14a/0x1c0
[<0>] add_stripe_bio+0x3f6/0x6d0
[<0>] raid5_make_request+0x1dd/0xbb0
[<0>] md_handle_request+0x11f/0x1b0
[<0>] md_submit_bio+0x66/0xb0
[<0>] __submit_bio+0x162/0x200
[<0>] submit_bio_noacct+0xa8/0x2b0
[<0>] iomap_do_writepage+0x382/0x800
[<0>] write_cache_pages+0x18f/0x3f0
[<0>] iomap_writepages+0x1c/0x40
[<0>] xfs_vm_writepages+0x71/0xa0
[<0>] do_writepages+0xc0/0x1f0
[<0>] filemap_fdatawrite_wbc+0x78/0xc0
[<0>] file_write_and_wait_range+0x9c/0x100
[<0>] xfs_file_fsync+0x63/0x250
[<0>] __x64_sys_fsync+0x34/0x60
[<0>] do_syscall_64+0x40/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x61/0xcb

Yup, that looks to be a partial stripe write being started and MD
having to update the dirty bitmap, and it's probably blocking because
the dirty bitmap is full. i.e. it is stuck waiting for stripe writeback
to complete and, as per the md0_raid6 stack above, that writeback is
waiting on busy devices to complete IO.

> Of course it’s not reproducible, but any insight how to debug this next
> time is much welcomed.

Probably not a lot you can do, short of reconfiguring your RAID6
storage devices to handle small IOs better. However, in general, RAID6
/always sucks/ for small IOs, and the only way to fix this problem is
to use high performance SSDs to give you a massive excess of write
bandwidth to burn on write amplification....

You could also try to find the NFS/CIFS clients that are doing
poor/small IO patterns and get them to issue better IO patterns, but
that might not be fixable.

tl;dr: this isn't a bug in kernel code, this is a result of a worst
case workload for a given storage configuration.

-Dave.

-- 
Dave Chinner
david@xxxxxxxxxxxxx
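Going back to the first nfsd trace for a moment: the reason
iomap_read_page_sync() shows up there at all is the sub-page alignment
of the incoming writes. A toy model of that decision - plain Python,
assuming 4kB pages and ignoring the page-already-uptodate and hole
cases the real code handles - looks like this:

#!/usr/bin/env python3
# Toy model only: which 4kB pages of a buffered write need to be read
# from disk before they can be dirtied. The real iomap code also skips
# the read when the page is already uptodate or the range is a hole.

PAGE_SIZE = 4096

def pages_needing_read_fill(offset, count):
    """File offsets of the pages that must be read before writing."""
    need_read = []
    end = offset + count
    for pgno in range(offset // PAGE_SIZE, (end - 1) // PAGE_SIZE + 1):
        pg_start = pgno * PAGE_SIZE
        # Only a write that covers the whole page avoids the read-fill.
        if offset > pg_start or end < pg_start + PAGE_SIZE:
            need_read.append(pg_start)
    return need_read

# 4kB written at an unaligned offset: both pages it touches need a
# read-fill first, so the nfsd blocks waiting on read IO before it can
# even copy the data in.
print(pages_needing_read_fill(offset=6144, count=4096))   # [4096, 8192]
# The same 4kB written page-aligned: no read, no stall.
print(pages_needing_read_fill(offset=8192, count=4096))   # []

If the clients issued page-sized, page-aligned writes, that read-fill
path (and the stall in submit_bio_wait()) wouldn't be hit at all.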