Re: How to debug intermittent increasing md/inflight but no disk activity?

On Wed, Jul 10, 2024 at 01:46:01PM +0200, Paul Menzel wrote:
> Dear Linux folks,
> 
> 
> Exporting directories over NFS on a Dell PowerEdge R420 with Linux 5.15.86,
> users noticed intermittent hangs. For example,
> 
>     df /project/something # on an NFS client
> 
> on a different system timed out.
> 
>     @grele:~$ more /proc/mdstat
>     Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4]
> [multipath]
>     md3 : active raid6 sdr[0] sdp[11] sdx[10] sdt[9] sdo[8] sdw[7] sds[6]
> sdm[5] sdu[4] sdq[3] sdn[2] sdv[1]
>           156257474560 blocks super 1.2 level 6, 1024k chunk, algorithm 2
                                                   ^^^^^^^^^^^^

There's the likely source of your problem - a 1MB RAID chunk size for
a 10+2 RAID6 array. That's a stripe width of 10MB, and that will
only work well with really large sequential streaming IO. However,
the general NFS server IO pattern over time will be small
semi-random IO scattered all over the place.

IOWs, the average NFS server IO pattern is about the worst case IO
pattern for a massively wide RAID6 stripe....
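
FWIW, if you want to double-check the geometry of the array that was
actually busy (md0, below), the chunk size and device count are
visible straight from sysfs or mdadm, e.g.:

# cat /sys/block/md0/md/chunk_size		# chunk size in bytes
# cat /sys/block/md0/md/raid_disks		# total member devices
# mdadm --detail /dev/md0 | grep -E 'Level|Chunk|Raid Devices'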

> In that time, we noticed all 64 NFSD processes being in uninterruptible
> sleep and the I/O requests currently in process increasing for the RAID6
> device *md0*
> 
>     /sys/devices/virtual/block/md0/inflight : 10 921
> 
> but with no disk activity according to iostat. There was only “little NFS
> activity” going on as far as we saw. This alternated for around half an hour,
> and then we decreased the NFS processes from 64 to 8.

This has nothing to do with the number of NFSD tasks, I think. They
are all stuck waiting for page cache IO, journal write IO, or the
inode locks of the inodes that are blocked on this IO.
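
If it happens again, a quick way to snapshot all of them in one go is
something like:

# for p in $(pgrep -x nfsd); do echo "== nfsd $p =="; cat /proc/$p/stack; done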

The most informative nfsd stack trace is this one:

# # /proc/1414/task/1414: nfsd :
# cat /proc/1414/task/1414/stack

[<0>] submit_bio_wait+0x5b/0x90
[<0>] iomap_read_page_sync+0xaf/0xe0
[<0>] iomap_write_begin+0x3d1/0x5e0
[<0>] iomap_file_buffered_write+0x125/0x250
[<0>] xfs_file_buffered_write+0xc6/0x2d0
[<0>] do_iter_readv_writev+0x14f/0x1b0
[<0>] do_iter_write+0x7b/0x1d0
[<0>] nfsd_vfs_write+0x2f3/0x640 [nfsd]
[<0>] nfsd4_write+0x116/0x210 [nfsd]
[<0>] nfsd4_proc_compound+0x2d1/0x640 [nfsd]
[<0>] nfsd_dispatch+0x150/0x250 [nfsd]
[<0>] svc_process_common+0x440/0x6d0 [sunrpc]
[<0>] svc_process+0xb7/0xf0 [sunrpc]
[<0>] nfsd+0xe8/0x140 [nfsd]
[<0>] kthread+0x124/0x150
[<0>] ret_from_fork+0x1f/0x30

That's a buffered write into the page cache blocking on a read IO
to fill the page because the NFS client is doing sub-page or
unaligned IO. So there are slow, IO-latency-dependent small RMW write
operations happening at the page cache level.

IOWs, you've got an NFS client side application performing suboptimal
unaligned IO patterns, causing the incoming writes to take the slow
path through the page cache (i.e. an RMW cycle).
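
That's usually easy to confirm from the client side - nfsiostat (from
nfs-utils) reports the average write size per mount, and if the
application really is doing tiny writes, the write kB/op will be small:

$ nfsiostat 5 /project/something		# on the NFS client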

When these get flushed from the cache by the nfs commit operation:

# # /proc/1413/task/1413: nfsd :
# cat /proc/1413/task/1413/stack

[<0>] wait_on_page_bit_common+0xfa/0x3b0
[<0>] wait_on_page_writeback+0x2a/0x80
[<0>] __filemap_fdatawait_range+0x81/0xf0
[<0>] file_write_and_wait_range+0xdf/0x100
[<0>] xfs_file_fsync+0x63/0x250
[<0>] nfsd_commit+0xd8/0x180 [nfsd]
[<0>] nfsd4_proc_compound+0x2d1/0x640 [nfsd]
[<0>] nfsd_dispatch+0x150/0x250 [nfsd]
[<0>] svc_process_common+0x440/0x6d0 [sunrpc]
[<0>] svc_process+0xb7/0xf0 [sunrpc]
[<0>] nfsd+0xe8/0x140 [nfsd]
[<0>] kthread+0x124/0x150
[<0>] ret_from_fork+0x1f/0x30

writeback then stalls waiting for the underlying MD device to flush
the small IOs to the RAID6 storage. This causes massive write
amplification at the RAID6 level, as it requires an RMW of the RAID6
stripe to recalculate the parity blocks with the newly changed data in
it, and that then gets forced to disk because the NFSD is asking for
the data being written to be made durable.

This is basically worst case behaviour for small write IO both in
terms of latency and write amplification.
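
Next time it happens, the amplification should be directly visible by
comparing what is being written to md0 with what the member disks
underneath it are actually doing, e.g.:

# iostat -x md0 $(ls /sys/block/md0/slaves) 5

The member wkB/s adding up to many times what md0 itself is being
asked to write, with small average request sizes, would be the
signature of partial-stripe RMW.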

> We captured some more data from sysfs [1].

md0 is the busy device:

# # /proc/855/task/855: md0_raid6 :
# cat /proc/855/task/855/stack

[<0>] blk_mq_get_tag+0x11d/0x2c0
[<0>] __blk_mq_alloc_request+0xe1/0x120
[<0>] blk_mq_submit_bio+0x141/0x530
[<0>] submit_bio_noacct+0x26b/0x2b0
[<0>] ops_run_io+0x7e2/0xcf0
[<0>] handle_stripe+0xacb/0x2100
[<0>] handle_active_stripes.constprop.0+0x3d9/0x580
[<0>] raid5d+0x338/0x5a0
[<0>] md_thread+0xa8/0x160
[<0>] kthread+0x124/0x150
[<0>] ret_from_fork+0x1f/0x30

This is the background stripe cache flushing getting stuck waiting
for tag space on the underlying block devices. This happens because
they have run out of IO queue depth (i.e. are at capacity and
overloaded).
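
The tag space the stripe cache is competing for is visible per member
device, e.g.:

# grep . /sys/block/md0/slaves/*/queue/nr_requests
# grep . /sys/block/md0/slaves/*/inflight

inflight sitting at or near nr_requests across the members would
confirm the devices really are saturated.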

The raid6-md0 slab indicates:

raid6-md0         125564 125564   4640    1    2 : tunables    8    4    0 : slabdata 125564 125564      0

there are 125k stripe head objects active, which means a *lot* of
partial stripe writes have been done and are still active in memory.
The "inflight" IOs indicate that it's bottlenecked on writeback
of dirty cached stripes.
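
The stripe cache behind that slab can be read (and sized) directly:

# cat /sys/block/md0/md/stripe_cache_size	# max stripe heads cached
# cat /sys/block/md0/md/stripe_cache_active	# stripe heads currently in use

stripe_cache_active sitting at the limit while inflight keeps growing
would back up the "bottlenecked on stripe writeback" picture. Bumping
stripe_cache_size buys a bit more buffering at best - it doesn't change
the underlying RMW cost of the small writes.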

Oh, there's a CIFS server on this machine, too, and it's having the
same issues:

# # /proc/7624/task/12539: smbd : /usr/local/samba4/sbin/smbd --configfile=/etc/samba4/smb.conf-grele --daemon
# cat /proc/7624/task/12539/stack

[<0>] md_bitmap_startwrite+0x14a/0x1c0
[<0>] add_stripe_bio+0x3f6/0x6d0
[<0>] raid5_make_request+0x1dd/0xbb0
[<0>] md_handle_request+0x11f/0x1b0
[<0>] md_submit_bio+0x66/0xb0
[<0>] __submit_bio+0x162/0x200
[<0>] submit_bio_noacct+0xa8/0x2b0
[<0>] iomap_do_writepage+0x382/0x800
[<0>] write_cache_pages+0x18f/0x3f0
[<0>] iomap_writepages+0x1c/0x40
[<0>] xfs_vm_writepages+0x71/0xa0
[<0>] do_writepages+0xc0/0x1f0
[<0>] filemap_fdatawrite_wbc+0x78/0xc0
[<0>] file_write_and_wait_range+0x9c/0x100
[<0>] xfs_file_fsync+0x63/0x250
[<0>] __x64_sys_fsync+0x34/0x60
[<0>] do_syscall_64+0x40/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x61/0xcb

Yup, that looks to be a partial stripe write being started and MD
having to update the dirty bitmap, and it's probably blocking because
the dirty bitmap is full. I.e. it is stuck waiting for stripe
writeback to complete and, as per the md0_raid6 stack above, that
writeback is waiting on busy devices to complete IO.
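
The bitmap state is visible from userspace too, if you want to check
it next time, e.g.:

# grep -A3 '^md0 ' /proc/mdstat			# the "bitmap: x/y pages" line
# cat /sys/block/md0/md/bitmap/chunksize
# mdadm --examine-bitmap /dev/sdX		# run against one of md0's members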

> Of course it’s not reproducible, but any insight how to debug this next time
> is much welcomed.

Probably not a lot you can do short of reconfiguring your RAID6
storage devices to handle small IOs better. However, in general,
RAID6 /always sucks/ for small IOs, and the only way to fix this
problem is to use high performance SSDs to give you a massive excess
of write bandwidth to burn on write amplification....

You could also try to find which NFS/CIFS client is doing the
poor/small IO patterns and get it to do better IO, but that might not
be fixable.
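
If you do want to chase that, the active NFSv4 clients are listed
under /proc/fs/nfsd/clients/ on a 5.15 kernel, and a short packet
capture should show the actual write sizes coming in, e.g.:

# cat /proc/fs/nfsd/clients/*/info
# smbstatus -b						# connected SMB clients
# tcpdump -s 0 -c 20000 -w /tmp/nfs.pcap port 2049	# then check WRITE sizes in wireshark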

tl;dr: this isn't a bug in kernel code, this is a result of a worst
case workload for the given storage configuration.

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx



