Re: [PATCH v2] fstests: generic/366: add a new test case to verify if certain fio load will hang the filesystem

Zorro Lang <zlang@xxxxxxxxxx> · Tue, 29 Oct 2024 13:56:39 +0800

On Tue, Oct 29, 2024 at 10:05:25AM +1030, Qu Wenruo wrote:
> [BUG]
> During the development to make btrfs pass generic/563 (which needs to
> make btrfs to support partial folios), generic/095 causes hangs
> during tests.
> 
> The call trace for the hanging process looks like this:
> 
>   __switch_to+0xf8/0x168
>   __schedule+0x328/0x8a8
>   schedule+0x54/0x140
>   io_schedule+0x44/0x68
>   folio_wait_bit_common+0x198/0x3f8
>   __folio_lock+0x24/0x40
>   extent_write_cache_pages+0x2e0/0x4c0 [btrfs]
>   btrfs_writepages+0x94/0x158 [btrfs]
>   do_writepages+0x74/0x190
>   filemap_fdatawrite_wbc+0x88/0xc8
>   __filemap_fdatawrite_range+0x6c/0xa8
>   filemap_fdatawrite_range+0x1c/0x30
>   btrfs_start_ordered_extent+0x264/0x2e0 [btrfs]
>   btrfs_lock_and_flush_ordered_range+0x8c/0x160 [btrfs]
>   __get_extent_map+0xa0/0x220 [btrfs]
>   btrfs_do_readpage+0x1bc/0x5d8 [btrfs]
>   btrfs_read_folio+0x50/0xa0 [btrfs]
>   filemap_read_folio+0x54/0x110
>   filemap_update_page+0x2e0/0x3b8
>   filemap_get_pages+0x228/0x4d8
>   filemap_read+0x11c/0x3b8
>   btrfs_file_read_iter+0x74/0x90 [btrfs]
>   new_sync_read+0xd0/0x1d0
>   vfs_read+0x1a0/0x1f0
> 
> [CAUSE]
> The root cause is a btrfs specific behavior that during a folio read, we
> can trigger writeback of the same folio, which will try to lock the same
> folio already locked by the read process.
> 
> The fix is already sent to the mailing list:
> https://lore.kernel.org/linux-btrfs/62bf73ada7be2888d45a787c2b6fd252103a5d25.1729725088.git.wqu@xxxxxxxx/
> 
> This problem can only happen if all the following conditions are met:
> 
> - The sector size of btrfs is smaller than page size
>   To have partial uptodate folios.
> 
> - Btrfs won't read the full folio if buffered write is block aligned
>   This is done by the not yet merged patch:
>   https://lore.kernel.org/linux-btrfs/ac2639ec4e9ac176d33e95ef7ecf008fa6be5461.1727833878.git.wqu@xxxxxxxx/
> 
> [TEST CASE]
> During the debugging of that generic/095 hang, I extracted a minimal
> reproducer which is much smaller and faster, although it still requires
> several runs to trigger a hang.
> 
> The test case will run the fio workload 32 times by default, which is
> more than enough to trigger the hang.
> 
> Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>
> ---
> Changelog:
> v2:
> - Fix the duplicated _fixed_by_kernel_commit line
> - Fix a typo in the commit message
> - Update the comments to show the workload and how it hangs btrfs
> ---
>  tests/generic/366     | 106 ++++++++++++++++++++++++++++++++++++++++++
>  tests/generic/366.out |   2 +
>  2 files changed, 108 insertions(+)
>  create mode 100755 tests/generic/366
>  create mode 100644 tests/generic/366.out
> 
> diff --git a/tests/generic/366 b/tests/generic/366
> new file mode 100755
> index 00000000..9d31c1c8
> --- /dev/null
> +++ b/tests/generic/366
> @@ -0,0 +1,106 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (C) 2024 SUSE Linux Products GmbH. All Rights Reserved.
> +#
> +# FS QA Test 366
> +#
> +# Test if mixed direct read, direct write and buffered write on the same file will
> +# hang the filesystem.
> +#
> +# This is exposed by an incoming btrfs feature, which allows a folio to be
> +# partial uptodate if the buffered write range is block aligned but not yet
> +# full folio aligned.
> +#
> +# Such behavior makes btrfs to hang reliably under generic/095.
> +# This is the extracted minimal reproducer for 4k block size and 64K page size.
> +#
> +. ./common/preamble
> +_begin_fstest auto quick
                            ^^
                            rw

I think we can keep the "rw" group as g/095 does.

> +
> +. ./common/filter
> +
> +_require_scratch
> +_require_odirect
> +_require_aio
> +
> +_fixed_by_kernel_commit xxxxxxxxxxxx \
> +	"btrfs: avoid deadlock when reading a partial uptodate folio"
> +
> +iterations=$((32 * LOAD_FACTOR))
> +
> +fio_config=$tmp.fio
> +fio_out=$tmp.fio.out
> +blksz=`$here/src/min_dio_alignment $SCRATCH_MNT $SCRATCH_DEV`
> +cat >$fio_config <<EOF
> +[global]
> +bs=8k
> +iodepth=1
> +randrepeat=1
> +size=256k
> +directory=$SCRATCH_MNT
> +numjobs=1
> +[job1]
> +ioengine=sync
> +bs=512
> +direct=1
> +rw=randread
> +filename=file1
> +[job2]
> +ioengine=libaio
> +rw=randwrite
> +direct=1
> +filename=file1
> +[job3]
> +ioengine=posixaio
> +rw=randwrite
> +filename=file1
> +EOF
> +_require_fio $fio_config
> +
> +for (( i = 0; i < $iterations; i++)); do
> +	_scratch_mkfs >>$seqres.full 2>&1
> +	_scratch_mount
> +	# There's a known EIO failure to report collisions between directio and buffered
> +	# writes to userspace, refer to upstream linux 5a9d929d6e13. So ignore EIO error
> +	# at here.
> +	#
> +	# And for btrfs if sector size < page size, if we have a partial
> +	# uptodate folio caused by a buffered write, e.g:
> +	#
> +	#    0          16K         32K          48K         64K
> +	#    |                                   |///////////|
> +	#					     \- sector Uptodate|Dirty
> +	#
> +	# Then writeback happens and finished, but btrfs' ordered extent not
> +	# yet finished.
> +	# In that case, the folio can be released from the page cache (since
> +	# the folio is not under IO/lock).
> +	#
> +	# Then new buffered writes into the folio happened, re-dirty the folio:
> +	#   0          16K         32K          48K         64K
> +	#   |//////////|                        |///////////|
> +	#      \- sector Uptodate|Dirty              \- No sector flags
> +	#                                               extent map PINNED
> +	#                                               OE still here
> +	#
> +	# Now read is triggered on that folio.
> +	# Btrfs will need to wait for any existing ordered extents in the folio range,
> +	# that wait will also trigger writeback if the folio is dirty.
> +	# That writeback will happen for range [48K, 64K), but since the whole folio
> +	# is locked for read, writeback will also try to lock the same folio, causing
> +	# a deadlock.
> +	$FIO_PROG $fio_config --ignore_error=,EIO --output=$fio_out

Looks like this test doesn't mix DIO and buffered IO, so this EIO ignoring might not be
needed.

> +	# umount before checking dmesg in case umount triggers any WARNING or Oops
> +	_scratch_unmount
> +
> +	_check_dmesg _filter_aiodio_dmesg

This test removed mmap test part, so this dmesg filter might not be needed either ?
If so, don't need to import above "./common/filter" either.

Others looks good to me.

Thanks,
Zorro

> +
> +	echo "=== fio $i/$iterations ===" >> $seqres.full
> +	cat $fio_out >> $seqres.full
> +done
> +
> +echo "Silence is golden"
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/generic/366.out b/tests/generic/366.out
> new file mode 100644
> index 00000000..1fe90e06
> --- /dev/null
> +++ b/tests/generic/366.out
> @@ -0,0 +1,2 @@
> +QA output created by 366
> +Silence is golden
> -- 
> 2.46.0
> 
>