On Tue, Oct 29, 2024 at 10:05:25AM +1030, Qu Wenruo wrote: > [BUG] > During the development to make btrfs pass generic/563 (which needs to > make btrfs to support partial folios), generic/095 causes hangs > during tests. > > The call trace for the hanging process looks like this: > > __switch_to+0xf8/0x168 > __schedule+0x328/0x8a8 > schedule+0x54/0x140 > io_schedule+0x44/0x68 > folio_wait_bit_common+0x198/0x3f8 > __folio_lock+0x24/0x40 > extent_write_cache_pages+0x2e0/0x4c0 [btrfs] > btrfs_writepages+0x94/0x158 [btrfs] > do_writepages+0x74/0x190 > filemap_fdatawrite_wbc+0x88/0xc8 > __filemap_fdatawrite_range+0x6c/0xa8 > filemap_fdatawrite_range+0x1c/0x30 > btrfs_start_ordered_extent+0x264/0x2e0 [btrfs] > btrfs_lock_and_flush_ordered_range+0x8c/0x160 [btrfs] > __get_extent_map+0xa0/0x220 [btrfs] > btrfs_do_readpage+0x1bc/0x5d8 [btrfs] > btrfs_read_folio+0x50/0xa0 [btrfs] > filemap_read_folio+0x54/0x110 > filemap_update_page+0x2e0/0x3b8 > filemap_get_pages+0x228/0x4d8 > filemap_read+0x11c/0x3b8 > btrfs_file_read_iter+0x74/0x90 [btrfs] > new_sync_read+0xd0/0x1d0 > vfs_read+0x1a0/0x1f0 > > [CAUSE] > The root cause is a btrfs specific behavior that during a folio read, we > can trigger writeback of the same folio, which will try to lock the same > folio already locked by the read process. > > The fix is already sent to the mailing list: > https://lore.kernel.org/linux-btrfs/62bf73ada7be2888d45a787c2b6fd252103a5d25.1729725088.git.wqu@xxxxxxxx/ > > This problem can only happen if all the following conditions are met: > > - The sector size of btrfs is smaller than page size > To have partial uptodate folios. > > - Btrfs won't read the full folio if buffered write is block aligned > This is done by the not yet merged patch: > https://lore.kernel.org/linux-btrfs/ac2639ec4e9ac176d33e95ef7ecf008fa6be5461.1727833878.git.wqu@xxxxxxxx/ > > [TEST CASE] > During the debugging of that generic/095 hang, I extracted a minimal > reproducer which is much smaller and faster, although it still requires > several runs to trigger a hang. > > The test case will run the fio workload 32 times by default, which is > more than enough to trigger the hang. > > Signed-off-by: Qu Wenruo <wqu@xxxxxxxx> > --- > Changelog: > v2: > - Fix the duplicated _fixed_by_kernel_commit line > - Fix a typo in the commit message > - Update the comments to show the workload and how it hangs btrfs > --- > tests/generic/366 | 106 ++++++++++++++++++++++++++++++++++++++++++ > tests/generic/366.out | 2 + > 2 files changed, 108 insertions(+) > create mode 100755 tests/generic/366 > create mode 100644 tests/generic/366.out > > diff --git a/tests/generic/366 b/tests/generic/366 > new file mode 100755 > index 00000000..9d31c1c8 > --- /dev/null > +++ b/tests/generic/366 > @@ -0,0 +1,106 @@ > +#! /bin/bash > +# SPDX-License-Identifier: GPL-2.0 > +# Copyright (C) 2024 SUSE Linux Products GmbH. All Rights Reserved. > +# > +# FS QA Test 366 > +# > +# Test if mixed direct read, direct write and buffered write on the same file will > +# hang the filesystem. > +# > +# This is exposed by an incoming btrfs feature, which allows a folio to be > +# partial uptodate if the buffered write range is block aligned but not yet > +# full folio aligned. > +# > +# Such behavior makes btrfs to hang reliably under generic/095. > +# This is the extracted minimal reproducer for 4k block size and 64K page size. > +# > +. ./common/preamble > +_begin_fstest auto quick ^^ rw I think we can keep the "rw" group as g/095 does. > + > +. ./common/filter > + > +_require_scratch > +_require_odirect > +_require_aio > + > +_fixed_by_kernel_commit xxxxxxxxxxxx \ > + "btrfs: avoid deadlock when reading a partial uptodate folio" > + > +iterations=$((32 * LOAD_FACTOR)) > + > +fio_config=$tmp.fio > +fio_out=$tmp.fio.out > +blksz=`$here/src/min_dio_alignment $SCRATCH_MNT $SCRATCH_DEV` > +cat >$fio_config <<EOF > +[global] > +bs=8k > +iodepth=1 > +randrepeat=1 > +size=256k > +directory=$SCRATCH_MNT > +numjobs=1 > +[job1] > +ioengine=sync > +bs=512 > +direct=1 > +rw=randread > +filename=file1 > +[job2] > +ioengine=libaio > +rw=randwrite > +direct=1 > +filename=file1 > +[job3] > +ioengine=posixaio > +rw=randwrite > +filename=file1 > +EOF > +_require_fio $fio_config > + > +for (( i = 0; i < $iterations; i++)); do > + _scratch_mkfs >>$seqres.full 2>&1 > + _scratch_mount > + # There's a known EIO failure to report collisions between directio and buffered > + # writes to userspace, refer to upstream linux 5a9d929d6e13. So ignore EIO error > + # at here. > + # > + # And for btrfs if sector size < page size, if we have a partial > + # uptodate folio caused by a buffered write, e.g: > + # > + # 0 16K 32K 48K 64K > + # | |///////////| > + # \- sector Uptodate|Dirty > + # > + # Then writeback happens and finished, but btrfs' ordered extent not > + # yet finished. > + # In that case, the folio can be released from the page cache (since > + # the folio is not under IO/lock). > + # > + # Then new buffered writes into the folio happened, re-dirty the folio: > + # 0 16K 32K 48K 64K > + # |//////////| |///////////| > + # \- sector Uptodate|Dirty \- No sector flags > + # extent map PINNED > + # OE still here > + # > + # Now read is triggered on that folio. > + # Btrfs will need to wait for any existing ordered extents in the folio range, > + # that wait will also trigger writeback if the folio is dirty. > + # That writeback will happen for range [48K, 64K), but since the whole folio > + # is locked for read, writeback will also try to lock the same folio, causing > + # a deadlock. > + $FIO_PROG $fio_config --ignore_error=,EIO --output=$fio_out Looks like this test doesn't mix DIO and buffered IO, so this EIO ignoring might not be needed. > + # umount before checking dmesg in case umount triggers any WARNING or Oops > + _scratch_unmount > + > + _check_dmesg _filter_aiodio_dmesg This test removed mmap test part, so this dmesg filter might not be needed either ? If so, don't need to import above "./common/filter" either. Others looks good to me. Thanks, Zorro > + > + echo "=== fio $i/$iterations ===" >> $seqres.full > + cat $fio_out >> $seqres.full > +done > + > +echo "Silence is golden" > + > +# success, all done > +status=0 > +exit > diff --git a/tests/generic/366.out b/tests/generic/366.out > new file mode 100644 > index 00000000..1fe90e06 > --- /dev/null > +++ b/tests/generic/366.out > @@ -0,0 +1,2 @@ > +QA output created by 366 > +Silence is golden > -- > 2.46.0 > >