Re: [PATCH v3] fstests: btrfs: zoned: verify RAID conversion with write pointer mismatch

Johannes Thumshirn <Johannes.Thumshirn@xxxxxxx> · Wed, 19 Mar 2025 11:03:13 +0000

On 19.03.25 02:18, Naohiro Aota wrote:
> On Tue Mar 18, 2025 at 10:17 PM JST, Johannes Thumshirn wrote:
>> From: Johannes Thumshirn <johannes.thumshirn@xxxxxxx>
>>
>> Recently we had a bug report about a kernel crash that happened when the
>> user was converting a filesystem to use RAID1 for metadata, but for some
>> reason the device's write pointers got out of sync.
>>
>> Test this scenario by manually injecting de-synchronized write pointer
>> positions and then running conversion to a metadata RAID1 filesystem.
>>
>> In the testcase also repair the broken filesystem and check if both system
>> and metadata block groups are back to the default 'DUP' profile
>> afterwards.
>>
>> Link: https://lore.kernel.org/linux-btrfs/CAB_b4sBhDe3tscz=duVyhc9hNE+gu=B8CrgLO152uMyanR8BEA@xxxxxxxxxxxxxx/
>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@xxxxxxx>
>>
>> ---
>> Changes to v2:
>> - Filter SCRATCH_MNT in golden output
>> Changes to v1:
>> - Add test description
>> - Don't redirect stderr to $seqres.full
>> - Use xfs_io instead of dd
>> - Use $SCRATCH_MNT instead of hardcoded mount path
>> - Check that 1st balance command actually fails as it's supposed to
>> ---
>>   tests/btrfs/329     | 62 +++++++++++++++++++++++++++++++++++++++++++++
>>   tests/btrfs/329.out |  7 +++++
>>   2 files changed, 69 insertions(+)
>>   create mode 100755 tests/btrfs/329
>>   create mode 100644 tests/btrfs/329.out
>>
>> diff --git a/tests/btrfs/329 b/tests/btrfs/329
>> new file mode 100755
>> index 000000000000..5496866ac325
>> --- /dev/null
>> +++ b/tests/btrfs/329
>> @@ -0,0 +1,62 @@
>> +#! /bin/bash
>> +# SPDX-License-Identifier: GPL-2.0
>> +# Copyright (c) 2025 Western Digital Corporation.  All Rights Reserved.
>> +#
>> +# FS QA Test 329
>> +#
>> +# Regression test for a kernel crash when converting a zoned BTRFS from
>> +# metadata DUP to RAID1 and one of the devices has a non 0 write pointer
>> +# position in the target zone.
>> +#
>> +. ./common/preamble
>> +_begin_fstest zone quick volume
>> +
>> +. ./common/filter
>> +
>> +_fixed_by_kernel_commit XXXXXXXXXXXX \
>> +	"btrfs: zoned: return EIO on RAID1 block group write pointer mismatch"
>> +
>> +_require_scratch_dev_pool 2
>> +declare -a devs="( $SCRATCH_DEV_POOL )"
>> +_require_zoned_device ${devs[0]}
>> +_require_zoned_device ${devs[1]}
>> +_require_command "$BLKZONE_PROG" blkzone
>> +
>> +_scratch_mkfs >> $seqres.full 2>&1 || _fail "mkfs failed"
>> +_scratch_mount
>> +
>> +# Write some data to the FS to dirty it
>> +$XFS_IO_PROG -fc "pwrite 0 128M" $SCRATCH_MNT/test | _filter_xfs_io
>> +
>> +# Add device two to the FS
>> +$BTRFS_UTIL_PROG device add ${devs[1]} $SCRATCH_MNT >> $seqres.full
>> +
>> +# Move write pointers of all empty zones by 4k to simulate write pointer
>> +# mismatch.
>> +zones=$($BLKZONE_PROG report ${devs[1]} | $AWK_PROG '/em/ { print $2 }' |\
>> +	sed 's/,//')
> 
> Can we limit the number of zones to work with, in case we run this test
> on a huge device? I guess 2*(128M/4M)=64 would be enough.
> 

I.e. something like the following:

diff --git a/tests/btrfs/329 b/tests/btrfs/329
index 5496866ac325..24d34852db1f 100755
--- a/tests/btrfs/329
+++ b/tests/btrfs/329
@@ -33,8 +33,14 @@ $BTRFS_UTIL_PROG device add ${devs[1]} $SCRATCH_MNT >> $seqres.full

  # Move write pointers of all empty zones by 4k to simulate write pointer
  # mismatch.
+
+nzones=$($BLKZONE_PROG report ${devs[1]} | wc -l)
+if [ $nzones -gt 64 ]; then
+       nzones=64
+fi
+
  zones=$($BLKZONE_PROG report ${devs[1]} | $AWK_PROG '/em/ { print $2 }' |\
-       sed 's/,//')
+       sed 's/,//' | head -n $nzones)
  for zone in $zones;
  do

Yup, this still triggers the bug on an unpatched kernel in my setup, and
the kernel fix still resolves it.

So yes, I'll update the testcase (I assume Filipe's R-b still stands with this change).
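
For reference, the cap could also be folded into the awk filter itself,
since 'head -n 64' on a shorter list is harmless and the separate 'wc -l'
pass is not strictly needed. A minimal sketch, assuming the report format
the test already relies on (start offset in the second field with a
trailing comma, "em" in the zone condition of empty zones).
first_empty_zones is a hypothetical helper, not something that exists in
fstests:

```shell
# Hypothetical helper (not part of fstests): read `blkzone report`-style
# text on stdin and print the start offsets of at most $1 empty zones
# (default 64, i.e. 2 * (128M / 4M) as suggested in the review).
first_empty_zones() {
	local limit="${1:-64}"

	awk -v limit="$limit" '
		/em/ {
			sub(/,/, "", $2)	# strip the trailing comma from the start field
			print $2
			if (++n >= limit)
				exit		# stop early on huge devices
		}'
}
```

In the test this would be used as
zones=$($BLKZONE_PROG report ${devs[1]} | first_empty_zones 64), with the
existing for loop over $zones left unchanged.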