Re: [PATCH] fstests: add test case to make sure btrfs can handle one corrupted device

Zorro Lang <zlang@xxxxxxxxxx> · Wed, 27 Jul 2022 12:57:36 +0800

On Tue, Jul 26, 2022 at 02:29:48PM +0800, Qu Wenruo wrote:
> The new test case will verify that btrfs can handle one corrupted device
> without affecting the consistency of the filesystem.
> 
> Unlike a missing device, one corrupted device can return garbage to the fs,
> thus btrfs has to utilize its data/metadata checksum to verify which
> data is correct.
> 
> The test case will:
> 
> - Create a small fs
>   Mostly to speedup the test
> 
> - Fill the fs with a regular file
> 
> - Use fsstress to create some contents
> 
> - Save the fssum for later verification
> 
> - Corrupt one device with garbage but keep the primary superblock
>   untouched
> 
> - Run fssum verification
> 
> - Run scrub to fix the fs
> 
> - Run scrub again to make sure the fs is fine
> 
> Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>
> ---
>  tests/btrfs/261     | 94 +++++++++++++++++++++++++++++++++++++++++++++
>  tests/btrfs/261.out |  2 +
>  2 files changed, 96 insertions(+)
>  create mode 100755 tests/btrfs/261
>  create mode 100644 tests/btrfs/261.out
> 
> diff --git a/tests/btrfs/261 b/tests/btrfs/261
> new file mode 100755
> index 00000000..15218e28
> --- /dev/null
> +++ b/tests/btrfs/261
> @@ -0,0 +1,94 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (C) 2022 SUSE Linux Products GmbH. All Rights Reserved.
> +#
> +# FS QA Test 261
> +#
> +# Make sure btrfs raid profiles can handling one corrupted device
> +# without affecting the consistency of the fs.
> +#
> +. ./common/preamble
> +_begin_fstest raid

Do you think about add it into auto group, or more groups?

> +
> +. ./common/filter
> +. ./common/populate
> +
> +_supported_fs btrfs
> +_require_scratch_dev_pool 4
> +_require_fssum
> +
> +prepare_fs()
> +{
> +	local profile=$1
> +
> +	# We don't want too large fs which can take too long to populate
> +	# And the extra redirection of stderr is to avoid the RAID56 warning
> +	# message to polluate the golden output
> +	_scratch_pool_mkfs -m $profile -d $profile -b 1G >> $seqres.full 2>&1
> +	if [ $? -ne 0 ]; then
> +		echo "mkfs $mkfs_opts failed"
> +		return
> +	fi

If mkfs fails, below workload() will keep running, I think it's not what you
want, right?

> +
> +	# Disable compression, as compressed read repair is known to have problems
> +	_scratch_mount -o compress=no
> +
> +	# Fill some part of the fs first
> +	$XFS_IO_PROG -f -c "pwrite -S 0xfe 0 400M" $SCRATCH_MNT/garbage > /dev/null 2>&1
> +
> +	# Then use fsstress to generate some extra contents.
> +	# Disable setattr related operations, as it may set NODATACOW which will
> +	# not allow us to use btrfs checksum to verify the content.
> +	$FSSTRESS_PROG -f setattr=0 -d $SCRATCH_MNT -w -n 3000 > /dev/null 2>&1
> +	sync
> +
> +	# Save the fssum of this fs
> +	$FSSUM_PROG -A -f -w $tmp.saved_fssum $SCRATCH_MNT
> +	$BTRFS_UTIL_PROG fi show $SCRATCH_MNT >> $seqres.full
> +	_scratch_unmount
> +}
> +
> +workload()
> +{
> +	local target=$(echo $SCRATCH_DEV_POOL | $AWK_PROG '{print $1}')

In common/config, we always set SCRATCH_DEV to the first of $SCRATCH_DEV_POOL,
as below:

        # a btrfs tester will set only SCRATCH_DEV_POOL, we will put first of its dev
        # to SCRATCH_DEV and rest to SCRATCH_DEV_POOL to maintain the backward compatibility
        if [ ! -z "$SCRATCH_DEV_POOL" ]; then
                if [ ! -z "$SCRATCH_DEV" ]; then
                        echo "common/config: Error: \$SCRATCH_DEV ($SCRATCH_DEV) should be unset when \$SCRATCH_DEV_POOL ($SCRATCH_DEV_POOL) is set"
                        exit 1
                fi
                SCRATCH_DEV=`echo $SCRATCH_DEV_POOL | awk '{print $1}'`

is that help?

> +	local profile=$1
> +	local num_devs=$2
> +
> +	_scratch_dev_pool_get $num_devs
> +	echo "=== Testing profile $profile ===" >> $seqres.full
> +	rm -f -- $tmp.saved_fssum
> +	prepare_fs $profile
> +
> +	# Corrupt the target device, only keep the superblock.
> +	$XFS_IO_PROG -c "pwrite 1M 1023M" $target > /dev/null 2>&1

Do you need "_require_scratch_nocheck", if you corrupt the fs purposely?

But I think it won't check the SCRATCH_DEV currently, due to you even don't use
_require_scratch, and the _require_scratch_dev_pool might not trigger a fsck at
the end of the testing.

Hmm... I prefer having a "_require_scratch_nocheck", and you can use $SCRATCH_DEV
to replace your "$target".

> +
> +	_scratch_mount
> +
> +	# All content should be fine
> +	$FSSUM_PROG -r $tmp.saved_fssum $SCRATCH_MNT > /dev/null
> +
> +	# Scrub to fix the fs, this is known to report various correctable
> +	# errors
> +	$BTRFS_UTIL_PROG scrub start -B $SCRATCH_MNT >> $seqres.full 2>&1
> +
> +	# Make sure above scrub fixed the fs
> +	$BTRFS_UTIL_PROG scrub start -Br $SCRATCH_MNT >> $seqres.full
> +	if [ $? -ne 0 ]; then
> +		echo "scrub failed to fix the fs for profile $profile"
> +	fi
> +	_scratch_unmount
> +	_scratch_dev_pool_put
> +}
> +
> +workload raid1 2
> +workload raid1c3 3
> +workload raid1c4 4
> +workload raid10 4
> +workload raid5 3
> +workload raid6 4
> +
> +echo "Silence is golden"
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/btrfs/261.out b/tests/btrfs/261.out
> new file mode 100644
> index 00000000..679ddc0f
> --- /dev/null
> +++ b/tests/btrfs/261.out
> @@ -0,0 +1,2 @@
> +QA output created by 261
> +Silence is golden
> -- 
> 2.36.1
>