Re: [PATCH] fstests: add test case to make sure btrfs can handle one corrupted device

On 2022/7/27 12:57, Zorro Lang wrote:
On Tue, Jul 26, 2022 at 02:29:48PM +0800, Qu Wenruo wrote:
The new test case will verify that btrfs can handle one corrupted device
without affecting the consistency of the filesystem.

Unlike a missing device, a corrupted device can return garbage to the fs,
thus btrfs has to utilize its data/metadata checksums to verify which
data is correct.

The test case will:

- Create a small fs
   Mostly to speed up the test

- Fill the fs with a regular file

- Use fsstress to create some contents

- Save the fssum for later verification

- Corrupt one device with garbage but keep the primary superblock
   untouched

- Run fssum verification

- Run scrub to fix the fs

- Run scrub again to make sure the fs is fine

Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>
---
  tests/btrfs/261     | 94 +++++++++++++++++++++++++++++++++++++++++++++
  tests/btrfs/261.out |  2 +
  2 files changed, 96 insertions(+)
  create mode 100755 tests/btrfs/261
  create mode 100644 tests/btrfs/261.out

diff --git a/tests/btrfs/261 b/tests/btrfs/261
new file mode 100755
index 00000000..15218e28
--- /dev/null
+++ b/tests/btrfs/261
@@ -0,0 +1,94 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (C) 2022 SUSE Linux Products GmbH. All Rights Reserved.
+#
+# FS QA Test 261
+#
+# Make sure btrfs raid profiles can handle one corrupted device
+# without affecting the consistency of the fs.
+#
+. ./common/preamble
+_begin_fstest raid

Have you thought about adding it to the auto group, or more groups?

Oh, all my bad, I was planning to add it to auto later but forgot.


+
+. ./common/filter
+. ./common/populate
+
+_supported_fs btrfs
+_require_scratch_dev_pool 4
+_require_fssum
+
+prepare_fs()
+{
+	local profile=$1
+
+	# We don't want a large fs, which can take too long to populate.
+	# The extra redirection of stderr avoids the RAID56 warning
+	# message polluting the golden output.
+	_scratch_pool_mkfs -m $profile -d $profile -b 1G >> $seqres.full 2>&1
+	if [ $? -ne 0 ]; then
+		echo "mkfs $mkfs_opts failed"
+		return
+	fi

If mkfs fails, workload() below will keep running; I think that's not what
you want, right?

Right, it should call _fail() instead.
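For reference, the fix boils down to the usual `cmd || _fail "msg"` pattern. A minimal self-contained sketch of why _fail is the right call here (the real helper lives in fstests' common/rc; `false` stands in for the failing _scratch_pool_mkfs, and this `_fail` is a simplified stand-in):

```shell
#!/bin/bash
# Stand-in for the fstests _fail helper: print the message and abort.
_fail() { echo "$*" >&2; exit 1; }

prepare_fs_sketch() {
	# "false" stands in for a failing _scratch_pool_mkfs call; _fail
	# aborts the whole test instead of letting workload() continue.
	false || _fail "mkfs failed"
	echo "filling the fs"	# never reached when mkfs fails
}

# Run in a subshell so the sketch's exit 1 does not kill this shell.
( prepare_fs_sketch )
echo "prepare_fs exited with status $?"
```

With the original echo-and-return, prepare_fs would return 0 from the echo and the caller would carry on mounting a filesystem that was never made.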


+
+	# Disable compression, as compressed read repair is known to have problems
+	_scratch_mount -o compress=no
+
+	# Fill some part of the fs first
+	$XFS_IO_PROG -f -c "pwrite -S 0xfe 0 400M" $SCRATCH_MNT/garbage > /dev/null 2>&1
+
+	# Then use fsstress to generate some extra contents.
+	# Disable setattr related operations, as they may set NODATACOW, which
+	# would not allow us to use btrfs checksums to verify the content.
+	$FSSTRESS_PROG -f setattr=0 -d $SCRATCH_MNT -w -n 3000 > /dev/null 2>&1
+	sync
+
+	# Save the fssum of this fs
+	$FSSUM_PROG -A -f -w $tmp.saved_fssum $SCRATCH_MNT
+	$BTRFS_UTIL_PROG fi show $SCRATCH_MNT >> $seqres.full
+	_scratch_unmount
+}
+
+workload()
+{
+	local target=$(echo $SCRATCH_DEV_POOL | $AWK_PROG '{print $1}')

In common/config, we always set SCRATCH_DEV to the first of $SCRATCH_DEV_POOL,
as below:

         # a btrfs tester will set only SCRATCH_DEV_POOL, we will put first of its dev
         # to SCRATCH_DEV and rest to SCRATCH_DEV_POOL to maintain the backward compatibility
         if [ ! -z "$SCRATCH_DEV_POOL" ]; then
                 if [ ! -z "$SCRATCH_DEV" ]; then
                         echo "common/config: Error: \$SCRATCH_DEV ($SCRATCH_DEV) should be unset when \$SCRATCH_DEV_POOL ($SCRATCH_DEV_POOL) is set"
                         exit 1
                 fi
                 SCRATCH_DEV=`echo $SCRATCH_DEV_POOL | awk '{print $1}'`

Does that help?

Thanks, then I can use SCRATCH_DEV directly.
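The equivalence can be sketched with hypothetical device names: the awk extraction in workload() recomputes exactly what common/config already stored in SCRATCH_DEV.

```shell
#!/bin/bash
# Hypothetical scratch pool, as a btrfs tester would configure it.
SCRATCH_DEV_POOL="/dev/vdb /dev/vdc /dev/vdd /dev/vde"

# What common/config does: promote the first pool device to SCRATCH_DEV.
SCRATCH_DEV=$(echo $SCRATCH_DEV_POOL | awk '{print $1}')

# The "target=" line in workload() recomputes the very same value,
# so it can be replaced with plain $SCRATCH_DEV.
target=$(echo $SCRATCH_DEV_POOL | awk '{print $1}')

echo "SCRATCH_DEV=$SCRATCH_DEV target=$target"
```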


+	local profile=$1
+	local num_devs=$2
+
+	_scratch_dev_pool_get $num_devs
+	echo "=== Testing profile $profile ===" >> $seqres.full
+	rm -f -- $tmp.saved_fssum
+	prepare_fs $profile
+
+	# Corrupt the target device, only keep the superblock.
+	$XFS_IO_PROG -c "pwrite 1M 1023M" $target > /dev/null 2>&1

Do you need "_require_scratch_nocheck" if you corrupt the fs on purpose?

I don't think we need that.

The point here is, we will immediately mount the fs and do a RW scrub,
which should make all devices good.

Thus if fsck after the test case found something wrong, it means the RW
scrub doesn't properly repair the fs.

Thanks for the review,
Qu


But I think it won't check SCRATCH_DEV currently, because you don't even use
_require_scratch, and _require_scratch_dev_pool might not trigger a fsck at
the end of the test.

Hmm... I'd prefer having a "_require_scratch_nocheck", and you can use $SCRATCH_DEV
to replace your "$target".
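If that suggestion is taken, the require section would look roughly like this (a sketch, assuming the usual fstests helpers):

```shell
_supported_fs btrfs
# The test corrupts SCRATCH_DEV on purpose and verifies the repair
# itself via scrub, so skip the framework's post-test fsck.
_require_scratch_nocheck
_require_scratch_dev_pool 4
_require_fssum
```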

+
+	_scratch_mount
+
+	# All content should be fine
+	$FSSUM_PROG -r $tmp.saved_fssum $SCRATCH_MNT > /dev/null
+
+	# Scrub to fix the fs; this is known to report various correctable
+	# errors
+	$BTRFS_UTIL_PROG scrub start -B $SCRATCH_MNT >> $seqres.full 2>&1
+
+	# Make sure above scrub fixed the fs
+	$BTRFS_UTIL_PROG scrub start -Br $SCRATCH_MNT >> $seqres.full
+	if [ $? -ne 0 ]; then
+		echo "scrub failed to fix the fs for profile $profile"
+	fi
+	_scratch_unmount
+	_scratch_dev_pool_put
+}
+
+workload raid1 2
+workload raid1c3 3
+workload raid1c4 4
+workload raid10 4
+workload raid5 3
+workload raid6 4
+
+echo "Silence is golden"
+
+# success, all done
+status=0
+exit
diff --git a/tests/btrfs/261.out b/tests/btrfs/261.out
new file mode 100644
index 00000000..679ddc0f
--- /dev/null
+++ b/tests/btrfs/261.out
@@ -0,0 +1,2 @@
+QA output created by 261
+Silence is golden
--
2.36.1