Re: [PATCH pre-6.4] btrfs: dev-replace: error out if we have unrepaired metadata error during

Qu Wenruo <quwenruo.btrfs@xxxxxxx> · Mon, 10 Apr 2023 14:42:54 +0800

On 2023/4/10 12:20, Anand Jain wrote:
On 10/4/23 10:22, Qu Wenruo wrote:
This is for pre-6.4 kernels, as scrub code goes through a huge rework.

[BUG]
Even before the scrub rework, if we have some corrupted metadata failed
to be repaired during replace, we still continue replace and let it
finish just as there is nothing wrong:

  BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 
(devid 1) to /dev/mapper/test-scratch2 started
  BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad 
csum, has 0x00000000 want 0xade80ca1
  BTRFS warning (device dm-4): tree block 5578752 mirror 0 has bad 
csum, has 0x00000000 want 0xade80ca1
  BTRFS warning (device dm-4): checksum error at logical 5578752 on 
dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 
0) in tree 5
  BTRFS warning (device dm-4): checksum error at logical 5578752 on 
dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 
0) in tree 5
  BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 
0, rd 0, flush 0, corrupt 1, gen 0
  BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad 
bytenr, has 0 want 5578752
  BTRFS error (device dm-4): unable to fixup (regular) error at 
logical 5578752 on dev /dev/mapper/test-scratch1
  BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 
(devid 1) to /dev/mapper/test-scratch2 finished

This can lead to unexpected problems for the result fs.

[CAUSE]
Btrfs reuses scrub code path for dev-replace to iterate all dev extents.

But unlike scrub, dev-replace doesn't really bother to check the scrub
progress, which records all the errors found during replace.

And even if we checks the progress, we can not really determine which
errors are minor, which are critical just by the plain numbers.
(remember we don't treat metadata/data checksum error differently).

This behavior is there from the very beginning.

[FIX]
Instead of continue the replace, just error out if we hit an unrepaired
metadata sector.

Now the dev-replace would be rejected with -EIO, to inform the user.
Although it also means, the fs has some metadata error which can not be
repaired, the user would be super upset anyway.

IMO, the original design is fair as it does not capture scrub errors
during the replace. Because the purpose of the scrub is different from
the replace from the user POV.

The problem is, after such replace, the corrupted metadata would have 
different content (we just don't do the writeback at all).
Even worse, the end user is not even aware of the problem, unless dmesg 
is manually checked.

This means we changed the result fs during the replace, which removes 
the tiny chance to do a manual repair (aka, manually re-generate the 
checksum).

However, after the replace, if scrubbed it will still capture any
errors? No?

It's not about scrub after scrub. Such replace should not even finish.

Thanks,
Qu

Thanks, Anand



The new dmesg would look like this:

  BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 
(devid 1) to /dev/mapper/test-scratch2 started
  BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad 
csum, has 0x00000000 want 0xade80ca1
  BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad 
csum, has 0x00000000 want 0xade80ca1
  BTRFS error (device dm-4): unable to fixup (regular) error at 
logical 5570560 on dev /dev/mapper/test-scratch1 physical 5570560
  BTRFS warning (device dm-4): header error at logical 5570560 on dev 
/dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) 
in tree 5
  BTRFS warning (device dm-4): header error at logical 5570560 on dev 
/dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) 
in tree 5
  BTRFS error (device dm-4): stripe 5570560 has unrepaired metadata 
sector at 5578752
  BTRFS error (device dm-4): 
btrfs_scrub_dev(/dev/mapper/test-scratch1, 1, 
/dev/mapper/test-scratch2) failed -5

CC: stable@xxxxxxxxxxxxxxx
Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>
---
I'm not sure how should we merge this patch.

The misc-next is already merging the new scrub code, but the problem is
there for all old kernels thus we need such fixes.

Maybe we can merge this fix before the scrub rework, then the rework,
and finally the better fix using reworked interface?
---
  fs/btrfs/scrub.c | 9 +++++++++
  1 file changed, 9 insertions(+)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index ef4046a2572c..71f64b9bcd9f 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -195,6 +195,7 @@ struct scrub_ctx {
      struct mutex            wr_lock;
      struct btrfs_device     *wr_tgtdev;
      bool                    flush_all_writes;
+    bool            has_meta_failed;
      /*
       * statistics
@@ -1380,6 +1381,8 @@ static int scrub_handle_errored_block(struct 
scrub_block *sblock_to_check)
          btrfs_err_rl_in_rcu(fs_info,
              "unable to fixup (regular) error at logical %llu on dev 
%s",
              logical, btrfs_dev_name(dev));
+        if (is_metadata)
+            sctx->has_meta_failed = true;
      }
  out:
@@ -3838,6 +3841,12 @@ static noinline_for_stack int 
scrub_stripe(struct scrub_ctx *sctx,
      blk_finish_plug(&plug);
+    /*
+     * If we have metadata unable to be repaired, we should error
+     * out the dev-replace.
+     */
+    if (sctx->is_dev_replace && sctx->has_meta_failed && ret >= 0)
+        ret = -EIO;
      if (sctx->is_dev_replace && ret >= 0) {
          int ret2;