Re: [PATCH v2] btrfs/311: new test for devt change between mounts

Boris Burkov <boris@xxxxxx> · Thu, 7 Mar 2024 12:49:36 -0800

On Tue, Mar 05, 2024 at 12:13:07PM +0000, Filipe Manana wrote:
> On Fri, Mar 1, 2024 at 12:02 AM Boris Burkov <boris@xxxxxx> wrote:
> >
> > It is possible to confuse the btrfs device cache (fs_devices) by
> > starting with a multi-device filesystem, then removing and re-adding a
> > device in a way which changes its dev_t while the filesystem is
> > unmounted. After this procedure, if we remount, then we are in a funny
> > state where struct btrfs_device's "devt" field does not match the bd_dev
> > of the "bdev" field. I would say this is bad enough, as we have violated
> > a pretty clear invariant.
> >
> > But for style points, we can then remove the extra device from the fs,
> > making it a single device fs, which enables the "temp_fsid" feature,
> > which permits multiple separate mounts of different devices with the
> > same fsid. Since btrfs is confused and *thinks* there are different
> > devices (based on device->devt), it allows a second redundant mount of
> > the same device (not a bind mount!). This then allows us to corrupt the
> > original mount by doing stuff to the one that should be a bind mount.
> >
> > This is fixed by the combination of the kernel patch: > > btrfs: support device name lookup in forget
> > and the btrfs-progs patches:
> > btrfs-progs: allow btrfs device scan -u on dead dev
> > btrfs-progs: add udev rule to forget removed device
> 
> May I suggest to make this more readable, easier to the eye?
> My inserting blank lines and indenting the lines with the patch
> subjects, like for example:
> 
> """
> This is fixed by the combination of the kernel patch:
> 
>     btrfs: support device name lookup in forget
> 
> and the btrfs-progs patches:
> 
>    btrfs-progs: allow btrfs device scan -u on dead dev
>    btrfs-progs: add udev rule to forget removed device
> """
> 
> And these should be placed in the test case itself with:
> 
>    _fixed_by_git_commit btrfs-progs xxxxxxxxxx "btrfs-progs: allow
> btrfs device scan -u on dead dev"
> 
> For btrfs-progs commits, and for the kernel:
> 
>    _fixed_by_kernel_commit xxxxxxxxxxxx "btrfs: support device name
> lookup in forget"

I'll do so going forward, thanks.

> 
> I see however that there's discussion for those patches between you,
> Anand and David, and it
> seems there's a chance those patches won't be merged to fix this bug,
> especially the kernel one
> for which Anand submitted an alternative. If those are added to the
> test case itself, can always be
> updated later if needed.

I'm leaning towards omitting this part from the commit message and we
can update the test case once we pick a winner. Is that OK with you?

> 
> Also avoid putting the test number in the subject - there's no
> guarantee it will end up with that number when merged.
> 
> >
> > Signed-off-by: Boris Burkov <boris@xxxxxx>
> > ---
> >  common/config       |   1 +
> >  common/rc           |   4 ++
> >  tests/btrfs/311     | 101 ++++++++++++++++++++++++++++++++++++++++++++
> >  tests/btrfs/311.out |   2 +
> >  4 files changed, 108 insertions(+)
> >  create mode 100755 tests/btrfs/311
> >  create mode 100644 tests/btrfs/311.out
> >
> > diff --git a/common/config b/common/config
> > index a3b15b96f..43b517fda 100644
> > --- a/common/config
> > +++ b/common/config
> > @@ -235,6 +235,7 @@ export BLKZONE_PROG="$(type -P blkzone)"
> >  export GZIP_PROG="$(type -P gzip)"
> >  export BTRFS_IMAGE_PROG="$(type -P btrfs-image)"
> >  export BTRFS_MAP_LOGICAL_PROG=$(type -P btrfs-map-logical)
> > +export PARTED_PROG="$(type -P parted)"
> >
> >  # use 'udevadm settle' or 'udevsettle' to wait for lv to be settled.
> >  # newer systems have udevadm command but older systems like RHEL5 don't.
> > diff --git a/common/rc b/common/rc
> > index 30c44dddd..8e009aca9 100644
> > --- a/common/rc
> > +++ b/common/rc
> > @@ -5375,6 +5375,10 @@ _require_unshare() {
> >                 _notrun "unshare $*: command not found, should be in util-linux"
> >  }
> >
> > +_require_parted() {
> 
> These three last functions from common/rc use that style, the { after
> the function name in the same line,
> but everywhere else the { is on a new line, and that's the most common
> style in fstests.
> I wish we had some consistency.
> 
> > +       $PARTED_PROG --list &>/dev/null || _notrun "parted: command not found"
> 
> Why not just call:
> 
> _require_command "$PARTED_PROG" parted
> 
> Could even do that in the test and no need for a common function in
> this file, as we do in many other tests
> (grep for "'_require_command'" in tests/generic for example).
> 
> > +}
> > +
> >  # Return a random file in a directory. A directory is *not* followed
> >  # recursively.
> >  _random_file() {
> > diff --git a/tests/btrfs/311 b/tests/btrfs/311
> > new file mode 100755
> > index 000000000..887c46ba0
> > --- /dev/null
> > +++ b/tests/btrfs/311
> > @@ -0,0 +1,101 @@
> > +#! /bin/bash
> > +# SPDX-License-Identifier: GPL-2.0
> > +# Copyright (C) 2024 Meta, Inc. All Rights Reserved.
> > +#
> > +# FS QA Test 311
> > +#
> > +# Test an edge case of multi device volume management in btrfs.
> > +# If a device changes devt between mounts of a multi device fs, we can trick
> > +# btrfs into mounting the same device twice fully (not as a bind mount). From
> > +# there, it is trivial to induce corruption.
> > +#
> > +. ./common/preamble
> > +_begin_fstest auto quick volume
> 
> Missing 'scrub' group.
> 
> > +
> > +# real QA test starts here
> > +_supported_fs btrfs
> > +_require_test
> > +_require_parted
> > +
> > +_cleanup() {
> > +       cd /
> > +       umount $MNT
> > +       umount $BIND
> > +       losetup -d $DEV0
> > +       losetup -d $DEV1
> > +       losetup -d $DEV2
> > +       rm $IMG0
> > +       rm $IMG1
> > +       rm $IMG2
> > +}
> > +
> > +IMG0=$TEST_DIR/$$.img0
> > +IMG1=$TEST_DIR/$$.img1
> > +IMG2=$TEST_DIR/$$.img2
> > +truncate -s 1G $IMG0
> > +truncate -s 1G $IMG1
> > +truncate -s 1G $IMG2
> > +DEV0=$(losetup -f $IMG0 --show)
> > +DEV1=$(losetup -f $IMG1 --show)
> > +DEV2=$(losetup -f $IMG2 --show)
> > +D0P1=$DEV0"p1"
> > +D1P1=$DEV1"p1"
> > +MNT=$TEST_DIR/mnt
> > +BIND=$TEST_DIR/bind
> > +
> > +# Setup partition table with one partition on each device.
> > +$PARTED_PROG $DEV0 'mktable gpt' --script
> > +$PARTED_PROG $DEV1 'mktable gpt' --script
> > +$PARTED_PROG $DEV0 'mkpart mypart 1M 100%' --script
> > +$PARTED_PROG $DEV1 'mkpart mypart 1M 100%' --script
> > +
> > +# mkfs with two devices to avoid clearing devices on close
> > +# single raid to allow removing DEV2.
> > +$MKFS_BTRFS_PROG -f -msingle -dsingle $D0P1 $DEV2 &>/dev/null
> 
> Please redirect to the $seqres.full instead, and _fail if mkfs fails.
> That's what we do nowadays due to unpleasant surprises in the past.
> 
> > +
> > +# Cycle mount the two device fs to populate both devices into the
> > +# stale device cache.
> > +mkdir -p $MNT
> > +mount $D0P1 $MNT
> 
> Use the _mount() helper.
> 
> > +umount $MNT
> 
> We use $UMOUNT_PROG in fstests.
> 
> > +
> > +# Swap the partition dev_ts. This leaves the dev_t in the cache out of date.
> > +$PARTED_PROG $DEV0 'rm 1' --script
> > +$PARTED_PROG $DEV1 'rm 1' --script
> > +$PARTED_PROG $DEV1 'mkpart mypart 1M 100%' --script
> > +$PARTED_PROG $DEV0 'mkpart mypart 1M 100%' --script
> > +
> > +# Mount with mismatched dev_t!
> > +mount $D0P1 $MNT || _fail "failed to remount; don't proceed and do dangerous stuff on raw mount point"
> 
> Same here.
> 
> > +
> > +# Remove the extra device to bring temp-fsid back in the fray.
> > +$BTRFS_UTIL_PROG device remove $DEV2 $MNT
> > +
> > +# Create the would be bind mount.
> > +mkdir -p $BIND
> > +mount $D0P1 $BIND
> 
> Same here.
> 
> > +mount_show=$($BTRFS_UTIL_PROG filesystem show $MNT)
> > +bind_show=$($BTRFS_UTIL_PROG filesystem show $BIND)
> > +# If they're different, we are in trouble.
> > +[ "$mount_show" = "$bind_show" ] || echo "$mount_show != $bind_show"
> > +
> > +# Now really prove it by corrupting the first mount with the second.
> > +for i in $(seq 20); do
> > +       $XFS_IO_PROG -f -c "pwrite 0 50M" $MNT/foo.$i >$seqres.full 2>&1
> 
> This is overriding the .full file, use >>
> 
> > +done
> > +for i in $(seq 20); do
> > +       $XFS_IO_PROG -f -c "pwrite 0 50M" $BIND/foo.$i >$seqres.full 2>&1
> 
> Same here, this is overriding the .full file, use >>
> 
> > +done
> > +sync
> 
> Can we please have a comment mentioning why the sync is needed?
> 
> > +find $BIND -type f -delete
> > +sync
> 
> Same here.
> 
> > +
> > +# This should blow up both mounts, if the writes somehow didn't overlap at all.
> > +$FSTRIM_PROG $BIND
> 
> Since it's using the fstrim program and needs trim to be supported,
> the test misses a:
> 
> _require_batched_discard "$TEST_DIR"
> 
> at the top.
> 
> > +echo 3 > /proc/sys/vm/drop_caches
> 
> Please add a comment mentioning why we are dropping caches.
> 
> > +$BTRFS_UTIL_PROG scrub start -B $MNT >>$seqres.full 2>&1
> 
> The test passes whether scrub fails or succeeds, as it's redirecting
> stdout and stderr to the .full file and ignoring the exit status of
> the command.
> 
> The ideal fstests way is to put the expected output in the golden
> output file and just call the command without redirecting anything.
> If that's not doable for some reason, at the very least do a (...) ||
> echo "Scrub failed, check $seqres.full for details."
> 
> Thanks.
> 
> > +
> > +# success, all done
> > +echo "Silence is golden"
> > +status=0
> > +exit
> > diff --git a/tests/btrfs/311.out b/tests/btrfs/311.out
> > new file mode 100644
> > index 000000000..62f253029
> > --- /dev/null
> > +++ b/tests/btrfs/311.out
> > @@ -0,0 +1,2 @@
> > +QA output created by 311
> > +Silence is golden
> > --
> > 2.43.0
> >
> >