On Thu, Mar 7, 2024 at 8:48 PM Boris Burkov <boris@xxxxxx> wrote: > > On Tue, Mar 05, 2024 at 12:13:07PM +0000, Filipe Manana wrote: > > On Fri, Mar 1, 2024 at 12:02 AM Boris Burkov <boris@xxxxxx> wrote: > > > > > > It is possible to confuse the btrfs device cache (fs_devices) by > > > starting with a multi-device filesystem, then removing and re-adding a > > > device in a way which changes its dev_t while the filesystem is > > > unmounted. After this procedure, if we remount, then we are in a funny > > > state where struct btrfs_device's "devt" field does not match the bd_dev > > > of the "bdev" field. I would say this is bad enough, as we have violated > > > a pretty clear invariant. > > > > > > But for style points, we can then remove the extra device from the fs, > > > making it a single device fs, which enables the "temp_fsid" feature, > > > which permits multiple separate mounts of different devices with the > > > same fsid. Since btrfs is confused and *thinks* there are different > > > devices (based on device->devt), it allows a second redundant mount of > > > the same device (not a bind mount!). This then allows us to corrupt the > > > original mount by doing stuff to the one that should be a bind mount. > > > > > > This is fixed by the combination of the kernel patch: > > btrfs: support device name lookup in forget > > > and the btrfs-progs patches: > > > btrfs-progs: allow btrfs device scan -u on dead dev > > > btrfs-progs: add udev rule to forget removed device > > > > May I suggest to make this more readable, easier to the eye? > > My inserting blank lines and indenting the lines with the patch > > subjects, like for example: > > > > """ > > This is fixed by the combination of the kernel patch: > > > > btrfs: support device name lookup in forget > > > > and the btrfs-progs patches: > > > > btrfs-progs: allow btrfs device scan -u on dead dev > > btrfs-progs: add udev rule to forget removed device > > """ > > > > And these should be placed in the test case itself with: > > > > _fixed_by_git_commit btrfs-progs xxxxxxxxxx "btrfs-progs: allow > > btrfs device scan -u on dead dev" > > > > For btrfs-progs commits, and for the kernel: > > > > _fixed_by_kernel_commit xxxxxxxxxxxx "btrfs: support device name > > lookup in forget" > > I'll do so going forward, thanks. > > > > > I see however that there's discussion for those patches between you, > > Anand and David, and it > > seems there's a chance those patches won't be merged to fix this bug, > > especially the kernel one > > for which Anand submitted an alternative. If those are added to the > > test case itself, can always be > > updated later if needed. > > I'm leaning towards omitting this part from the commit message and we > can update the test case once we pick a winner. Is that OK with you? Yes, thanks. > > > > > Also avoid putting the test number in the subject - there's no > > guarantee it will end up with that number when merged. > > > > > > > > Signed-off-by: Boris Burkov <boris@xxxxxx> > > > --- > > > common/config | 1 + > > > common/rc | 4 ++ > > > tests/btrfs/311 | 101 ++++++++++++++++++++++++++++++++++++++++++++ > > > tests/btrfs/311.out | 2 + > > > 4 files changed, 108 insertions(+) > > > create mode 100755 tests/btrfs/311 > > > create mode 100644 tests/btrfs/311.out > > > > > > diff --git a/common/config b/common/config > > > index a3b15b96f..43b517fda 100644 > > > --- a/common/config > > > +++ b/common/config > > > @@ -235,6 +235,7 @@ export BLKZONE_PROG="$(type -P blkzone)" > > > export GZIP_PROG="$(type -P gzip)" > > > export BTRFS_IMAGE_PROG="$(type -P btrfs-image)" > > > export BTRFS_MAP_LOGICAL_PROG=$(type -P btrfs-map-logical) > > > +export PARTED_PROG="$(type -P parted)" > > > > > > # use 'udevadm settle' or 'udevsettle' to wait for lv to be settled. > > > # newer systems have udevadm command but older systems like RHEL5 don't. > > > diff --git a/common/rc b/common/rc > > > index 30c44dddd..8e009aca9 100644 > > > --- a/common/rc > > > +++ b/common/rc > > > @@ -5375,6 +5375,10 @@ _require_unshare() { > > > _notrun "unshare $*: command not found, should be in util-linux" > > > } > > > > > > +_require_parted() { > > > > These three last functions from common/rc use that style, the { after > > the function name in the same line, > > but everywhere else the { is on a new line, and that's the most common > > style in fstests. > > I wish we had some consistency. > > > > > + $PARTED_PROG --list &>/dev/null || _notrun "parted: command not found" > > > > Why not just call: > > > > _require_command "$PARTED_PROG" parted > > > > Could even do that in the test and no need for a common function in > > this file, as we do in many other tests > > (grep for "'_require_command'" in tests/generic for example). > > > > > +} > > > + > > > # Return a random file in a directory. A directory is *not* followed > > > # recursively. > > > _random_file() { > > > diff --git a/tests/btrfs/311 b/tests/btrfs/311 > > > new file mode 100755 > > > index 000000000..887c46ba0 > > > --- /dev/null > > > +++ b/tests/btrfs/311 > > > @@ -0,0 +1,101 @@ > > > +#! /bin/bash > > > +# SPDX-License-Identifier: GPL-2.0 > > > +# Copyright (C) 2024 Meta, Inc. All Rights Reserved. > > > +# > > > +# FS QA Test 311 > > > +# > > > +# Test an edge case of multi device volume management in btrfs. > > > +# If a device changes devt between mounts of a multi device fs, we can trick > > > +# btrfs into mounting the same device twice fully (not as a bind mount). From > > > +# there, it is trivial to induce corruption. > > > +# > > > +. ./common/preamble > > > +_begin_fstest auto quick volume > > > > Missing 'scrub' group. > > > > > + > > > +# real QA test starts here > > > +_supported_fs btrfs > > > +_require_test > > > +_require_parted > > > + > > > +_cleanup() { > > > + cd / > > > + umount $MNT > > > + umount $BIND > > > + losetup -d $DEV0 > > > + losetup -d $DEV1 > > > + losetup -d $DEV2 > > > + rm $IMG0 > > > + rm $IMG1 > > > + rm $IMG2 > > > +} > > > + > > > +IMG0=$TEST_DIR/$$.img0 > > > +IMG1=$TEST_DIR/$$.img1 > > > +IMG2=$TEST_DIR/$$.img2 > > > +truncate -s 1G $IMG0 > > > +truncate -s 1G $IMG1 > > > +truncate -s 1G $IMG2 > > > +DEV0=$(losetup -f $IMG0 --show) > > > +DEV1=$(losetup -f $IMG1 --show) > > > +DEV2=$(losetup -f $IMG2 --show) > > > +D0P1=$DEV0"p1" > > > +D1P1=$DEV1"p1" > > > +MNT=$TEST_DIR/mnt > > > +BIND=$TEST_DIR/bind > > > + > > > +# Setup partition table with one partition on each device. > > > +$PARTED_PROG $DEV0 'mktable gpt' --script > > > +$PARTED_PROG $DEV1 'mktable gpt' --script > > > +$PARTED_PROG $DEV0 'mkpart mypart 1M 100%' --script > > > +$PARTED_PROG $DEV1 'mkpart mypart 1M 100%' --script > > > + > > > +# mkfs with two devices to avoid clearing devices on close > > > +# single raid to allow removing DEV2. > > > +$MKFS_BTRFS_PROG -f -msingle -dsingle $D0P1 $DEV2 &>/dev/null > > > > Please redirect to the $seqres.full instead, and _fail if mkfs fails. > > That's what we do nowadays due to unpleasant surprises in the past. > > > > > + > > > +# Cycle mount the two device fs to populate both devices into the > > > +# stale device cache. > > > +mkdir -p $MNT > > > +mount $D0P1 $MNT > > > > Use the _mount() helper. > > > > > +umount $MNT > > > > We use $UMOUNT_PROG in fstests. > > > > > + > > > +# Swap the partition dev_ts. This leaves the dev_t in the cache out of date. > > > +$PARTED_PROG $DEV0 'rm 1' --script > > > +$PARTED_PROG $DEV1 'rm 1' --script > > > +$PARTED_PROG $DEV1 'mkpart mypart 1M 100%' --script > > > +$PARTED_PROG $DEV0 'mkpart mypart 1M 100%' --script > > > + > > > +# Mount with mismatched dev_t! > > > +mount $D0P1 $MNT || _fail "failed to remount; don't proceed and do dangerous stuff on raw mount point" > > > > Same here. > > > > > + > > > +# Remove the extra device to bring temp-fsid back in the fray. > > > +$BTRFS_UTIL_PROG device remove $DEV2 $MNT > > > + > > > +# Create the would be bind mount. > > > +mkdir -p $BIND > > > +mount $D0P1 $BIND > > > > Same here. > > > > > +mount_show=$($BTRFS_UTIL_PROG filesystem show $MNT) > > > +bind_show=$($BTRFS_UTIL_PROG filesystem show $BIND) > > > +# If they're different, we are in trouble. > > > +[ "$mount_show" = "$bind_show" ] || echo "$mount_show != $bind_show" > > > + > > > +# Now really prove it by corrupting the first mount with the second. > > > +for i in $(seq 20); do > > > + $XFS_IO_PROG -f -c "pwrite 0 50M" $MNT/foo.$i >$seqres.full 2>&1 > > > > This is overriding the .full file, use >> > > > > > +done > > > +for i in $(seq 20); do > > > + $XFS_IO_PROG -f -c "pwrite 0 50M" $BIND/foo.$i >$seqres.full 2>&1 > > > > Same here, this is overriding the .full file, use >> > > > > > +done > > > +sync > > > > Can we please have a comment mentioning why the sync is needed? > > > > > +find $BIND -type f -delete > > > +sync > > > > Same here. > > > > > + > > > +# This should blow up both mounts, if the writes somehow didn't overlap at all. > > > +$FSTRIM_PROG $BIND > > > > Since it's using the fstrim program and needs trim to be supported, > > the test misses a: > > > > _require_batched_discard "$TEST_DIR" > > > > at the top. > > > > > +echo 3 > /proc/sys/vm/drop_caches > > > > Please add a comment mentioning why we are dropping caches. > > > > > +$BTRFS_UTIL_PROG scrub start -B $MNT >>$seqres.full 2>&1 > > > > The test passes whether scrub fails or succeeds, as it's redirecting > > stdout and stderr to the .full file and ignoring the exit status of > > the command. > > > > The ideal fstests way is to put the expected output in the golden > > output file and just call the command without redirecting anything. > > If that's not doable for some reason, at the very least do a (...) || > > echo "Scrub failed, check $seqres.full for details." > > > > Thanks. > > > > > + > > > +# success, all done > > > +echo "Silence is golden" > > > +status=0 > > > +exit > > > diff --git a/tests/btrfs/311.out b/tests/btrfs/311.out > > > new file mode 100644 > > > index 000000000..62f253029 > > > --- /dev/null > > > +++ b/tests/btrfs/311.out > > > @@ -0,0 +1,2 @@ > > > +QA output created by 311 > > > +Silence is golden > > > -- > > > 2.43.0 > > > > > >