On Mon, Mar 11, 2024 at 10:14:15PM +0530, Anand Jain wrote: > On 3/8/24 23:13, Boris Burkov wrote: > > It is possible to confuse the btrfs device cache (fs_devices) by > > starting with a multi-device filesystem, then removing and re-adding a > > device in a way which changes its dev_t while the filesystem is > > unmounted. After this procedure, if we remount, then we are in a funny > > state where struct btrfs_device's "devt" field does not match the bd_dev > > of the "bdev" field. I would say this is bad enough, as we have violated > > a pretty clear invariant. > > > > But for style points, we can then remove the extra device from the fs, > > making it a single device fs, which enables the "temp_fsid" feature, > > which permits multiple separate mounts of different devices with the > > same fsid. Since btrfs is confused and *thinks* there are different > > devices (based on device->devt), it allows a second redundant mount of > > the same device (not a bind mount!). This then allows us to corrupt the > > original mount by doing stuff to the one that should be a bind mount. > > > > Signed-off-by: Boris Burkov <boris@xxxxxx> > > --- > > Changelog: > > v3: > > - fstests convention improvements (helpers, output, comments, etc...) > > v2: > > - fix numerous fundamental issues, v1 wasn't really ready > > > > common/config | 1 + > > tests/btrfs/311 | 105 ++++++++++++++++++++++++++++++++++++++++++++ > > tests/btrfs/311.out | 2 + > > 3 files changed, 108 insertions(+) > > create mode 100755 tests/btrfs/311 > > create mode 100644 tests/btrfs/311.out > > > > diff --git a/common/config b/common/config > > index a3b15b96f..43b517fda 100644 > > --- a/common/config > > +++ b/common/config > > @@ -235,6 +235,7 @@ export BLKZONE_PROG="$(type -P blkzone)" > > export GZIP_PROG="$(type -P gzip)" > > export BTRFS_IMAGE_PROG="$(type -P btrfs-image)" > > export BTRFS_MAP_LOGICAL_PROG=$(type -P btrfs-map-logical) > > +export PARTED_PROG="$(type -P parted)" > > # use 'udevadm settle' or 'udevsettle' to wait for lv to be settled. > > # newer systems have udevadm command but older systems like RHEL5 don't. > > diff --git a/tests/btrfs/311 b/tests/btrfs/311 > > new file mode 100755 > > index 000000000..a7fa541c4 > > --- /dev/null > > +++ b/tests/btrfs/311 > > @@ -0,0 +1,105 @@ > > +#! /bin/bash > > +# SPDX-License-Identifier: GPL-2.0 > > +# Copyright (C) 2024 Meta, Inc. All Rights Reserved. > > +# > > +# FS QA Test 311 > > +# > > +# Test an edge case of multi device volume management in btrfs. > > +# If a device changes devt between mounts of a multi device fs, we can trick > > +# btrfs into mounting the same device twice fully (not as a bind mount). From > > +# there, it is trivial to induce corruption. > > +# > > +. ./common/preamble > > +_begin_fstest auto quick volume scrub > > Please add tmpfsid as well, because this test is about when not to activate > the tmpfsid. > > > + > > +# real QA test starts here > > +_supported_fs btrfs > > +_require_test > > +_require_command "$PARTED_PROG" parted > > +_require_batched_discard "$TEST_DIR" > > + > > _fixed_by_kernel_commit XXXXXXXXXXXX \ > btrfs: validate device maj:min during open > > > +_cleanup() { > > + cd / > > + $UMOUNT_PROG $MNT > > + $UMOUNT_PROG $BIND > > + losetup -d $DEV0 > > + losetup -d $DEV1 > > + losetup -d $DEV2 > > + rm $IMG0 > > + rm $IMG1 > > + rm $IMG2 > > +} > > + > > +IMG0=$TEST_DIR/$$.img0 > > +IMG1=$TEST_DIR/$$.img1 > > +IMG2=$TEST_DIR/$$.img2 > > +truncate -s 1G $IMG0 > > +truncate -s 1G $IMG1 > > +truncate -s 1G $IMG2 > > +DEV0=$(losetup -f $IMG0 --show) > > +DEV1=$(losetup -f $IMG1 --show) > > +DEV2=$(losetup -f $IMG2 --show) > > +D0P1=$DEV0"p1" > > +D1P1=$DEV1"p1" > > +MNT=$TEST_DIR/mnt > > +BIND=$TEST_DIR/bind > > + > > +# Setup partition table with one partition on each device. > > +$PARTED_PROG $DEV0 'mktable gpt' --script > > +$PARTED_PROG $DEV1 'mktable gpt' --script > > +$PARTED_PROG $DEV0 'mkpart mypart 1M 100%' --script > > +$PARTED_PROG $DEV1 'mkpart mypart 1M 100%' --script > > + > > +# mkfs with two devices to avoid clearing devices on close > > +# single raid to allow removing DEV2. > > +$MKFS_BTRFS_PROG -f -msingle -dsingle $D0P1 $DEV2 >>$seqres.full 2>&1 || _fail "failed to mkfs.btrfs" > > Error out is already sufficient here. > I don't understand, but I think Filipe asked for this one. > > + > > +# Cycle mount the two device fs to populate both devices into the > > +# stale device cache. > > +mkdir -p $MNT > > +_mount $D0P1 $MNT > > +$UMOUNT_PROG $MNT > > + > > +# Swap the partition dev_ts. This leaves the dev_t in the cache out of date. > > +$PARTED_PROG $DEV0 'rm 1' --script > > +$PARTED_PROG $DEV1 'rm 1' --script > > +$PARTED_PROG $DEV1 'mkpart mypart 1M 100%' --script > > +$PARTED_PROG $DEV0 'mkpart mypart 1M 100%' --script > > + > > +# Mount with mismatched dev_t! > > +_mount $D0P1 $MNT || _fail "failed to remount; don't proceed and do dangerous stuff on raw mount point" > > + > > On a system where the kernel bug is fixed, the mount is expected to pass. > > On a system without the kernel bug fix, the mount is still expected to pass. > > However, the failure message indicates that it is advisable to fail the > mount in this scenario. > > I believe this will be the case once the btrfs-progs patch below is > integrated: > > [PATCH 0/2] btrfs-progs: forget removed devices > > So, we need to update the test case logic based on whether the above > btrfs-progs patch is integrated. > > The point of this was just to do our best to avoid doing a bunch of writes/scrubbing/trimming on the fs under all the loopdevs and stuff in an earlier version outside fstests. It would stink to run the script without set -e and blow up your root drive or whatever. With that said, this is running in TEST_DIR, so I think we can forget about the precaution now. > > +# Remove the extra device to bring temp-fsid back in the fray. > > +$BTRFS_UTIL_PROG device remove $DEV2 $MNT > > + > > +# Create the would be bind mount. > > +mkdir -p $BIND > > +_mount $D0P1 $BIND > > +mount_show=$($BTRFS_UTIL_PROG filesystem show $MNT) > > +bind_show=$($BTRFS_UTIL_PROG filesystem show $BIND) > > +# If they're different, we are in trouble. > > +[ "$mount_show" = "$bind_show" ] || echo "$mount_show != $bind_show" > > + > > > > +# Now really prove it by corrupting the first mount with the second. > > +for i in $(seq 20); do > > + $XFS_IO_PROG -f -c "pwrite 0 50M" $MNT/foo.$i >>$seqres.full 2>&1 > > +done > > +for i in $(seq 20); do > > + $XFS_IO_PROG -f -c "pwrite 0 50M" $BIND/foo.$i >>$seqres.full 2>&1 > > +done > > + > > > +# sync so that we really write the large file data out to the shared device > > +sync > > + > > +# now delete from one view of the shared device > > +find $BIND -type f -delete > > +# sync so that fstrim definitely has deleted data to trim > > +sync > > +# This should blow up both mounts, if the writes somehow didn't overlap at all. > > +$FSTRIM_PROG $BIND > > +# drop caches to improve the odds we read from the corrupted device while scrubbing. > > +echo 3 > /proc/sys/vm/drop_caches > > +$BTRFS_UTIL_PROG scrub start -B $MNT | grep "Error summary:" > > + > > > The rest appears to be fine. > > Question: Why didn't you choose a cp --reflink=always across $MNT and $BIND > to prove how kernel think about $MNT and $BIND. Didn't occur to me. Who knows, maybe one day cross-fs reflink will work ;) Demonstrating a corruption felt "elemental" to me. > > Thanks, Anand > > > +status=0 > > +exit > > diff --git a/tests/btrfs/311.out b/tests/btrfs/311.out > > new file mode 100644 > > index 000000000..70a6db809 > > --- /dev/null > > +++ b/tests/btrfs/311.out > > @@ -0,0 +1,2 @@ > > +QA output created by 311 > > +Error summary: no errors found >