On Thu, Jun 13, 2024 at 11:55:37AM +0200, Christian Brauner wrote:
> On Wed, Jun 12, 2024 at 03:47:16PM GMT, Theodore Ts'o wrote:
> > On Wed, Jun 12, 2024 at 01:25:07PM +0200, Christian Brauner wrote:
> > > I've been trying to reproduce this with pmem yesterday and wasn't able to.
> > >
> > > What's the kernel config and test config that's used?
> >
> > The kernel config can be found here:
> >
> > https://github.com/tytso/xfstests-bld/blob/master/kernel-build/kernel-configs/config-6.1
> >
> > Drop it into .config in the build directory of any kernel sources
> > newer than 6.1, and then run "make olddefconfig". This is all
> > automated in the install-kconfig script which I use:
> >
> > https://github.com/tytso/xfstests-bld/blob/master/kernel-build/install-kconfig
> >
> > The VM has 4 CPUs and 26GiB of memory, and the kernel is booted with
> > the command-line options "memmap=4G!9G memmap=9G!14G", which set up
> > fake /dev/pmem0 and /dev/pmem1 devices backed by RAM. This is my poor
> > engineer's way of testing DAX without needing access to expensive VMs
> > with pmem. :-)
> >
> > I'm assuming this is a timing-dependent bug which is easiest to
> > trigger on fast devices, so a ramdisk might also work. FWIW, I also
> > see failures relatively frequently using the ext4/nojournal
> > configuration on an SSD-backed cloud block device (GCE's Persistent
> > Disk SSD product).
> >
> > As a result, if you grab my xfstests-bld repo from github and then
> > run "qemu-xfstests -c ext4/nojournal -C 20 generic/085", it should
> > also reproduce. See Documentation/kvm-quickstart.md for more details.

> Thanks, Ted! Ok, I think I figured it out.
>
> P1
> dm_suspend()
> -> bdev_freeze()
>    mutex_lock(&bdev->bd_fsfreeze_mutex);
>    atomic_inc_return(&bdev->bd_fsfreeze_count); // == 1
>    mutex_unlock(&bdev->bd_fsfreeze_mutex);
>
>        P2                                         P3
>        setup_bdev_super()
>        bdev_file_open_by_dev();
>        atomic_read(&bdev->bd_fsfreeze_count); // != 0
>
>                                                   bdev_thaw()
>                                                   mutex_lock(&bdev->bd_fsfreeze_mutex);
>                                                   atomic_dec_return(&bdev->bd_fsfreeze_count); // == 0
>                                                   mutex_unlock(&bdev->bd_fsfreeze_mutex);
>                                                   bd_holder_lock();
>                                                   // grab passive reference on sb via sb->s_count
>                                                   bd_holder_unlock();
>                                                   // Sleep to be woken when superblock ready or dead
>        bdev_fput()
>        bd_holder_lock()
>        // yield bdev
>        bd_holder_unlock()
>
>        deactivate_locked_super()
>        // notify that superblock is dead
>                                                   // get woken and see that superblock is dead; fail
>
> In words this basically means that P1 calls dm_suspend() which calls
> into bdev_freeze() before the block device has been claimed by the
> filesystem. This brings bdev->bd_fsfreeze_count to 1 and no call into
> fs_bdev_freeze() is required.
>
> Now P2 tries to mount that frozen block device. It claims it and checks
> bdev->bd_fsfreeze_count. As it's elevated it aborts mounting, holding
> sb->s_umount all the time ofc.
>
> In the meantime P3 calls dm_resume(). It sees that the block device is
> already claimed by a filesystem and calls into fs_bdev_thaw().
>
> It takes a passive reference and realizes that the filesystem isn't
> ready yet. So P3 puts itself to sleep to wait for the filesystem to
> become ready.
>
> P2 puts the last active reference to the filesystem and marks it as
> dying.
>
> Now P3 gets woken, sees that the filesystem is dying and
> get_bdev_super() fails.
>
> So Darrick is correct about the fix but the reasoning is a bit
> different. :)
>
> Patch appended and on #vfs.fixes.
> From 35224b919d6778ca5dd11f76659ae849594bd2bf Mon Sep 17 00:00:00 2001
> From: Christian Brauner <brauner@xxxxxxxxxx>
> Date: Thu, 13 Jun 2024 11:38:14 +0200
> Subject: [PATCH] fs: don't misleadingly warn during thaw operations
>
> The block device may have been frozen before it was claimed by a
> filesystem. Concurrently another process might try to mount that
> frozen block device and has temporarily claimed the block device for
> that purpose, causing a concurrent fs_bdev_thaw() to end up here. The
> mounter is already about to abort mounting because they still saw an
> elevated bdev->bd_fsfreeze_count, so get_bdev_super() will return
> NULL in that case.
>
> For example, P1 calls dm_suspend() which calls into bdev_freeze() before
> the block device has been claimed by the filesystem. This brings
> bdev->bd_fsfreeze_count to 1 and no call into fs_bdev_freeze() is
> required.
>
> Now P2 tries to mount that frozen block device. It claims it and checks
> bdev->bd_fsfreeze_count. As it's elevated it aborts mounting.
>
> In the meantime P3 calls dm_resume(). It sees that the block device is
> already claimed by a filesystem and calls into fs_bdev_thaw().
>
> It takes a passive reference and realizes that the filesystem isn't
> ready yet. So P3 puts itself to sleep to wait for the filesystem to
> become ready.
>
> P2 puts the last active reference to the filesystem and marks it as
> dying. Now P3 gets woken, sees that the filesystem is dying and
> get_bdev_super() fails.

Wow, that's twisty. But it makes sense to me, so

Reviewed-by: Darrick J. Wong <djwong@xxxxxxxxxx>

--D

> Fixes: 49ef8832fb1a ("bdev: implement freeze and thaw holder operations")
> Cc: <stable@xxxxxxxxxxxxxxx>
> Reported-by: Theodore Ts'o <tytso@xxxxxxx>
> Link: https://lore.kernel.org/r/20240611085210.GA1838544@xxxxxxx
> Signed-off-by: Christian Brauner <brauner@xxxxxxxxxx>
> ---
>  fs/super.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/fs/super.c b/fs/super.c
> index b72f1d288e95..095ba793e10c 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1502,8 +1502,17 @@ static int fs_bdev_thaw(struct block_device *bdev)
>
>  	lockdep_assert_held(&bdev->bd_fsfreeze_mutex);
>
> +	/*
> +	 * The block device may have been frozen before it was claimed by a
> +	 * filesystem. Concurrently another process might try to mount that
> +	 * frozen block device and has temporarily claimed the block device for
> +	 * that purpose causing a concurrent fs_bdev_thaw() to end up here. The
> +	 * mounter is already about to abort mounting because they still saw an
> +	 * elevated bdev->bd_fsfreeze_count so get_bdev_super() will return
> +	 * NULL in that case.
> +	 */
>  	sb = get_bdev_super(bdev);
> -	if (WARN_ON_ONCE(!sb))
> +	if (!sb)
>  		return -EINVAL;
>
>  	if (sb->s_op->thaw_super)
> --
> 2.43.0
>