On Wed, Sep 06, 2023 at 03:26:21PM +0200, Mikulas Patocka wrote: > lvm may suspend any logical volume anytime. If lvm suspend races with > unmount, it may be possible that the kernel writes to the filesystem after > unmount successfully returned. The problem can be demonstrated with this > script: > > #!/bin/sh -ex > modprobe brd rd_size=4194304 > vgcreate vg /dev/ram0 > lvcreate -L 16M -n lv vg > mkfs.ext4 /dev/vg/lv > mount -t ext4 /dev/vg/lv /mnt/test > dmsetup suspend /dev/vg/lv > (sleep 1; dmsetup resume /dev/vg/lv) & > umount /mnt/test > md5sum /dev/vg/lv > md5sum /dev/vg/lv > dmsetup remove_all > rmmod brd > > The script unmounts the filesystem and runs md5sum twice, the result is > that these two invocations return different hash. > > What happens: > * dmsetup suspend calls freeze_bdev, that goes to freeze_super and it > increments sb->s_active > * then we unmount the filesystem, we go to cleanup_mnt, cleanup_mnt calls > deactivate_super, deactivate_super sees that sb->s_active is 2, so it > decreases it to 1 and does nothing - the umount syscall returns > successfully > * now we have a mounted filesystem despite the fact that umount returned That can happen for any number of reasons. Other codepaths might very well still hold active references to the superblock. The same thing can happen when you have your filesystem pinned in another mount namespace. If you really want to be absolutely sure that the superblock is destroyed you must use a mechanism like fanotify which allows you to get notified on superblock destruction. > @@ -1251,7 +1251,7 @@ static void cleanup_mnt(struct mount *mn > } > fsnotify_vfsmount_delete(&mnt->mnt); > dput(mnt->mnt.mnt_root); > - deactivate_super(mnt->mnt.mnt_sb); > + wait_and_deactivate_super(mnt->mnt.mnt_sb); Your patch means that we hang on any umount when the filesystem is frozen. IOW, you'd also hang on any umount of a bind-mount. IOW, every single container making use of this filesystems via bind-mounts would hang on umount and shutdown. You'd effectively build a deadlock trap for userspace when the filesystem is frozen. And nothing can make progress until that thing is thawed. Umount can't block if the block device is frozen. > mnt_free_id(mnt); > call_rcu(&mnt->mnt_rcu, delayed_free_vfsmnt); > } > Index: linux-2.6/fs/super.c > =================================================================== > --- linux-2.6.orig/fs/super.c 2023-09-05 21:09:16.000000000 +0200 > +++ linux-2.6/fs/super.c 2023-09-06 09:52:20.000000000 +0200 > @@ -36,6 +36,7 @@ > #include <linux/lockdep.h> > #include <linux/user_namespace.h> > #include <linux/fs_context.h> > +#include <linux/delay.h> > #include <uapi/linux/mount.h> > #include "internal.h" > > @@ -365,6 +366,25 @@ void deactivate_super(struct super_block > EXPORT_SYMBOL(deactivate_super); > > /** > + * wait_and_deactivate_super - wait for unfreeze and drop a reference > + * @s: superblock to deactivate > + * > + * Variant of deactivate_super(), except that we wait if the filesystem is > + * frozen. This is required on unmount, to make sure that the filesystem is > + * really unmounted when this function returns. > + */ > +void wait_and_deactivate_super(struct super_block *s) > +{ > + down_write(&s->s_umount); > + while (s->s_writers.frozen != SB_UNFROZEN) { > + up_write(&s->s_umount); > + msleep(1); > + down_write(&s->s_umount); > + } > + deactivate_locked_super(s); That msleep(1) alone is a pretty nasty hack. We should definitely not spin in code like this. That superblock could stay frozen for a long time without s_umount held. So this is spinning. Even if we wanted to do this it would need to use a similar wait mechanism for the filesystem to be thawed like we do in thaw_super_locked().