On Fri 18-08-23 16:00:50, Christian Brauner wrote:
> Recent patches experiment with making it possible to allocate a new
> superblock before opening the relevant block device. Naturally this has
> intricate side-effects that we get to learn about while developing this.
>
> Superblock allocators such as sget{_fc}() return with s_umount of the
> new superblock held and lock ordering currently requires that block
> level locks such as bdev_lock and open_mutex rank above s_umount.
>
> Before aca740cecbe5 ("fs: open block device after superblock creation")
> ordering was guaranteed to be correct as block devices were opened prior
> to superblock allocation and thus s_umount wasn't held. But now s_umount
> must be dropped before opening block devices to avoid locking
> violations.
>
> This has consequences. The main one is that iterators over
> @super_blocks and @fs_supers that grab a temporary reference to the
> superblock can now also grab s_umount before the caller has managed to
> open block devices and called fill_super(). So whereas before such
> iterators or concurrent mounts would have simply slept on s_umount until
> SB_BORN was set or the superblock was discarded due to initialization
> failure, they can now needlessly spin through sget{_fc}().
>
> If the caller is sleeping on bdev_lock or open_mutex, one caller waiting
> on SB_BORN will always spin somewhere, and potentially this can go on
> for quite a while.
>
> It should be possible to drop s_umount while allowing iterators to wait
> on a nascent superblock to either be born or discarded. This patch
> implements a wait_var_event() mechanism allowing iterators to sleep
> until they are woken when the superblock is born or discarded.
>
> This also allows us to avoid relooping through @fs_supers and
> @super_blocks if a superblock isn't yet born or dying.
>
> Link: aca740cecbe5 ("fs: open block device after superblock creation")
> Signed-off-by: Christian Brauner <brauner@xxxxxxxxxx>
...
> +/* wake waiters */
> +#define SUPER_WAKE_FLAGS (SB_BORN | SB_DYING)
> +static void super_wake(struct super_block *sb, unsigned int flag)
> +{
> +	WARN_ON_ONCE((flag & ~SUPER_WAKE_FLAGS));
> +	WARN_ON_ONCE(hweight32(flag & SUPER_WAKE_FLAGS) > 1);
> +
> +	/*
> +	 * Pairs with smp_load_acquire() in super_lock() and
> +	 * ensures that @flag is set before we wake anyone and ensures
> +	 * that checking whether waitqueue is active isn't hoisted
> +	 * before the store of @flag.
> +	 */
> +	sb->s_flags |= flag;
> +	smp_mb();
> +	wake_up_var(&sb->s_flags);

I think we misunderstood here. I believe we need:

	/*
	 * Pairs with smp_load_acquire() in super_lock() to make sure
	 * all initializations in the superblock are seen by the user
	 * seeing SB_BORN set.
	 */
	smp_store_release(&sb->s_flags, sb->s_flags | flag);
	/*
	 * Pairs with the barrier in prepare_to_wait_event() to make sure
	 * ___wait_var_event() either sees SB_BORN set or the
	 * waitqueue_active() check in wake_up_var() sees the waiter.
	 */
	smp_mb();
	wake_up_var(&sb->s_flags);

or we need something equivalent with stronger barriers. Like:

	smp_wmb();
	sb->s_flags |= flag;
	smp_mb();
	wake_up_var(&sb->s_flags);
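For completeness, the waiter side this has to pair with would look
roughly like the sketch below (reconstructed from the patch description
and the comments above, so the exact structure and the
__super_lock()/super_unlock() helper names are approximations):

	static bool super_lock(struct super_block *sb, bool excl)
	{
	relock:
		__super_lock(sb, excl);

		/* SB_BORN and SB_DYING are never cleared once set. */
		if (sb->s_flags & SB_DYING)
			return false;
		if (sb->s_flags & SB_BORN)
			return true;

		super_unlock(sb, excl);

		/*
		 * ___wait_var_event() re-checks the condition after
		 * prepare_to_wait_event() has queued us, so together with
		 * the smp_mb() before wake_up_var() we either see the flag
		 * or the waker's waitqueue_active() check sees us.
		 *
		 * The smp_load_acquire() pairs with the smp_store_release()
		 * in super_wake() so that the superblock is seen as fully
		 * initialized once SB_BORN is observed.
		 */
		wait_var_event(&sb->s_flags,
			       smp_load_acquire(&sb->s_flags) &
					(SB_BORN | SB_DYING));
		goto relock;
	}

Either of the two waker variants above is enough for this waiter: what
matters is release ordering towards the smp_load_acquire() and a full
barrier before the waitqueue_active() check in wake_up_var().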
> @@ -1715,7 +1813,7 @@ int freeze_super(struct super_block *sb)
>  	int ret;
>
>  	atomic_inc(&sb->s_active);
> -	super_lock_excl(sb);
> +	__super_lock_excl(sb);
>  	if (sb->s_writers.frozen != SB_UNFROZEN) {
>  		deactivate_locked_super(sb);
>  		return -EBUSY;
> @@ -1737,7 +1835,7 @@ int freeze_super(struct super_block *sb)
>  	/* Release s_umount to preserve sb_start_write -> s_umount ordering */
>  	super_unlock_excl(sb);
>  	sb_wait_write(sb, SB_FREEZE_WRITE);
> -	super_lock_excl(sb);
> +	__super_lock_excl(sb);
>
>  	/* Now we go and block page faults... */
>  	sb->s_writers.frozen = SB_FREEZE_PAGEFAULT;
> @@ -1820,7 +1918,7 @@ static int thaw_super_locked(struct super_block *sb)
>   */
>  int thaw_super(struct super_block *sb)
>  {
> -	super_lock_excl(sb);
> +	__super_lock_excl(sb);
>  	return thaw_super_locked(sb);
>  }

Maybe we can have in these places instead:

	if (!super_lock_excl(sb))
		WARN(1, "Dying superblock while freezing!");

That way we reduce the number of __super_lock_excl() calls, which are
kind of special. In these places we hold an active reference so
practically this is equivalent. Just a thought, pick whatever you find
better, I don't have a strong opinion but wanted to share this idea.

								Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR