On Fri 18-08-23 14:46:57, Christian Brauner wrote: > On Fri, Aug 18, 2023 at 02:02:15PM +0200, Jan Kara wrote: > > On Fri 18-08-23 12:54:17, Christian Brauner wrote: > > > Recent patches experiment with making it possible to allocate a new > > > superblock before opening the relevant block device. Naturally this has > > > intricate side-effects that we get to learn about while developing this. > > > > > > Superblock allocators such as sget{_fc}() return with s_umount of the > > > new superblock held and lock ordering currently requires that block > > > level locks such as bdev_lock and open_mutex rank above s_umount. > > > > > > Before aca740cecbe5 ("fs: open block device after superblock creation") > > > ordering was guaranteed to be correct as block devices were opened prior > > > to superblock allocation and thus s_umount wasn't held. But now s_umount > > > must be dropped before opening block devices to avoid locking > > > violations. > > > > > > This has consequences. The main one being that iterators over > > > @super_blocks and @fs_supers that grab a temporary reference to the > > > superblock can now also grab s_umount before the caller has managed to > > > open block devices and called fill_super(). So whereas before such > > > iterators or concurrent mounts would have simply slept on s_umount until > > > SB_BORN was set or the superblock was discard due to initalization > > > failure they can now needlessly spin through sget{_fc}(). > > > > > > If the caller is sleeping on bdev_lock or open_mutex one caller waiting > > > on SB_BORN will always spin somewhere and potentially this can go on for > > > quite a while. > > > > > > It should be possible to drop s_umount while allowing iterators to wait > > > on a nascent superblock to either be born or discarded. This patch > > > implements a wait_var_event() mechanism allowing iterators to sleep > > > until they are woken when the superblock is born or discarded. > > > > > > This also allows us to avoid relooping through @fs_supers and > > > @super_blocks if a superblock isn't yet born or dying. > > > > > > Link: aca740cecbe5 ("fs: open block device after superblock creation") > > > Signed-off-by: Christian Brauner <brauner@xxxxxxxxxx> > > > > Looks mostly good to me. I've spotted only a couple of nits and one > > possible memory ordering issue... ... > > > +/* wait and acquire read-side of @sb->s_umount */ > > > +static inline bool super_lock_shared(struct super_block *sb) > > > +{ > > > + return super_lock(sb, false); > > > +} > > > + > > > +/* wait and acquire write-side of @sb->s_umount */ > > > +static inline bool super_lock_excl(struct super_block *sb) > > > +{ > > > + return super_lock(sb, true); > > > +} > > > + > > > +/* wake waiters */ > > > +#define SUPER_WAKE_FLAGS (SB_BORN | SB_DYING) > > > +static void super_wake(struct super_block *sb, unsigned int flag) > > > +{ > > > + unsigned int flags = sb->s_flags; > > > + > > > + WARN_ON_ONCE((flag & ~SUPER_WAKE_FLAGS)); > > > + WARN_ON_ONCE(hweight32(flag & SUPER_WAKE_FLAGS) > 1); > > > > Maybe assert here that s_umount is held? > > I think that should be asserted in callers because we don't hold it when > we do wake SB_DEAD in deactivate_locked_super() because we don't have or > need it. Right, I've realized that after I've sent this. > > > + > > > + /* > > > + * Pairs with smp_load_acquire() in super_lock() and > > > + * ensures that @flag is set before we wake anyone. > > > + */ > > > + smp_store_release(&sb->s_flags, flags | flag); > > > + wake_up_var(&sb->s_flags); > > > > As I'm thinking about it now, we may need at least a smp_rmb() between the > > store and wake_up_var(). What I'm worried about is the following: > > > > TASK1 TASK2 > > super_wake() super_lock() > > check s_flags, SB_BORN not set yet > > waitqueue_active() from wake_up_var() > > which got reordered by the CPU before > > smp_store_release(). This seems possible > > because release is a one-way permeable in > > this direction. > > wait_var_event(..) > > prepare_to_wait_event() > > wait_born() > > SB_BORN still not set => sleep > > smp_store_release() sets SB_BORN > > wake_up_var() does nothing because it thinks > > the waitqueue is empty. > > Then I propse we use smp_mb() here similar to what we do for __I_NEW. > Does that sounds ok? Sure, fine by me. Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR