On Thu, Aug 17, 2023 at 04:16:49PM +0200, Jan Kara wrote:
> On Thu 17-08-23 15:24:54, Christian Brauner wrote:
> > On Thu, Aug 17, 2023 at 02:50:21PM +0200, Jan Kara wrote:
> > > On Thu 17-08-23 12:47:43, Christian Brauner wrote:
> > > > Recent patches experiment with making it possible to allocate a new
> > > > superblock before opening the relevant block device. Naturally this has
> > > > intricate side-effects that we get to learn about while developing this.
> > > >
> > > > Superblock allocators such as sget{_fc}() return with s_umount of the
> > > > new superblock held and lock ordering currently requires that block level
> > > > locks such as bdev_lock and open_mutex rank above s_umount.
> > > >
> > > > Before aca740cecbe5 ("fs: open block device after superblock creation")
> > > > ordering was guaranteed to be correct as block devices were opened prior
> > > > to superblock allocation and thus s_umount wasn't held. But now s_umount
> > > > must be dropped before opening block devices to avoid locking
> > > > violations.
> > > >
> > > > This has consequences. The main one being that iterators over
> > > > @super_blocks and @fs_supers that grab a temporary reference to the
> > > > superblock can now also grab s_umount before the caller has managed to
> > > > open block devices and called fill_super(). So whereas before such
> > > > iterators or concurrent mounts would have simply slept on s_umount until
> > > > SB_BORN was set or the superblock was discarded due to initialization
> > > > failure they can now needlessly spin through sget{_fc}().
> > > >
> > > > If the caller is sleeping on bdev_lock or open_mutex one caller waiting
> > > > on SB_BORN will always spin somewhere potentially this can go on for
> > >                                         ^^ and potentially?
> > > > quite a while.
> > > >
> > > > It should be possible to drop s_umount while allowing iterators to wait
> > > > on a nascent superblock to either be born or discarded.
> > > > This patch
> > > > implements a wait_var_event() mechanism allowing iterators to sleep
> > > > until they are woken when the superblock is born or discarded.
> > > >
> > > > This should also allows us to avoid relooping through @fs_supers and
> > >        ^^^ superfluous "should"
> > > > @super_blocks if a superblock isn't yet born or dying.
> > > >
> > > > Link: aca740cecbe5 ("fs: open block device after superblock creation")
> > > > Signed-off-by: Christian Brauner <brauner@xxxxxxxxxx>
> > <snip>
> > > > @@ -841,15 +942,14 @@ struct super_block *get_active_super(struct block_device *bdev)
> > > >          if (!bdev)
> > > >                  return NULL;
> > > >
> > > > -restart:
> > > >          spin_lock(&sb_lock);
> > > >          list_for_each_entry(sb, &super_blocks, s_list) {
> > > >                  if (hlist_unhashed(&sb->s_instances))
> > > >                          continue;
> > > >                  if (sb->s_bdev == bdev) {
> > > >                          if (!grab_super(sb))
> > > > -                                goto restart;
> > > > -                        super_unlock_write(sb);
> > > > +                                return NULL;
> > > Let me check whether I understand the rationale of this change: We found
> > > a matching sb and it's SB_DYING. Instead of waiting for it to die and retry
> > > the search (to likely not find anything) we just return NULL right away to
> > > save us some trouble.
> > Thanks for that question! I was missing something. Before these changes,
> > when a superblock was unmounted and it hit deactivate_super() it could do:
> >
> > P1                                          P2
> > deactivate_locked_super()                   grab_super()
> > -> if (!atomic_add_unless(&s->s_active, -1, 1))
> >                                             -> super_lock_write()
> >                                             SB_BORN && !atomic_inc_add_unless(s->s_active)
> >                                             // fails, loop until it goes away
> > -> super_lock_write()
> > // remove sb from fs_supers
> I don't think this can happen as you describe it. deactivate_super() +
> deactivate_locked_super() are written in a way so that the last s_active
> reference is dropped while sb->s_umount is held for writing. And
> grab_super() tries the increment under sb->s_umount as well.
> So either
> grab_super() wins the race and deactivate_locked_super() just drops one
> refcount and exits, or deactivate_locked_super() wins the race and
> grab_super() can come only after the sb is shutdown. Then the increment
> will fail and we'll loop as you describe. Perhaps that's what you meant,
> just you've ordered things wrongly...

Yes, sorry that's what I meant.

>
> > That can still happen in the new scheme so my patch needs a fix to wait
> > for SB_DYING to be broadcast when no active reference can be acquired
> > anymore because then we know that we're about to shut this down. Either
> > that or spinning but I think we should just wait as we can now do that
> > with my proposal.
> ... But in that case by the time grab_super() is able to get sb->s_umount
> semaphore, SB_DYING is already set.

Yeah, that's what I was about to reply right now after having thought about it. :)
Thanks!

> > > > {
> > > > -        super_lock_read(sb);
> > > > -        if (!sb->s_root ||
> > > > -            (sb->s_flags & (SB_ACTIVE | SB_BORN)) != (SB_ACTIVE | SB_BORN)) {
> > > > +        bool born = super_wait_read(sb);
> > > > +
> > > > +        if (!born || !sb->s_root || !(sb->s_flags & SB_ACTIVE)) {
> > > >                  super_unlock_read(sb);
> > > >                  return false;
> > > >          }
> > > > @@ -1572,7 +1674,7 @@ int vfs_get_tree(struct fs_context *fc)
> > > >           * flag.
> > > >           */
> > > >          smp_wmb();
> > >
> > > Is the barrier still needed here when super_wake() has smp_store_release()?
> > I wasn't sure. The barrier tries to ensure that everything before
> > SB_BORN is seen by super_cache_count(). Whereas the smp_store_release()
> > really is about the flag. Maybe the smp_wmb() would be sufficient for
> > that but since I wasn't sure the additional smp_store_release() is way
> > more obvious imho.
> I was looking into memory-barriers.txt and it has:
>
>  (6) RELEASE operations.
>
>      This also acts as a one-way permeable barrier.
>      It guarantees that all
>      memory operations before the RELEASE operation will appear to happen
>      before the RELEASE operation with respect to the other components of the
>      system.
>
> Which sounds like smp_store_release() of SB_BORN should be enough to make
> super_cache_count() see all the stores before it if it sees SB_BORN set...

Ok, thanks for checking!

> > > > -        sb->s_flags |= SB_BORN;
> > > > +        super_wake(sb, SB_BORN);
> > >
> > > I'm also kind of wondering whether when we have SB_BORN and SB_DYING isn't
> > > the SB_ACTIVE flag redundant. SB_BORN is set practically at the same moment
> > > as SB_ACTIVE. SB_ACTIVE gets cleared somewhat earlier than SB_DYING is set
> > > but I believe SB_DYING can be set earlier (after all by the time SB_ACTIVE
> > > is cleared we have sb->s_root == NULL which basically stops most of the
> > > places looking at superblocks). As I'm grepping we've grown a lot of
> > > SB_ACTIVE handling all over the place so this would be a bit non-trivial
> > > but I believe it will make it easier for filesystem developers to decide
> > > which flag they should be using... Also we could then drop sb->s_root
> > > checks from many places because the locking helpers will return false if
> > > SB_DYING is set.
> > Certainly something to explore but no promises. Would you be open to
> > doing this as a follow-up patch? If you have a clearer idea here then I
> > wouldn't mind you piling this on top of this series even.
> Sure, the cleanup of SB_ACTIVE probably deserves a separate patchset
> because it's going to involve a lot of individual filesystem changes.

Yeah, agreed.