Re: [PATCH 03/19] fs: release anon dev_t in deactivate_locked_super

Al Viro <viro@xxxxxxxxxxxxxxxxxx> · Thu, 14 Sep 2023 17:58:05 +0100

On Thu, Sep 14, 2023 at 04:02:25PM +0200, Christian Brauner wrote:

> Yes, you're right that making the superblock and not the filesytem type
> the bd_holder changes the logic and we are aware of that of course. And
> it requires changes such as moving additional block device closing from
> where some callers currently do it.

Details, please?

> But the filesytem type is not a very useful holder itself and has other
> drawbacks. The obvious one being that it requires us to wade through all
> superblocks on the system trying to find the superblock associated with
> a given block device continously grabbing and dropping sb_lock and
> s_umount. None of that is very pleasant nor elegant and it is for sure
> not very easy to understand (Plus, it's broken for btrfs freezing and
> syncing via block level ioctls.).

"Constantly" is a bit of a stretch - IIRC, we grabbed sb_lock once, then
went through the list comparing ->s_bdev (without any extra locking),
then bumped ->s_count on the found superblock, dropped sb_lock,
grabbed ->s_umount on the sucker and verified it's still alive.

Repeated grabbing of any lock happened only on a race with fs shutdown;
normal case is one spin_lock, one spin_unlock, one down_read().

Oh, well...

> Using the superblock as holder makes this go away and is overall a lot
> more useful and intuitive and can be extended to filesystems with
> multiple devices (Of which we apparently are bound to get more.).
>
> So I think this change is worth the pain.
> 
> It's a fair point that these lifetime rules should be documented in
> Documentation/filesystems/. The old lifetime documentation is too sparse
> to be useful though.

What *are* these lifetime rules?  Seriously, you have 3 chunks of
fs-dependent actions at the moment:
	* the things needed to get rid of internal references pinning
inodes/dentries + whatever else we need done before generic_shutdown_super()
	* the stuff to be done between generic_shutdown_super() and
making the sucker invisible to sget()/sget_fc()
	* the stuff that must be done after we are sure that sget
callbacks won't be looking at this instance.

Note that Christoph's series has mashed (2) and (3) together, resulting
in UAF in a bunch of places.  And I'm dead serious about
Documentation/filesystems/porting being the right place; any development
tree of any filesystem (in-tree one or not) will have to go through the
changes and figure out WTF to do with their existing code.  We are
going to play whack-a-mole for at least several years as development
branches get rebased and merged.

Incidentally, I'm going to add a (belated by 10 years) chunk in porting.rst
re making sure that anything in superblock that might be needed by methods
called in RCU mode should *not* be freed without an RCU delay...  Should've
done that back in 3.12 merge window when RCU'd vfsmounts went in; as it
is, today we have several filesystems with exact same kind of breakage.
hfsplus and affs breakage had been there in 3.13 (missed those two), exfat
and ntfs3 - introduced later, by initial merges of filesystems in question.
Missed on review...

Hell knows - perhaps Documentation/filesystems/whack-a-mole might be a good
idea...