On Thu, Oct 26, 2023 at 12:35:03PM +0200, Jan Kara wrote:
> On Wed 25-10-23 22:46:29, Christian Brauner wrote:
> > On Wed, Oct 25, 2023 at 07:20:57PM +0200, Jan Kara wrote:
> > > Hello!
> > > 
> > > On Tue 24-10-23 16:53:38, Christian Brauner wrote:
> > > > This is a mechanism that allows the holder of a block device to yield
> > > > device access before actually closing the block device.
> > > > 
> > > > If someone yields a device then any concurrent opener claiming the
> > > > device exclusively with the same blk_holder_ops as the current owner can
> > > > wait for the device to be given up. Filesystems by default use
> > > > fs_holder_ops and so can wait on each other.
> > > > 
> > > > This mechanism allows us to simplify superblock handling quite a bit at
> > > > the expense of requiring filesystems to yield devices. A filesystem must
> > > > yield devices under s_umount. This allows costly work to be done outside
> > > > of s_umount.
> > > > 
> > > > There's nothing wrong with the way we currently do things but this does
> > > > allow us to simplify things and kills a whole class of theoretical UAF
> > > > when walking the superblock list.
> > > 
> > > I'm not sure why it is better to create a new ->yield callback called under
> > > sb->s_umount rather than just move blkdev_put() calls back into
> > > ->put_super? Or at least yielding could be done in ->put_super instead of
> > 
> > The main reason was to not call potentially expensive
> > blkdev_put()/bdev_release() under s_umount. If we don't care about this
> > though then this shouldn't be a problem.
> 
> So I would not be really concerned about performance here. The superblock
> is dying, nobody can do anything about that until the superblock is fully
> dead and cleaned up. Maybe some places could skip such superblocks more
> quickly but so far I'm not convinced it matters in practice (e.g. writeback
> holds s_umount over the whole sync(1) time and nobody complains). And as
> you mention below, we have been doing this for a long time without anybody
> really complaining.

Ok, that's a good point.

> > And yes, then we need to move
> > blkdev_put()/bdev_release() under s_umount including the main block
> > device. IOW, we need to ensure that all bdev calls are done under
> > s_umount before we remove the superblock from the instance list.
> 
> This is about those seemingly spurious "device busy" errors when the
> superblock hasn't closed its devices yet, isn't it? But we now remove

Yes, because we tie superblock and block device neatly together.

> superblock from s_instances list in kill_super_notify() and until that
> moment SB_DYING is protecting us from racing mounts. So is there some other
> problem?

No, there isn't a problem at all. It's all working fine but it was
initially a little annoying as we had to update filesystems to ensure
that sb->s_fs_info is kept alive. But it's all fixed.

The possible advantage is that if we drop all block devices under
s_umount then we can remove the superblock from fs_type->s_instances in
the old location again. I'm not convinced it's worth it but it's a
possible simplification. I'm not even arguing it's objectively better;
I think it's a matter of taste in the end.
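To make that concrete: dropping devices under s_umount would simply mean
doing it from ->put_super(), which already runs with s_umount held for
writing. A purely illustrative sketch ("examplefs" and its fields are
made up, and the exact release helper and its signature depend on the
kernel version):

#include <linux/blkdev.h>
#include <linux/fs.h>
#include <linux/slab.h>

/* Hypothetical example filesystem with one extra (journal) device. */
struct examplefs_sb_info {
	struct bdev_handle *journal_bdev_handle;
};

static void examplefs_put_super(struct super_block *sb)
{
	struct examplefs_sb_info *sbi = sb->s_fs_info;

	/*
	 * ->put_super() is called from generic_shutdown_super() with
	 * sb->s_umount held for writing, so any device dropped here is
	 * gone before the superblock leaves the instance list and
	 * before a concurrent mounter could try to claim it.
	 */
	if (sbi->journal_bdev_handle)
		bdev_release(sbi->journal_bdev_handle);

	sb->s_fs_info = NULL;
	kfree(sbi);
}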
> 
> > I think
> > that should be fine but I wanted to propose an alternative to that as
> > well: cheap mark-for-release under s_umount and heavy-duty without
> > s_umount. But I guess it doesn't matter because most filesystems used
> > to close devices under s_umount before anyway. Let me know what you
> > think makes the most sense.
> 
> I think we should make it as simple as possible for filesystems. As I said

Yes, I fully agree. The really good thing about the current mechanism is
that it's completely vfs-only. With the s_umount/open_mutex ordering
fixed, filesystems can now close block devices wherever they like and
the vfs will ensure that they don't get spurious EBUSY. And the
filesystem doesn't have to know a thing about it or take any care.

> above I don't think s_umount hold time really matters so the only thing
> that limits us are locking constraints. During superblock shutdown the only
> thing that currently cannot be done under s_umount (that I'm aware of) is the
> teardown of the sb->s_bdi because that waits for writeback threads and
> those can be blocked waiting for s_umount.

Which we don't need to change if it's working fine.

> 
> So if we wanted we could just move ext4 & xfs bdev closing back into
> ->put_super to avoid ext4_kill_sb() and somewhat slim down xfs_mount_free()
> but otherwise I don't see much space for simplification or need for adding
> more callbacks :)

I mean, that would only be useful if we really wanted to close all block
devices under s_umount so that we could possibly remove the waiting
mechanism we have right now. Otherwise I think it's churn for the sake
of churn and I quite like what we have right now.
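For reference, what we have right now amounts to roughly the following.
This is simplified, conceptual pseudo-code only, not the actual block
layer or fs/super.c code, and both helpers below are made up:

#include <linux/blkdev.h>
#include <linux/errno.h>
#include <linux/types.h>

/* Made-up helpers standing in for the real claim/wait machinery. */
int try_exclusive_claim(struct block_device *bdev, void *holder,
			const struct blk_holder_ops *hops);
bool wait_for_dying_holder(struct block_device *bdev,
			   const struct blk_holder_ops *hops);

static int claim_bdev_for_mount(struct block_device *bdev, void *holder,
				const struct blk_holder_ops *hops)
{
	for (;;) {
		int ret = try_exclusive_claim(bdev, holder, hops);

		if (ret != -EBUSY)
			return ret;
		/*
		 * The device is still claimed by a superblock that uses
		 * the same holder ops (fs_holder_ops) and is already
		 * dying (SB_DYING). Wait until kill_super_notify() has
		 * taken it off s_instances -- by then it has dropped
		 * its devices -- and retry. If the holder is alive, the
		 * device really is busy.
		 */
		if (!wait_for_dying_holder(bdev, hops))
			return -EBUSY;
	}
}

The filesystem never sees any of this, which is the part I'd like to
keep.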