On Thu, Oct 26, 2023 at 12:35:03PM +0200, Jan Kara wrote:
> On Wed 25-10-23 22:46:29, Christian Brauner wrote:
> > On Wed, Oct 25, 2023 at 07:20:57PM +0200, Jan Kara wrote:
> > > Hello!
> > > 
> > > On Tue 24-10-23 16:53:38, Christian Brauner wrote:
> > > > This is a mechanism that allows the holder of a block device to yield
> > > > device access before actually closing the block device.
> > > > 
> > > > If someone yields a device then any concurrent opener claiming the
> > > > device exclusively with the same blk_holder_ops as the current owner can
> > > > wait for the device to be given up. Filesystems by default use
> > > > fs_holder_ops and so can wait on each other.
> > > > 
> > > > This mechanism allows us to simplify superblock handling quite a bit at
> > > > the expense of requiring filesystems to yield devices. A filesystem must
> > > > yield devices under s_umount. This allows costly work to be done outside
> > > > of s_umount.
> > > > 
> > > > There's nothing wrong with the way we currently do things but this does
> > > > allow us to simplify things and kills a whole class of theoretical UAF
> > > > when walking the superblock list.
> > > 
> > > I'm not sure why it is better to create a new ->yield callback called under
> > > sb->s_umount rather than just move blkdev_put() calls back into
> > > ->put_super? Or at least yielding could be done in ->put_super instead of
> > 
> > The main reason was to not call potentially expensive
> > blkdev_put()/bdev_release() under s_umount. If we don't care about this
> > though then this shouldn't be a problem.
> 
> So I would not be really concerned about performance here. The superblock
> is dying, nobody can do anything about that until the superblock is fully
> dead and cleaned up. Maybe some places could skip such superblocks more
> quickly but so far I'm not convinced it matters in practice (e.g. writeback
> holds s_umount over the whole sync(1) time and nobody complains). And as
> you mention below, we have been doing this for a long time without anybody
> really complaining.

Ok, that's a good point.

> > And yes, then we need to move
> > blkdev_put()/bdev_release() under s_umount including the main block
> > device. IOW, we need to ensure that all bdev calls are done under
> > s_umount before we remove the superblock from the instance list.
> 
> This is about those seemingly spurious "device busy" errors when the
> superblock hasn't closed its devices yet, isn't it? But we now remove

Yes, because we tie superblock and block device neatly together.

> superblock from s_instances list in kill_super_notify() and until that
> moment SB_DYING is protecting us from racing mounts. So is there some other
> problem?

No, there isn't a problem at all. It's all working fine but it was
initially a little annoying as we had to update filesystems to ensure
that sb->s_fs_info is kept alive. But it's all fixed.

The possible advantage is that if we drop all block devices under
s_umount then we can remove the superblock from fs_type->s_instances in
the old location again. I'm not convinced it's worth it but it's a
possible simplification. I'm not even arguing it's objectively better;
I think it's a matter of taste in the end.
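To make that concrete: dropping devices under s_umount would simply mean
doing it from ->put_super(), which already runs with s_umount held for
writing. A purely illustrative sketch ("examplefs" and its fields are
made up, and the exact release helper and its signature depend on the
kernel version):

#include <linux/blkdev.h>
#include <linux/fs.h>
#include <linux/slab.h>

/* Hypothetical example filesystem with one extra (journal) device. */
struct examplefs_sb_info {
	struct bdev_handle *journal_bdev_handle;
};

static void examplefs_put_super(struct super_block *sb)
{
	struct examplefs_sb_info *sbi = sb->s_fs_info;

	/*
	 * ->put_super() is called from generic_shutdown_super() with
	 * sb->s_umount held for writing, so any device dropped here is
	 * gone before the superblock leaves the instance list and
	 * before a concurrent mounter could try to claim it.
	 */
	if (sbi->journal_bdev_handle)
		bdev_release(sbi->journal_bdev_handle);

	sb->s_fs_info = NULL;
	kfree(sbi);
}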
> 
> > I think
> > that should be fine but I wanted to propose an alternative to that as
> > well: cheap mark-for-release under s_umount and heavy-duty without
> > s_umount. But I guess it doesn't matter because most filesystems used
> > to close devices under s_umount before anyway. Let me know what you
> > think makes the most sense.
> 
> I think we should make it as simple as possible for filesystems. As I said

Yes, I fully agree. The really good thing about the current mechanism is
that it's completely vfs-only. With the s_umount/open_mutex ordering
fixed, filesystems can now close block devices wherever they like and
the vfs will ensure that they don't get spurious EBUSY. And the
filesystem doesn't have to know a thing about it or take any care.

> above I don't think s_umount hold time really matters so the only thing
> that limits us are locking constraints. During superblock shutdown the only
> thing that currently cannot be done under s_umount (that I'm aware of) is the
> teardown of the sb->s_bdi because that waits for writeback threads and
> those can be blocked waiting for s_umount.

Which we don't need to change if it's working fine.

> 
> So if we wanted we could just move ext4 & xfs bdev closing back into
> ->put_super to avoid ext4_kill_sb() and somewhat slim down xfs_mount_free()
> but otherwise I don't see much space for simplification or need for adding
> more callbacks :)

I mean, that would only be useful if we really wanted to close all block
devices under s_umount so that we could possibly remove the waiting
mechanism we have right now. Otherwise I think it's churn for the sake
of churn and I quite like what we have right now.
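For reference, what we have right now amounts to roughly the following.
This is simplified, conceptual pseudo-code only, not the actual block
layer or fs/super.c code, and both helpers below are made up:

#include <linux/blkdev.h>
#include <linux/errno.h>
#include <linux/types.h>

/* Made-up helpers standing in for the real claim/wait machinery. */
int try_exclusive_claim(struct block_device *bdev, void *holder,
			const struct blk_holder_ops *hops);
bool wait_for_dying_holder(struct block_device *bdev,
			   const struct blk_holder_ops *hops);

static int claim_bdev_for_mount(struct block_device *bdev, void *holder,
				const struct blk_holder_ops *hops)
{
	for (;;) {
		int ret = try_exclusive_claim(bdev, holder, hops);

		if (ret != -EBUSY)
			return ret;
		/*
		 * The device is still claimed by a superblock that uses
		 * the same holder ops (fs_holder_ops) and is already
		 * dying (SB_DYING). Wait until kill_super_notify() has
		 * taken it off s_instances -- by then it has dropped
		 * its devices -- and retry. If the holder is alive, the
		 * device really is busy.
		 */
		if (!wait_for_dying_holder(bdev, hops))
			return -EBUSY;
	}
}

The filesystem never sees any of this, which is the part I'd like to
keep.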