On Tue, Jul 25, 2023 at 02:35:17PM +0200, Christian Brauner wrote: > On Mon, Jul 24, 2023 at 10:51:45AM -0700, Christoph Hellwig wrote: > > From: Jan Kara <jack@xxxxxxx> > > > > Currently get_tree_bdev and mount_bdev open the block device before > > commiting to allocating a super block. This means the block device > > is opened even for bind mounts and other reuses of the super_block. > > > > That creates problems for restricting the number of writers to a device, > > and also leads to a unusual and not very helpful holder (the fs_type). > > > > Reorganize the mount code to first look whether the superblock for a > > particular device is already mounted and open the block device only if > > it is not. > > > > Signed-off-by: Jan Kara <jack@xxxxxxx> > > [hch: port to before the bdev_handle changes, > > duplicate the bdev read-only check from blkdev_get_by_path, > > extend the fsfree_mutex coverage to protect against freezes, > > fix an open bdev leak when the bdev is frozen, > > use the bdev local variable more, > > rename the s variable to sb to be more descriptive] > > Signed-off-by: Christoph Hellwig <hch@xxxxxx> > > --- > > > > So I promised to get a series that builds on top of this ready, but > > I'm way to busy and this will take a while. Getting this reworked > > version of Jan's patch out for everyone to use it as a based given > > that Christian is back from vacation, and I think Jan should be about > > back now as well. > > I'm in the middle of reviewing this. You're probably aware, but both > btrfs and nilfs at least still open the devices first since they > open-code their bdev and sb handling. I've removed the references to bind mounts from the commit message. I mentioned in [1] and [2] that this problem is really related to superblocks at it's core. It's just that technically a bind-mount would be created in the following scenario where two processes race to create a superblock: P1 P2 fd_fs = fsopen("ext4"); fd_fs = fsopen("ext4"); fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/dev/sda"); fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/dev/sda"); // wins and creates superblock fsconfig(fd_fs, FSCONFIG_CMD_CREATE, ...) // finds compatible superblock of P1 // spins until P1 sets SB_BORN and grabs a reference fsconfig(fd_fs, FSCONFIG_CMD_CREATE, ...) P1 P2 fd_mnt1 = fsmount(fd_fs); fd_mnt2 = fsmount(fd_fs); move_mount(fd_mnt1, "/A") move_mount(fd_mnt2, "/B") If we forbid writes to a mounted block device like Jan's other patch did then before your's and Jan's patch P2 would fail at FSCONFIG_CMD_CREATE iirc. But bind-mounting itself isn't really broken. For example, even after P2 failed to create the superblock it could very well still do: mount --bind /A /C mount --bind /A /D or whatever as that never goes into get_tree again. The problem really is that stuff like: mount -t ext4 /dev/sda /E would be broken but the problem is that this request is arguably ambiguous when seen from userspace. It either means: (1) create a superblock and mount it at /E (2) create a bind-mount for an existing superblock at /E For P1 (1) is the case but for P2 (2) is the case. Most of the time users really want (1). Right now, we have no way for a user to be sure that (1) did take place aka that they did indeed create the superblock. That can be a problem in environments where you need to be sure that you did create the superblock with the correct options. Because for P2 all mount options that they set may well be completely ignored unless e.g., P1 did request rw and P2 did request ro. This is why I'd like to add something like FSCONFIG_CMD_CREATE_EXCL which would fail if the caller didn't actually create the superblock but found an existing one instead. [1]: https://lore.kernel.org/linux-block/20230704-fasching-wertarbeit-7c6ffb01c83d@brauner [2]: https://lore.kernel.org/linux-block/20230705-pumpwerk-vielversprechend-a4b1fd947b65@brauner