On Thu, Mar 14, 2024 at 03:47:52PM +0100, Christian Brauner wrote: > On Thu, Mar 14, 2024 at 12:10:59PM +0100, Christian Brauner wrote: > > On Tue, Mar 12, 2024 at 07:32:35PM -0700, Christoph Hellwig wrote: > > > Now that this is in mainline it seems to cause blktests to crash > > > nbd/003 with a rather non-obvious oops for me: > > > > Ok, will be looking into that next. > > Ok, I know what's going on. Basically, fput() on the block device runs > asynchronously which means that bdev->bd_holder can still be set to @sb > after it has already been freed. Let me illustrate what I mean: > > P1 P2 > mount(sb) fd = open("/dev/nbd", ...) > -> file = bdev_file_open_by_dev(..., sb, ...) > bdev->bd_holder = sb; > > // Later on: > > umount(sb) > ->kill_block_super(sb) > |----> [fput() -> deferred via task work] > -> put_super(sb) -> frees the sb via rcu > | > | nbd_ioctl(NBD_CLEAR_SOCK) > | -> disk_force_media_change() > | -> bdev_mark_dead() > | -> fs_mark_dead() > | // Finds bdev->bd_holder == sb > |-> file->release::bdev_release() // Tries to get reference to it but it's freed; frees it again > bdev->bd_holder = NULL; > > Two solutions that come to my mind: > > [1] Indicate to fput() that this is an internal block devices open and > thus just close it synchronously. This is fine. First, because the block > device superblock is never unmounted or anything so there's no risk > from hanging there for any reason. Second, bdev_release() also ran > synchronously so any issue that we'd see from calling > file->f_op->release::bdev_release() we would have seen from > bdev_release() itself. See [1.1] for a patch. > > (2) Take a temporary reference to the holder when opening the block > device. This is also fine afaict because we differentiate between > passive and active references on superblocks and what we already do > in fs_bdev_mark_dead() and friends. This mean we don't have to mess > around with fput(). See [1.2] for a patch. > > (3) Revert and cry. No patch. > > Personally, I think (2) is more desirable because we don't lose the > async property of fput() and we don't need to have a new FMODE_* flag. > I'd like to do some more testing with this. Thoughts? > > [1.1]: > Signed-off-by: Christian Brauner <brauner@xxxxxxxxxx> > --- > block/bdev.c | 1 + > fs/file_table.c | 5 +++++ > include/linux/fs.h | 3 +++ > 3 files changed, 9 insertions(+) > > diff --git a/block/bdev.c b/block/bdev.c > index e7adaaf1c219..d0c208a04b04 100644 > --- a/block/bdev.c > +++ b/block/bdev.c > @@ -969,6 +969,7 @@ struct file *bdev_file_open_by_dev(dev_t dev, blk_mode_t mode, void *holder, > return bdev_file; > } > ihold(bdev->bd_inode); > + bdev_file->f_mode |= FMODE_BLOCKDEV; > > ret = bdev_open(bdev, mode, holder, hops, bdev_file); > if (ret) { > diff --git a/fs/file_table.c b/fs/file_table.c > index 4f03beed4737..48d35dd67020 100644 > --- a/fs/file_table.c > +++ b/fs/file_table.c > @@ -473,6 +473,11 @@ void fput(struct file *file) > if (atomic_long_dec_and_test(&file->f_count)) { > struct task_struct *task = current; > > + if (unlikely((file->f_mode & FMODE_BLOCKDEV))) { > + __fput(file); > + return; > + } > + > if (unlikely(!(file->f_mode & (FMODE_BACKING | FMODE_OPENED)))) { > file_free(file); > return; > diff --git a/include/linux/fs.h b/include/linux/fs.h > index d5d5a4ee24f0..ceac9c0316a6 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -121,6 +121,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset, > #define FMODE_PWRITE ((__force fmode_t)0x10) > /* File is opened for execution with sys_execve / sys_uselib */ > #define FMODE_EXEC ((__force fmode_t)0x20) > + > +/* File is opened as block device. */ > +#define FMODE_BLOCKDEV ((__force fmode_t)0x100) > /* 32bit hashes as llseek() offset (for directories) */ > #define FMODE_32BITHASH ((__force fmode_t)0x200) > /* 64bit hashes as llseek() offset (for directories) */ > -- > 2.43.0 > > [1.2]: > Sketched-by: Christian Brauner <brauner@xxxxxxxxxx> > --- > block/bdev.c | 4 ++++ > fs/super.c | 17 +++++++++++++++++ > include/linux/blkdev.h | 3 +++ > 3 files changed, 24 insertions(+) > > diff --git a/block/bdev.c b/block/bdev.c > index e7adaaf1c219..a0d5960dc2b9 100644 > --- a/block/bdev.c > +++ b/block/bdev.c > @@ -627,6 +627,8 @@ static void bd_end_claim(struct block_device *bdev, void *holder) > whole->bd_holder = NULL; > mutex_unlock(&bdev_lock); > > + if (bdev->bd_holder_ops && bdev->bd_holder_ops->put_holder) > + bdev->bd_holder_ops->put_holder(holder); That should move into bdev_release() obviously so it mirrors bdev_open(). Plus, currently it's a nop because we just NULLed bdev->bd_holder_ops above. But you get the idea, I hope. > /* > * If this was the last claim, remove holder link and unblock evpoll if > * it was a write holder. > @@ -902,6 +904,8 @@ int bdev_open(struct block_device *bdev, blk_mode_t mode, void *holder, > bdev_file->f_mode |= FMODE_NOWAIT; > bdev_file->f_mapping = bdev->bd_inode->i_mapping; > bdev_file->f_wb_err = filemap_sample_wb_err(bdev_file->f_mapping); > + if (hops && hops->get_holder) > + hops->get_holder(holder); > bdev_file->private_data = holder; > > return 0; > diff --git a/fs/super.c b/fs/super.c > index ee05ab6b37e7..64dbbdbed93a 100644 > --- a/fs/super.c > +++ b/fs/super.c > @@ -1515,11 +1515,28 @@ static int fs_bdev_thaw(struct block_device *bdev) > return error; > } > > +static void fs_bdev_super_get(void *data) > +{ > + struct super_block *sb = data; > + > + spin_lock(&sb_lock); > + sb->s_count++; > + spin_unlock(&sb_lock); > +} > + > +static void fs_bdev_super_put(void *data) > +{ > + struct super_block *sb = data; > + put_super(sb); > +} > + > const struct blk_holder_ops fs_holder_ops = { > .mark_dead = fs_bdev_mark_dead, > .sync = fs_bdev_sync, > .freeze = fs_bdev_freeze, > .thaw = fs_bdev_thaw, > + .get_holder = fs_bdev_super_get, > + .put_holder = fs_bdev_super_put, > }; > EXPORT_SYMBOL_GPL(fs_holder_ops); > > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h > index f9b87c39cab0..d919e8bcb2c1 100644 > --- a/include/linux/blkdev.h > +++ b/include/linux/blkdev.h > @@ -1505,6 +1505,9 @@ struct blk_holder_ops { > * Thaw the file system mounted on the block device. > */ > int (*thaw)(struct block_device *bdev); > + > + void (*get_holder)(void *holder); > + void (*put_holder)(void *holder); > }; > > /* > -- > 2.43.0 >