On Thu, Mar 14, 2024 at 12:10:59PM +0100, Christian Brauner wrote: > On Tue, Mar 12, 2024 at 07:32:35PM -0700, Christoph Hellwig wrote: > > Now that this is in mainline it seems to cause blktests to crash > > nbd/003 with a rather non-obvious oops for me: > > Ok, will be looking into that next. Ok, I know what's going on. Basically, fput() on the block device runs asynchronously which means that bdev->bd_holder can still be set to @sb after it has already been freed. Let me illustrate what I mean: P1 P2 mount(sb) fd = open("/dev/nbd", ...) -> file = bdev_file_open_by_dev(..., sb, ...) bdev->bd_holder = sb; // Later on: umount(sb) ->kill_block_super(sb) |----> [fput() -> deferred via task work] -> put_super(sb) -> frees the sb via rcu | | nbd_ioctl(NBD_CLEAR_SOCK) | -> disk_force_media_change() | -> bdev_mark_dead() | -> fs_mark_dead() | // Finds bdev->bd_holder == sb |-> file->release::bdev_release() // Tries to get reference to it but it's freed; frees it again bdev->bd_holder = NULL; Two solutions that come to my mind: [1] Indicate to fput() that this is an internal block devices open and thus just close it synchronously. This is fine. First, because the block device superblock is never unmounted or anything so there's no risk from hanging there for any reason. Second, bdev_release() also ran synchronously so any issue that we'd see from calling file->f_op->release::bdev_release() we would have seen from bdev_release() itself. See [1.1] for a patch. (2) Take a temporary reference to the holder when opening the block device. This is also fine afaict because we differentiate between passive and active references on superblocks and what we already do in fs_bdev_mark_dead() and friends. This mean we don't have to mess around with fput(). See [1.2] for a patch. (3) Revert and cry. No patch. Personally, I think (2) is more desirable because we don't lose the async property of fput() and we don't need to have a new FMODE_* flag. I'd like to do some more testing with this. Thoughts? [1.1]: Signed-off-by: Christian Brauner <brauner@xxxxxxxxxx> --- block/bdev.c | 1 + fs/file_table.c | 5 +++++ include/linux/fs.h | 3 +++ 3 files changed, 9 insertions(+) diff --git a/block/bdev.c b/block/bdev.c index e7adaaf1c219..d0c208a04b04 100644 --- a/block/bdev.c +++ b/block/bdev.c @@ -969,6 +969,7 @@ struct file *bdev_file_open_by_dev(dev_t dev, blk_mode_t mode, void *holder, return bdev_file; } ihold(bdev->bd_inode); + bdev_file->f_mode |= FMODE_BLOCKDEV; ret = bdev_open(bdev, mode, holder, hops, bdev_file); if (ret) { diff --git a/fs/file_table.c b/fs/file_table.c index 4f03beed4737..48d35dd67020 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -473,6 +473,11 @@ void fput(struct file *file) if (atomic_long_dec_and_test(&file->f_count)) { struct task_struct *task = current; + if (unlikely((file->f_mode & FMODE_BLOCKDEV))) { + __fput(file); + return; + } + if (unlikely(!(file->f_mode & (FMODE_BACKING | FMODE_OPENED)))) { file_free(file); return; diff --git a/include/linux/fs.h b/include/linux/fs.h index d5d5a4ee24f0..ceac9c0316a6 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -121,6 +121,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset, #define FMODE_PWRITE ((__force fmode_t)0x10) /* File is opened for execution with sys_execve / sys_uselib */ #define FMODE_EXEC ((__force fmode_t)0x20) + +/* File is opened as block device. */ +#define FMODE_BLOCKDEV ((__force fmode_t)0x100) /* 32bit hashes as llseek() offset (for directories) */ #define FMODE_32BITHASH ((__force fmode_t)0x200) /* 64bit hashes as llseek() offset (for directories) */ -- 2.43.0 [1.2]: Sketched-by: Christian Brauner <brauner@xxxxxxxxxx> --- block/bdev.c | 4 ++++ fs/super.c | 17 +++++++++++++++++ include/linux/blkdev.h | 3 +++ 3 files changed, 24 insertions(+) diff --git a/block/bdev.c b/block/bdev.c index e7adaaf1c219..a0d5960dc2b9 100644 --- a/block/bdev.c +++ b/block/bdev.c @@ -627,6 +627,8 @@ static void bd_end_claim(struct block_device *bdev, void *holder) whole->bd_holder = NULL; mutex_unlock(&bdev_lock); + if (bdev->bd_holder_ops && bdev->bd_holder_ops->put_holder) + bdev->bd_holder_ops->put_holder(holder); /* * If this was the last claim, remove holder link and unblock evpoll if * it was a write holder. @@ -902,6 +904,8 @@ int bdev_open(struct block_device *bdev, blk_mode_t mode, void *holder, bdev_file->f_mode |= FMODE_NOWAIT; bdev_file->f_mapping = bdev->bd_inode->i_mapping; bdev_file->f_wb_err = filemap_sample_wb_err(bdev_file->f_mapping); + if (hops && hops->get_holder) + hops->get_holder(holder); bdev_file->private_data = holder; return 0; diff --git a/fs/super.c b/fs/super.c index ee05ab6b37e7..64dbbdbed93a 100644 --- a/fs/super.c +++ b/fs/super.c @@ -1515,11 +1515,28 @@ static int fs_bdev_thaw(struct block_device *bdev) return error; } +static void fs_bdev_super_get(void *data) +{ + struct super_block *sb = data; + + spin_lock(&sb_lock); + sb->s_count++; + spin_unlock(&sb_lock); +} + +static void fs_bdev_super_put(void *data) +{ + struct super_block *sb = data; + put_super(sb); +} + const struct blk_holder_ops fs_holder_ops = { .mark_dead = fs_bdev_mark_dead, .sync = fs_bdev_sync, .freeze = fs_bdev_freeze, .thaw = fs_bdev_thaw, + .get_holder = fs_bdev_super_get, + .put_holder = fs_bdev_super_put, }; EXPORT_SYMBOL_GPL(fs_holder_ops); diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index f9b87c39cab0..d919e8bcb2c1 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -1505,6 +1505,9 @@ struct blk_holder_ops { * Thaw the file system mounted on the block device. */ int (*thaw)(struct block_device *bdev); + + void (*get_holder)(void *holder); + void (*put_holder)(void *holder); }; /* -- 2.43.0