Re: [PATCH v2 01/34] bdev: open block device as files

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Mar 14, 2024 at 03:47:52PM +0100, Christian Brauner wrote:
> On Thu, Mar 14, 2024 at 12:10:59PM +0100, Christian Brauner wrote:
> > On Tue, Mar 12, 2024 at 07:32:35PM -0700, Christoph Hellwig wrote:
> > > Now that this is in mainline it seems to cause blktests to crash
> > > nbd/003 with a rather non-obvious oops for me:
> > 
> > Ok, will be looking into that next.
> 
> Ok, I know what's going on. Basically, fput() on the block device runs
> asynchronously which means that bdev->bd_holder can still be set to @sb
> after it has already been freed. Let me illustrate what I mean:
> 
> P1                                                 P2
> mount(sb)                                          fd = open("/dev/nbd", ...)
> -> file = bdev_file_open_by_dev(..., sb, ...)
>    bdev->bd_holder = sb;
> 
> // Later on:
> 
> umount(sb)
> ->kill_block_super(sb)
> |----> [fput() -> deferred via task work]
> -> put_super(sb) -> frees the sb via rcu
> |
> |                                                 nbd_ioctl(NBD_CLEAR_SOCK)
> |                                                 -> disk_force_media_change()
> |                                                    -> bdev_mark_dead()
> |                                                       -> fs_mark_dead()
> |                                                          // Finds bdev->bd_holder == sb
> |-> file->release::bdev_release()                          // Tries to get reference to it but it's freed; frees it again
>    bdev->bd_holder = NULL;
> 
> Two solutions that come to my mind:
> 
> [1] Indicate to fput() that this is an internal block devices open and
>     thus just close it synchronously. This is fine. First, because the block
>     device superblock is never unmounted or anything so there's no risk
>     from hanging there for any reason. Second, bdev_release() also ran
>     synchronously so any issue that we'd see from calling
>     file->f_op->release::bdev_release() we would have seen from
>     bdev_release() itself. See [1.1] for a patch.
> 
> (2) Take a temporary reference to the holder when opening the block
>     device. This is also fine afaict because we differentiate between
>     passive and active references on superblocks and what we already do
>     in fs_bdev_mark_dead() and friends. This mean we don't have to mess
>     around with fput(). See [1.2] for a patch.
> 
> (3) Revert and cry. No patch.
> 
> Personally, I think (2) is more desirable because we don't lose the
> async property of fput() and we don't need to have a new FMODE_* flag.
> I'd like to do some more testing with this. Thoughts?
> 
> [1.1]:
> Signed-off-by: Christian Brauner <brauner@xxxxxxxxxx>
> ---
>  block/bdev.c       | 1 +
>  fs/file_table.c    | 5 +++++
>  include/linux/fs.h | 3 +++
>  3 files changed, 9 insertions(+)
> 
> diff --git a/block/bdev.c b/block/bdev.c
> index e7adaaf1c219..d0c208a04b04 100644
> --- a/block/bdev.c
> +++ b/block/bdev.c
> @@ -969,6 +969,7 @@ struct file *bdev_file_open_by_dev(dev_t dev, blk_mode_t mode, void *holder,
>  		return bdev_file;
>  	}
>  	ihold(bdev->bd_inode);
> +	bdev_file->f_mode |= FMODE_BLOCKDEV;
>  
>  	ret = bdev_open(bdev, mode, holder, hops, bdev_file);
>  	if (ret) {
> diff --git a/fs/file_table.c b/fs/file_table.c
> index 4f03beed4737..48d35dd67020 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -473,6 +473,11 @@ void fput(struct file *file)
>  	if (atomic_long_dec_and_test(&file->f_count)) {
>  		struct task_struct *task = current;
>  
> +		if (unlikely((file->f_mode & FMODE_BLOCKDEV))) {
> +			__fput(file);
> +			return;
> +		}
> +
>  		if (unlikely(!(file->f_mode & (FMODE_BACKING | FMODE_OPENED)))) {
>  			file_free(file);
>  			return;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index d5d5a4ee24f0..ceac9c0316a6 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -121,6 +121,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
>  #define FMODE_PWRITE		((__force fmode_t)0x10)
>  /* File is opened for execution with sys_execve / sys_uselib */
>  #define FMODE_EXEC		((__force fmode_t)0x20)
> +
> +/* File is opened as block device. */
> +#define FMODE_BLOCKDEV		((__force fmode_t)0x100)
>  /* 32bit hashes as llseek() offset (for directories) */
>  #define FMODE_32BITHASH         ((__force fmode_t)0x200)
>  /* 64bit hashes as llseek() offset (for directories) */
> -- 
> 2.43.0
> 
> [1.2]:
> Sketched-by: Christian Brauner <brauner@xxxxxxxxxx>
> ---
>  block/bdev.c           |  4 ++++
>  fs/super.c             | 17 +++++++++++++++++
>  include/linux/blkdev.h |  3 +++
>  3 files changed, 24 insertions(+)
> 
> diff --git a/block/bdev.c b/block/bdev.c
> index e7adaaf1c219..a0d5960dc2b9 100644
> --- a/block/bdev.c
> +++ b/block/bdev.c
> @@ -627,6 +627,8 @@ static void bd_end_claim(struct block_device *bdev, void *holder)
>  		whole->bd_holder = NULL;
>  	mutex_unlock(&bdev_lock);
>  
> +	if (bdev->bd_holder_ops && bdev->bd_holder_ops->put_holder)
> +		bdev->bd_holder_ops->put_holder(holder);

That should move into bdev_release() obviously so it mirrors
bdev_open(). Plus, currently it's a nop because we just NULLed
bdev->bd_holder_ops above. But you get the idea, I hope.

>  	/*
>  	 * If this was the last claim, remove holder link and unblock evpoll if
>  	 * it was a write holder.
> @@ -902,6 +904,8 @@ int bdev_open(struct block_device *bdev, blk_mode_t mode, void *holder,
>  		bdev_file->f_mode |= FMODE_NOWAIT;
>  	bdev_file->f_mapping = bdev->bd_inode->i_mapping;
>  	bdev_file->f_wb_err = filemap_sample_wb_err(bdev_file->f_mapping);
> +	if (hops && hops->get_holder)
> +		hops->get_holder(holder);
>  	bdev_file->private_data = holder;
>  
>  	return 0;
> diff --git a/fs/super.c b/fs/super.c
> index ee05ab6b37e7..64dbbdbed93a 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1515,11 +1515,28 @@ static int fs_bdev_thaw(struct block_device *bdev)
>  	return error;
>  }
>  
> +static void fs_bdev_super_get(void *data)
> +{
> +	struct super_block *sb = data;
> +
> +	spin_lock(&sb_lock);
> +	sb->s_count++;
> +	spin_unlock(&sb_lock);
> +}
> +
> +static void fs_bdev_super_put(void *data)
> +{
> +	struct super_block *sb = data;
> +	put_super(sb);
> +}
> +
>  const struct blk_holder_ops fs_holder_ops = {
>  	.mark_dead		= fs_bdev_mark_dead,
>  	.sync			= fs_bdev_sync,
>  	.freeze			= fs_bdev_freeze,
>  	.thaw			= fs_bdev_thaw,
> +	.get_holder		= fs_bdev_super_get,
> +	.put_holder		= fs_bdev_super_put,
>  };
>  EXPORT_SYMBOL_GPL(fs_holder_ops);
>  
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index f9b87c39cab0..d919e8bcb2c1 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1505,6 +1505,9 @@ struct blk_holder_ops {
>  	 * Thaw the file system mounted on the block device.
>  	 */
>  	int (*thaw)(struct block_device *bdev);
> +
> +	void (*get_holder)(void *holder);
> +	void (*put_holder)(void *holder);
>  };
>  
>  /*
> -- 
> 2.43.0
> 




[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux