Re: [PATCH v2 01/34] bdev: open block device as files

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Mar 14, 2024 at 12:10:59PM +0100, Christian Brauner wrote:
> On Tue, Mar 12, 2024 at 07:32:35PM -0700, Christoph Hellwig wrote:
> > Now that this is in mainline it seems to cause blktests to crash
> > nbd/003 with a rather non-obvious oops for me:
> 
> Ok, will be looking into that next.

Ok, I know what's going on. Basically, fput() on the block device runs
asynchronously which means that bdev->bd_holder can still be set to @sb
after it has already been freed. Let me illustrate what I mean:

P1                                                 P2
mount(sb)                                          fd = open("/dev/nbd", ...)
-> file = bdev_file_open_by_dev(..., sb, ...)
   bdev->bd_holder = sb;

// Later on:

umount(sb)
->kill_block_super(sb)
|----> [fput() -> deferred via task work]
-> put_super(sb) -> frees the sb via rcu
|
|                                                 nbd_ioctl(NBD_CLEAR_SOCK)
|                                                 -> disk_force_media_change()
|                                                    -> bdev_mark_dead()
|                                                       -> fs_mark_dead()
|                                                          // Finds bdev->bd_holder == sb
|-> file->release::bdev_release()                          // Tries to get reference to it but it's freed; frees it again
   bdev->bd_holder = NULL;

Two solutions that come to my mind:

[1] Indicate to fput() that this is an internal block devices open and
    thus just close it synchronously. This is fine. First, because the block
    device superblock is never unmounted or anything so there's no risk
    from hanging there for any reason. Second, bdev_release() also ran
    synchronously so any issue that we'd see from calling
    file->f_op->release::bdev_release() we would have seen from
    bdev_release() itself. See [1.1] for a patch.

(2) Take a temporary reference to the holder when opening the block
    device. This is also fine afaict because we differentiate between
    passive and active references on superblocks and what we already do
    in fs_bdev_mark_dead() and friends. This mean we don't have to mess
    around with fput(). See [1.2] for a patch.

(3) Revert and cry. No patch.

Personally, I think (2) is more desirable because we don't lose the
async property of fput() and we don't need to have a new FMODE_* flag.
I'd like to do some more testing with this. Thoughts?

[1.1]:
Signed-off-by: Christian Brauner <brauner@xxxxxxxxxx>
---
 block/bdev.c       | 1 +
 fs/file_table.c    | 5 +++++
 include/linux/fs.h | 3 +++
 3 files changed, 9 insertions(+)

diff --git a/block/bdev.c b/block/bdev.c
index e7adaaf1c219..d0c208a04b04 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -969,6 +969,7 @@ struct file *bdev_file_open_by_dev(dev_t dev, blk_mode_t mode, void *holder,
 		return bdev_file;
 	}
 	ihold(bdev->bd_inode);
+	bdev_file->f_mode |= FMODE_BLOCKDEV;
 
 	ret = bdev_open(bdev, mode, holder, hops, bdev_file);
 	if (ret) {
diff --git a/fs/file_table.c b/fs/file_table.c
index 4f03beed4737..48d35dd67020 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -473,6 +473,11 @@ void fput(struct file *file)
 	if (atomic_long_dec_and_test(&file->f_count)) {
 		struct task_struct *task = current;
 
+		if (unlikely((file->f_mode & FMODE_BLOCKDEV))) {
+			__fput(file);
+			return;
+		}
+
 		if (unlikely(!(file->f_mode & (FMODE_BACKING | FMODE_OPENED)))) {
 			file_free(file);
 			return;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index d5d5a4ee24f0..ceac9c0316a6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -121,6 +121,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 #define FMODE_PWRITE		((__force fmode_t)0x10)
 /* File is opened for execution with sys_execve / sys_uselib */
 #define FMODE_EXEC		((__force fmode_t)0x20)
+
+/* File is opened as block device. */
+#define FMODE_BLOCKDEV		((__force fmode_t)0x100)
 /* 32bit hashes as llseek() offset (for directories) */
 #define FMODE_32BITHASH         ((__force fmode_t)0x200)
 /* 64bit hashes as llseek() offset (for directories) */
-- 
2.43.0

[1.2]:
Sketched-by: Christian Brauner <brauner@xxxxxxxxxx>
---
 block/bdev.c           |  4 ++++
 fs/super.c             | 17 +++++++++++++++++
 include/linux/blkdev.h |  3 +++
 3 files changed, 24 insertions(+)

diff --git a/block/bdev.c b/block/bdev.c
index e7adaaf1c219..a0d5960dc2b9 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -627,6 +627,8 @@ static void bd_end_claim(struct block_device *bdev, void *holder)
 		whole->bd_holder = NULL;
 	mutex_unlock(&bdev_lock);
 
+	if (bdev->bd_holder_ops && bdev->bd_holder_ops->put_holder)
+		bdev->bd_holder_ops->put_holder(holder);
 	/*
 	 * If this was the last claim, remove holder link and unblock evpoll if
 	 * it was a write holder.
@@ -902,6 +904,8 @@ int bdev_open(struct block_device *bdev, blk_mode_t mode, void *holder,
 		bdev_file->f_mode |= FMODE_NOWAIT;
 	bdev_file->f_mapping = bdev->bd_inode->i_mapping;
 	bdev_file->f_wb_err = filemap_sample_wb_err(bdev_file->f_mapping);
+	if (hops && hops->get_holder)
+		hops->get_holder(holder);
 	bdev_file->private_data = holder;
 
 	return 0;
diff --git a/fs/super.c b/fs/super.c
index ee05ab6b37e7..64dbbdbed93a 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1515,11 +1515,28 @@ static int fs_bdev_thaw(struct block_device *bdev)
 	return error;
 }
 
+static void fs_bdev_super_get(void *data)
+{
+	struct super_block *sb = data;
+
+	spin_lock(&sb_lock);
+	sb->s_count++;
+	spin_unlock(&sb_lock);
+}
+
+static void fs_bdev_super_put(void *data)
+{
+	struct super_block *sb = data;
+	put_super(sb);
+}
+
 const struct blk_holder_ops fs_holder_ops = {
 	.mark_dead		= fs_bdev_mark_dead,
 	.sync			= fs_bdev_sync,
 	.freeze			= fs_bdev_freeze,
 	.thaw			= fs_bdev_thaw,
+	.get_holder		= fs_bdev_super_get,
+	.put_holder		= fs_bdev_super_put,
 };
 EXPORT_SYMBOL_GPL(fs_holder_ops);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f9b87c39cab0..d919e8bcb2c1 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1505,6 +1505,9 @@ struct blk_holder_ops {
 	 * Thaw the file system mounted on the block device.
 	 */
 	int (*thaw)(struct block_device *bdev);
+
+	void (*get_holder)(void *holder);
+	void (*put_holder)(void *holder);
 };
 
 /*
-- 
2.43.0





[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux