Hi,
在 2024/03/24 0:11, Christian Brauner 写道:
Last kernel release we introduce CONFIG_BLK_DEV_WRITE_MOUNTED. By
default this option is set. When it is set the long-standing behavior
of being able to write to mounted block devices is enabled.
But in order to guard against unintended corruption by writing to the
block device buffer cache CONFIG_BLK_DEV_WRITE_MOUNTED can be turned
off. In that case it isn't possible to write to mounted block devices
anymore.
A filesystem may open its block devices with BLK_OPEN_RESTRICT_WRITES
which disallows concurrent BLK_OPEN_WRITE access. When we still had the
bdev handle around we could recognize BLK_OPEN_RESTRICT_WRITES because
the mode was passed around. Since we managed to get rid of the bdev
handle we changed that logic to recognize BLK_OPEN_RESTRICT_WRITES based
on whether the file was opened writable and writes to that block device
are blocked. That logic doesn't work because we do allow
BLK_OPEN_RESTRICT_WRITES to be specified without BLK_OPEN_WRITE.
I don't get it here, looks like there are no such use case. All users
passed in BLK_OPEN_RESTRICT_WRITES together with BLK_OPEN_WRITE.
Is the following root cause here?
1) t1 open with BLK_OPEN_WRITE
2) t2 open with BLK_OPEN_RESTRICT_WRITES, with bdev_block_writes(), yes
we don't wait for t1 to close;
3) t1 close, after the commit, bdev_unblock_writes() is called
unexpected.
Following openers will succeed although t2 doesn't close;
So fix the detection logic. Use O_EXCL as an indicator that
BLK_OPEN_RESTRICT_WRITES has been requested. We do the exact same thing
for pidfds where O_EXCL means that this is a pidfd that refers to a
thread. For userspace open paths O_EXCL will never be retained but for
internal opens where we open files that are never installed into a file
descriptor table this is fine.
From the path blkdev_open(), the file is from devtmpfs, and user can
pass in O_EXCL for that file, and that file will be used later in
blkdev_release() -> bdev_release() -> bdev_yield_write_access().
Thanks,
Kuai
Note that BLK_OPEN_RESTRICT_WRITES is an internal only flag that cannot
directly be raised by userspace. It is implicitly raised during
mounting.
Passes xftests and blktests with CONFIG_BLK_DEV_WRITE_MOUNTED set and
unset.
Fixes: 321de651fa56 ("block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access")
Reported-by: Matthew Wilcox <willy@xxxxxxxxxxxxx>
Link: https://lore.kernel.org/r/ZfyyEwu9Uq5Pgb94@xxxxxxxxxxxxxxxxxxxx
Signed-off-by: Christian Brauner <brauner@xxxxxxxxxx>
---
block/bdev.c | 20 +++++++++++++-------
1 file changed, 13 insertions(+), 7 deletions(-)
diff --git a/block/bdev.c b/block/bdev.c
index 7a5f611c3d2e..f819f3086905 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -821,13 +821,12 @@ static void bdev_yield_write_access(struct file *bdev_file)
return;
bdev = file_bdev(bdev_file);
- /* Yield exclusive or shared write access. */
- if (bdev_file->f_mode & FMODE_WRITE) {
- if (bdev_writes_blocked(bdev))
- bdev_unblock_writes(bdev);
- else
- bdev->bd_writers--;
- }
+
+ /* O_EXCL is only set for internal BLK_OPEN_RESTRICT_WRITES. */
+ if (bdev_file->f_flags & O_EXCL)
+ bdev_unblock_writes(bdev);
+ else if (bdev_file->f_mode & FMODE_WRITE)
+ bdev->bd_writers--;
}
/**
@@ -946,6 +945,13 @@ static unsigned blk_to_file_flags(blk_mode_t mode)
else
WARN_ON_ONCE(true);
+ /*
+ * BLK_OPEN_RESTRICT_WRITES is never set from userspace and
+ * O_EXCL is stripped from userspace.
+ */
+ if (mode & BLK_OPEN_RESTRICT_WRITES)
+ flags |= O_EXCL;
+
if (mode & BLK_OPEN_NDELAY)
flags |= O_NDELAY;