Re: [RFC v4 linux-next 19/19] fs & block: remove bdev->bd_inode

Hi,

On 2024/03/18 18:46, Christian Brauner wrote:
On Mon, Mar 18, 2024 at 11:29:22AM +0100, Christian Brauner wrote:
On Mon, Mar 18, 2024 at 11:07:49AM +0100, Christian Brauner wrote:
On Mon, Mar 18, 2024 at 03:19:03PM +0800, Yu Kuai wrote:
Hi, Christoph!

On 2024/03/18 9:51, Yu Kuai wrote:
Hi,

On 2024/03/18 9:32, Christoph Hellwig wrote:
On Mon, Mar 18, 2024 at 09:26:48AM +0800, Yu Kuai wrote:
Because there is a real filesystem (devtmpfs) used for raw block device
file operations, open syscall to devtmpfs:

Don't forget:

mknod /my/xfs/file/system b 8 0

which means you're not opening it via devtmpfs but via xfs. IOW, the
inode for that file is from xfs.

I think there is no difference from devtmpfs, no matter what file is
passed in from blkdev_open(), we'll find the only bd_inode and stash
new bdev_file here.


blkdev_open
   bdev = blkdev_get_no_open
   bdev_open -> pass in file is from devtmpfs
   -> in this case, file inode is from devtmpfs,

But file->f_mapping->host should still point to the bdevfs inode,
and file->f_mapping->host is what everything in the I/O path should
be using.
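
The point about file->f_mapping->host can be illustrated with a small
user-space model. This is only a sketch: the structures below are
simplified stand-ins, not the kernel definitions, and every model_* name
is hypothetical. It assumes that opening a block device node redirects
the file's mapping to the bdev inode's mapping, whichever filesystem the
node lives on.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures (assumption: only the
 * fields needed for this illustration are modelled). */
struct inode;

struct address_space {
	struct inode *host;              /* inode that owns this mapping */
};

struct inode {
	const char *fs_name;             /* which filesystem owns the inode */
	struct address_space i_data;     /* the inode's own mapping */
	struct address_space *i_mapping; /* normally points at i_data */
};

struct file {
	struct inode *f_inode;           /* inode of the opened device node */
	struct address_space *f_mapping; /* mapping used for I/O */
};

/* Initialise a model inode so that its mapping points back at itself. */
static void model_inode_init(struct inode *inode, const char *fs_name)
{
	inode->fs_name = fs_name;
	inode->i_data.host = inode;
	inode->i_mapping = &inode->i_data;
}

/* Model of opening a block device node: f_inode stays the node's inode
 * (devtmpfs, xfs, ...), while f_mapping is redirected to the bdev inode. */
static void model_open_blkdev_node(struct file *file, struct inode *node,
				   struct inode *bdev_inode)
{
	file->f_inode = node;
	file->f_mapping = bdev_inode->i_mapping;
}

/* What the I/O path should use, independent of where the node lives. */
static struct inode *model_io_inode(const struct file *file)
{
	return file->f_mapping->host;
}
```

In this model, a node created on devtmpfs and a node created with mknod
on xfs have different f_inode values, but model_io_inode() returns the
same bdev inode for both.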

I mentioned this in
https://lore.kernel.org/r/20240118-gemustert-aalen-ee71d0c69826@brauner

"[...] if we want to have all code pass a file and we have code in
fs/buffer.c like iomap_to_bh():

iomap_to_bh(struct inode *inode, sector_t block, struct buffer_head *bh,
         loff_t offset = block << inode->i_blkbits;

         bh->b_bdev = iomap->bdev;
+       bh->f_b_bdev = iomap->f_bdev;

While that works for every single filesystem that uses block devices
because they stash them somewhere (like s_bdev_file) it doesn't work for
the bdev filesystem itself. So if the bdev filesystem calls into helpers
that expect e.g., buffer_head->s_f_bdev to have been initialized from
iomap->f_bdev this wouldn't work.

So if we want to remove b_bdev from struct buffer_head and fully rely on
f_b_bdev - and similar in iomap - then we need a story for the bdev fs
itself. And I wasn't clear on what that would be."


Then later, in blkdev_iomap_begin(), bd_inode is passed in and there is
no access to the devtmpfs file, so we can't use s_bdev_file() here the
way other filesystems do.

We can just pass the file down in iomap_iter.private

I can do this for blkdev_read_folio(), however, for other ops like
blkdev_writepages(), I can't find a way to pass the file to
iomap_iter.private yet.

Any suggestions?

I came up with an idea:

When the block_device is opened for the first time, store the newly
generated file in "bd_inode->i_private", and release it after the last
opener closes the block_device.

The advantages are:
  - multiple openers can share the same bdev_file;

You mean use the file stashed in bdev_inode->i_private only to retrieve
the inode/mapping in the block layer ops.

Yes. I mean in the first bdev_open() allocate a bdev_file and stash it,
and free it in the last bdev_release().

  - raw block device ops can use the bdev_file as well, and there is no
need to distinguish iomap/buffer_head for raw block_device;

Please let me know what you think.

It's equally ugly but probably slightly less error prone than the union
approach. But please make that separate patches on top of the series.

The other issue with this on-demand inode->i_private allocation will be
lifetime management. If you're doing some sort of writeback initiated
from the filesystem then you're guaranteed that the file stashed in
sb->bdev_file is aligned with the lifetime of the filesystem. All
writeback related stuff that relies on inodes can rely on the
superblock being valid while it is doing stuff.

For a raw block device, before bdev_release() is called for the last
opener (specifically, before bd_openers is decreased to zero),
blkdev_flush_mapping() is called, hence raw block_device writeback
should always see a valid 'bdev_file', which will be released in the
last bdev_release().

And 'blockdev_superblock' will always be there and is always valid.

In your approach that guarantee can't be given easily. If someone opens
a block device /dev/sda does some buffered writes and then closes it the
file might be cleaned up while there's still operations ongoing that
rely on the file stashed in inode->i_private to be valid.

If on the other hand you allocate a stub file on-demand during
bdev_open() and stash it in inode->i_private you need to make sure to
avoid creating reference count cycles that keep the inode alive.

I'm thinking about using 'bdev_openers' to guarantee the lifetime. I
can't think of possible problems for now; however, I could be wrong.
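
A rough user-space model of what I have in mind, as a sketch only. All
model_* names and fields are hypothetical stand-ins (for example,
'stashed' models bd_inode->i_private and 'openers' models the opener
count). It is single-threaded, so it ignores the locking a real
implementation would need, and it does not model writeback that is
still running after the last close, which is the concern raised above.

```c
#include <assert.h>
#include <stdlib.h>

struct model_bdev;

/* Hypothetical stand-in for the stashed bdev_file. */
struct model_file {
	struct model_bdev *bdev;
};

/* Hypothetical stand-in for a block_device plus its bdev inode. */
struct model_bdev {
	int openers;                 /* models the opener count */
	struct model_file *stashed;  /* models bd_inode->i_private */
	int dirty_pages;             /* pretend page cache state */
	int flushes_with_valid_file; /* flushes that saw a valid stash */
};

/* The first opener allocates and stashes the file; later openers share it. */
static int model_bdev_open(struct model_bdev *bdev)
{
	if (bdev->openers == 0) {
		bdev->stashed = calloc(1, sizeof(*bdev->stashed));
		if (!bdev->stashed)
			return -1;
		bdev->stashed->bdev = bdev;
	}
	bdev->openers++;
	return 0;
}

/* Models flushing the mapping; records whether the stash was still valid. */
static void model_flush_mapping(struct model_bdev *bdev)
{
	if (bdev->stashed)
		bdev->flushes_with_valid_file++;
	bdev->dirty_pages = 0;
}

/* The last opener flushes first, and only then frees the stashed file. */
static void model_bdev_release(struct model_bdev *bdev)
{
	if (--bdev->openers > 0)
		return;
	model_flush_mapping(bdev);
	free(bdev->stashed);
	bdev->stashed = NULL;
}
```

In this model the flush on the last release always runs before the
stashed file is freed, so the flush sees a valid file.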

Thanks,
Kuai
