On Mon, Mar 18, 2024 at 11:29:22AM +0100, Christian Brauner wrote: > On Mon, Mar 18, 2024 at 11:07:49AM +0100, Christian Brauner wrote: > > On Mon, Mar 18, 2024 at 03:19:03PM +0800, Yu Kuai wrote: > > > Hi, Christoph! > > > > > > 在 2024/03/18 9:51, Yu Kuai 写道: > > > > Hi, > > > > > > > > 在 2024/03/18 9:32, Christoph Hellwig 写道: > > > > > On Mon, Mar 18, 2024 at 09:26:48AM +0800, Yu Kuai wrote: > > > > > > Because there is a real filesystem(devtmpfs) used for raw block devcie > > > > > > file operations, open syscall to devtmpfs: > > Don't forget: > > mknod /my/xfs/file/system b 8 0 > > which means you're not opening it via devtmpfs but via xfs. IOW, the > inode for that file is from xfs. > > > > > > > > > > > > > blkdev_open > > > > > > bdev = blkdev_get_no_open > > > > > > bdev_open -> pass in file is from devtmpfs > > > > > > -> in this case, file inode is from devtmpfs, > > > > > > > > > > But file->f_mapping->host should still point to the bdevfs inode, > > > > > and file->f_mapping->host is what everything in the I/O path should > > > > > be using. > > > > I mentioned this in > > https://lore.kernel.org/r/20240118-gemustert-aalen-ee71d0c69826@brauner > > > > "[...] if we want to have all code pass a file and we have code in > > fs/buffer.c like iomap_to_bh(): > > > > iomap_to_bh(struct inode *inode, sector_t block, struct buffer_head *bh, > > loff_t offset = block << inode->i_blkbits; > > > > bh->b_bdev = iomap->bdev; > > + bh->f_b_bdev = iomap->f_bdev; > > > > While that works for every single filesystem that uses block devices > > because they stash them somewhere (like s_bdev_file) it doesn't work for > > the bdev filesystem itself. So if the bdev filesystem calls into helpers > > that expect e.g., buffer_head->s_f_bdev to have been initialized from > > iomap->f_bdev this wouldn't work. > > > > So if we want to remove b_bdev from struct buffer_head and fully rely on > > f_b_bdev - and similar in iomap - then we need a story for the bdev fs > > itself. And I wasn't clear on what that would be." > > > > > > > > > > > > > Then later, in blkdev_iomap_begin(), bd_inode is passed in and there is > > > > > > no access to the devtmpfs file, we can't use s_bdev_file() as other > > > > > > filesystems here. > > > > > > > > > > We can just pass the file down in iomap_iter.private > > > > > > > > I can do this for blkdev_read_folio(), however, for other ops like > > > > blkdev_writepages(), I can't find a way to pass the file to > > > > iomap_iter.private yet. > > > > > > > > Any suggestions? > > > > > > I come up with an ideal: > > > > > > While opening the block_device the first time, store the generated new > > > file in "bd_inode->i_private". And release it after the last opener > > > close the block_device. > > > > > > The advantages are: > > > - multiple openers can share the same bdev_file; > > > > You mean use the file stashed in bdev_inode->i_private only to retrieve > > the inode/mapping in the block layer ops. > > > > > - raw block device ops can use the bdev_file as well, and there is no > > > need to distinguish iomap/buffer_head for raw block_device; > > > > > > Please let me know what do you think? > > > > It's equally ugly but probably slightly less error prone than the union > > approach. But please make that separate patches on top of the series. The other issue with this on-demand inode->i_private allocation will be lifetime management. If you're doing some sort of writeback initiated from the filesystem then you're guaranteed that the file stashed in sb->bdev_file is aligned with the lifetime of the filesystem. All writeback related stuff that relies on inode's can rely on the superblock being valid while it is doing stuff. In your approach that guarantee can't be given easily. If someone opens a block device /dev/sda does some buffered writes and then closes it the file might be cleaned up while there's still operations ongoing that rely on the file stashed in inode->i_private to be valid. If on the other hand you allocate a stub file on-demand during bdev_open() and stash it in inode->i_private you need to make sure to avoid creating reference count cycles that keep the inode alive.