On Wed, Aug 28, 2024 at 08:34:00PM +0100, Matthew Wilcox wrote:
> Today it is the responsibility of each filesystem to maintain the mapping
> from file logical addresses to disk blocks (*).  There are various ways
> to query that information, eg calling get_block() or using iomap.
> 
> What if we pull that information up into the VFS?

We explicitly pulled that information out of the VFS by moving away
from per-page bufferheads that stored the disk mapping for the cached
data to the on-demand, query-based iomap infrastructure.

> Filesystems obviously
> _control_ that information, so need to be able to invalidate entries.

Which is one of the reasons for keeping it out of the VFS...

Besides, which set of mapping information held by the filesystem are
we talking about here?

FYI: XFS has *three* sets of mapping information per inode - the data
fork, the xattr fork and the COW fork. The data fork and the COW fork
both reference file data mappings, and they can overlap whilst there
are COW operations ongoing. Regular files can also have xattr fork
mappings.

Further, directories and symlinks have both data and xattr fork based
mappings, and they do not use the VFS for caching metadata - that is
all internal to the filesystem. Hence if we move to caching mapping
information in the VFS, we have to expose the VFS inode all the way
down into the lowest layers of the XFS metadata subsystem, where there
is currently nothing that is Linux/VFS specific. IOWs, if we don't
cache mapping information natively in the filesystem, we are forcing
filesystems to drill VFS structures deep into their internal metadata
implementations.

Hence if you're thinking purely about caching file data mappings at
the VFS, then what you're asking filesystems to support is multiple
extent map caching schemes instead of just one.

And I'm largely ignoring the transactional change requirements for
extent maps, and how a VFS cache would place the cache locking both
above and below transaction boundaries. And then there's the
inevitable shrinker interface for reclaim of cached VFS extent maps,
and the placement of that locking both above and below memory
allocation. That's a recipe for lockdep false positives all over the
place...

> And we wouldn't want to store all extents in the VFS all the time, so
> would need to have a way to call into the filesystem to populate ranges
> of files.

This would require substantial modification to filesystems like XFS
that assume the mapping cache is always fully populated before a
lookup or modification is done. It's not just a case of "make sure
this range is populated"; it's also a change to the entire locking
model for extent map access, because cache population under a shared
lock is inherently racy.

> We'd need to decide how to lock/protect that information
> -- a per-file lock?  A per-extent lock?  No locking, just a seqcount?

Right now XFS uses a private per-inode metadata rwsem for exclusion,
and we generally don't have terrible contention problems with that
strategy. Other filesystems use private rwsems, too, but often they
only protect mapping operations, not all of the metadata in the
inode. Still other filesystems use per-extent locking. As such, I'm
not sure there is a "one size fits all" model here...

> We need a COW bit in the extent which tells the user that this extent
> is fine for reading through, but if there's a write to be done then the
> filesystem needs to be asked to create a new extent.

It's more than that - we need somewhere to hold the COW extent
mappings that we've allocated and that overlap existing data mappings.
We do delayed allocation and/or preallocation with allocate-around for
COW to minimise fragmentation. Hence we have concurrent mappings for
the same file range: the existing data, and where the dirty cached
data is going to be placed when it is finally written. Then, on IO
completion, we do the transactional update to punch out the old data
extent and swap in the new data extent from the COW fork where we just
wrote the new data.

IOWs, managing COW mappings is much more complex than a simple flag
that says "this range needs allocation on writeback". Yes, we can do
unwritten extents like that (i.e. a simple flag in the extent to say
"do unwritten extent conversion on IO completion"), but COW is much,
much more complex...

> There are a few problems I think this can solve.  One is efficient
> implementation of NFS READPLUS.

How "inefficient" is an iomap implementation? It iterates one extent
at a time, and a readplus iterator can simply encode data and holes as
it queries the range one extent at a time, right?
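As a very rough, untested sketch of what I mean - the encode_hole()
and encode_data() helpers are made-up stand-ins for whatever XDR
encoding nfsd actually does, and dirty page cache state over
unwritten/delalloc extents is ignored for clarity:

#include <linux/fs.h>
#include <linux/iomap.h>

/* Hypothetical nfsd encoding helpers - not real functions. */
static int encode_hole(loff_t pos, u64 len);
static int encode_data(struct inode *inode, loff_t pos, u64 len);

/*
 * Walk [pos, pos + count) one extent at a time and encode each range
 * as either a hole or data. No extent cache above the filesystem is
 * needed - the fs ->iomap_begin method is queried for each extent as
 * we go.
 */
static int read_plus_encode_range(struct inode *inode, loff_t pos,
		u64 count, const struct iomap_ops *ops)
{
	struct iomap_iter iter = {
		.inode	= inode,
		.pos	= pos,
		.len	= count,
		.flags	= IOMAP_REPORT,
	};
	int ret;

	while ((ret = iomap_iter(&iter, ops)) > 0) {
		u64 length = iomap_length(&iter);

		/*
		 * Treat unwritten extents as holes - this ignores dirty
		 * page cache over unwritten extents for simplicity.
		 */
		if (iter.iomap.type == IOMAP_HOLE ||
		    iter.iomap.type == IOMAP_UNWRITTEN)
			ret = encode_hole(iter.pos, length);
		else
			ret = encode_data(inode, iter.pos, length);

		/* Advance to the next extent, or propagate the error. */
		iter.processed = ret < 0 ? ret : length;
	}
	return ret;
}

i.e. the filesystem remains the sole authority for the mapping, and we
just query it one extent at a time as we encode.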
> Another is the callback from iomap
> to the filesystem when doing buffered writeback.

Filesystems need to do COW setup work or delayed allocation here, so
we have to call into the filesystem regardless of whether there is a
VFS mapping cache or not. In that case the callout requires exclusive
locking, but if it's an overwrite the callout only needs shared
locking. However, until we call into the filesystem we don't know
which operation we have to perform or which type of locks we have to
take, because the extent map can change until we hold the internal
extent map lock...

Fundamentally, I don't want operations like truncate, hole punch, etc.
to have to grow *another* lock. We currently have to take the inode
lock, the invalidate lock and internal metadata locks to lock
everything out. With an independent mapping cache, we're also going to
have to take that cache's lock, especially if things like writeback
only use the mapping cache lock.

> A third is having a
> common implementation of FIEMAP.

We've already got that with iomap.

> I've heard rumours that FUSE would like
> something like this, and maybe there are other users that would crop up.
> 
> Anyway, this is as far as my thinking has got on this topic for now.
> Maybe there's a good idea here, maybe it's all a huge overengineered mess
> waiting to happen.  I'm sure other people know this area of filesystems
> better than I do.

Caching mapping state in the VFS has proven to be less than ideal in
the past for reasons of coherency and resource usage. We've explicitly
moved away from that model to an extent-query model with iomap, and
right now I'm not seeing any advantages or additional functionality
that caching extent maps in the VFS would bring over the existing
iomap model...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx