On Wed, Aug 28, 2024 at 08:34:00PM +0100, Matthew Wilcox wrote:
> Today it is the responsibility of each filesystem to maintain the mapping
> from file logical addresses to disk blocks (*). There are various ways
> to query that information, eg calling get_block() or using iomap.
>
> What if we pull that information up into the VFS? Filesystems obviously
> _control_ that information, so need to be able to invalidate entries.
> And we wouldn't want to store all extents in the VFS all the time, so
> would need to have a way to call into the filesystem to populate ranges
> of files. We'd need to decide how to lock/protect that information
> -- a per-file lock? A per-extent lock? No locking, just a seqcount?
> We need a COW bit in the extent which tells the user that this extent
> is fine for reading through, but if there's a write to be done then the
> filesystem needs to be asked to create a new extent.
>
> There are a few problems I think this can solve. One is efficient
> implementation of NFS READPLUS. Another is the callback from iomap

Wouldn't readplus (and maybe a sparse copy program) rather have
something that is "SEEK_DATA, fill the buffer with data from that file
position, and tell me what pos the data came from"?

> to the filesystem when doing buffered writeback. A third is having a
> common implementation of FIEMAP. I've heard rumours that FUSE would like
> something like this, and maybe there are other users that would crop up.

My 2-second hot take on this is that FUSE might benefit from an incore
mapping cache, but only because (rcu)locking the cache to query it is
likely faster than jumping out to userspace to ask the server process.
If the fuse server could invalidate parts of that cache, that might not
be too terrible.

> Anyway, this is as far as my thinking has got on this topic for now.
> Maybe there's a good idea here, maybe it's all a huge overengineered mess
> waiting to happen. I'm sure other people know this area of filesystems
> better than I do.

I also suspect that devising a "simple" mapping tree for simple
filesystems will quickly devolve into a mess of figuring out their adhoc
locking and making that work. Even enabling iomap one long-tail fs at a
time sounds like a 10 year project, and they already usually have some
weird notion of coordination of mapping.

"But then there's ext4" etc.

--D

> (*) For block device filesystems. Obviously network filesystems and
> synthetic filesystems don't care and can stop reading now. Umm, unless
> maybe they _want_ to use it, eg maybe there's a sharded thing going on and
> the fs wants to store information about each shard in the extent cache?
>
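
To make the "SEEK_DATA, fill the buffer, tell me what pos the data came
from" idea above a bit more concrete, here is a rough userspace
approximation built from calls that already exist: lseek(2) with
SEEK_DATA/SEEK_HOLE, plus pread(2). This is only a sketch of the
semantics, not a proposed kernel interface; the helper name
read_next_data() is made up for illustration, and unlike a real
single-call interface it is racy against concurrent writers because it
takes three syscalls.

```python
import errno
import os


def read_next_data(fd, pos, length):
    """Hypothetical helper: find the first data at or after `pos`.

    Returns (data_pos, data), where data_pos is the file offset the
    bytes came from, or None if there is no data at or after `pos`.
    """
    try:
        # Skip any hole starting at pos; lands on the next data byte.
        data_pos = os.lseek(fd, pos, os.SEEK_DATA)
    except OSError as e:
        if e.errno == errno.ENXIO:
            return None  # pos is at/after EOF, or only a hole remains
        raise

    # Find where this data run ends so we never return hole zeroes.
    # (EOF counts as an implicit hole, so this always succeeds here.)
    hole_pos = os.lseek(fd, data_pos, os.SEEK_HOLE)

    # pread() takes an explicit offset and ignores the fd's file position.
    data = os.pread(fd, min(length, hole_pos - data_pos), data_pos)
    return data_pos, data
```

A sparse copy loop would call this repeatedly, passing
data_pos + len(data) as the next pos until it returns None. On
filesystems that do not report holes, SEEK_DATA just returns pos and the
whole file looks like one data run, so the helper degrades to a plain
read.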