On Wed, Aug 28, 2024 at 08:34:00PM +0100, Matthew Wilcox wrote:
> Today it is the responsibility of each filesystem to maintain the mapping
> from file logical addresses to disk blocks (*). There are various ways
> to query that information, eg calling get_block() or using iomap.
>
> What if we pull that information up into the VFS? Filesystems obviously
> _control_ that information, so need to be able to invalidate entries.
> And we wouldn't want to store all extents in the VFS all the time, so
> would need to have a way to call into the filesystem to populate ranges
> of files. We'd need to decide how to lock/protect that information
> -- a per-file lock? A per-extent lock? No locking, just a seqcount?
> We need a COW bit in the extent which tells the user that this extent
> is fine for reading through, but if there's a write to be done then the
> filesystem needs to be asked to create a new extent.
>

At least for btrfs we store a lot of things in our extent map, so I'm not
sure everybody wants to share the overhead of the amount of information we
keep cached in these entries.

We also protect all of that with an extent lock, and again I'm not entirely
sure everybody wants to adopt our extent locking. If we pushed the locking
responsibility into the file system then hooray, but that makes the generic
implementation more complex.

> There are a few problems I think this can solve. One is efficient
> implementation of NFS READPLUS. Another is the callback from iomap
> to the filesystem when doing buffered writeback. A third is having a
> common implementation of FIEMAP. I've heard rumours that FUSE would like
> something like this, and maybe there are other users that would crop up.
>

For us, we actually stopped using our in-memory cache for FIEMAP because it
ended up being way slower, and it was kind of a pain to work with all the
different ways we update the cache based on IO happening. Our FIEMAP
implementation just reads the extents on disk, because it's easier and
cleaner to walk through the btree than the cache.

> Anyway, this is as far as my thinking has got on this topic for now.
> Maybe there's a good idea here, maybe it's all a huge overengineered mess
> waiting to happen. I'm sure other people know this area of filesystems
> better than I do.

Maybe it's fine for simpler file systems, and it could probably be argued
that btrfs is a bit over-engineered in this case, but I worry it'll turn
into one of those "this seemed like a good idea at the time, but after we
added all the features everybody needed we ended up with something way more
complex" scenarios. Thanks,

Josef