Re: VFS caching of file extents

Josef Bacik <josef@xxxxxxxxxxxxxx> · Wed, 28 Aug 2024 16:30:26 -0400

On Wed, Aug 28, 2024 at 08:34:00PM +0100, Matthew Wilcox wrote:
> Today it is the responsibility of each filesystem to maintain the mapping
> from file logical addresses to disk blocks (*).  There are various ways
> to query that information, eg calling get_block() or using iomap.
> 
> What if we pull that information up into the VFS?  Filesystems obviously
> _control_ that information, so need to be able to invalidate entries.
> And we wouldn't want to store all extents in the VFS all the time, so
> would need to have a way to call into the filesystem to populate ranges
> of files.  We'd need to decide how to lock/protect that information
> -- a per-file lock?  A per-extent lock?  No locking, just a seqcount?
> We need a COW bit in the extent which tells the user that this extent
> is fine for reading through, but if there's a write to be done then the
> filesystem needs to be asked to create a new extent.
> 

At least for btrfs we store a lot of things in our extent map, so I'm not sure
everybody wants to pay the overhead of all the information we keep cached in
these entries.

We also protect all of that with an extent lock, and again I'm not sure
everybody wants to adopt our extent locking.  If we pushed the locking
responsibility into the file system then hooray, but that makes the generic
implementation more complex.
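
To make the trade-off a bit more concrete, here is a rough userspace sketch of
the kind of generic cache being discussed: a small extent record with a COW
bit, a populate callback into the filesystem, and range invalidation.  Every
name in it (vfs_extent, extent_cache, cache_lookup, cache_invalidate,
demo_populate) is made up for illustration and is not an existing kernel or
btrfs interface.  It deliberately has no locking at all, since that is exactly
the part that is still undecided, and it carries far less per-extent state
than the btrfs extent map does.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical generic extent record; not a real kernel structure. */
struct vfs_extent {
	uint64_t file_off;	/* logical start in the file, in bytes */
	uint64_t len;		/* length in bytes */
	uint64_t disk_off;	/* physical start on disk, in bytes */
	bool cow;		/* fine for reads; a write must ask the fs */
	bool valid;		/* slot holds a usable mapping */
};

/* Assumed fs callback: fill *out with the extent covering off. 0 on success. */
typedef int (*populate_fn)(void *fs_priv, uint64_t off, bool for_write,
			   struct vfs_extent *out);

#define CACHE_SLOTS 8

struct extent_cache {
	struct vfs_extent slots[CACHE_SLOTS];
	populate_fn populate;
	void *fs_priv;
	unsigned int next;	/* round-robin replacement cursor */
};

static struct vfs_extent *cache_find(struct extent_cache *c, uint64_t off)
{
	for (size_t i = 0; i < CACHE_SLOTS; i++) {
		struct vfs_extent *e = &c->slots[i];

		if (e->valid && off >= e->file_off && off - e->file_off < e->len)
			return e;
	}
	return NULL;
}

/* Drop every cached extent overlapping [start, start + len). */
void cache_invalidate(struct extent_cache *c, uint64_t start, uint64_t len)
{
	for (size_t i = 0; i < CACHE_SLOTS; i++) {
		struct vfs_extent *e = &c->slots[i];

		if (e->valid && e->file_off < start + len &&
		    start < e->file_off + e->len)
			e->valid = false;
	}
}

/* Look up the extent covering off, calling into the fs on a miss or
 * when a write hits an extent that has the COW bit set. */
int cache_lookup(struct extent_cache *c, uint64_t off, bool for_write,
		 struct vfs_extent *out)
{
	struct vfs_extent *e = cache_find(c, off);
	struct vfs_extent fresh;
	int ret;

	if (e && !(for_write && e->cow)) {
		*out = *e;
		return 0;
	}

	ret = c->populate(c->fs_priv, off, for_write, &fresh);
	if (ret)
		return ret;

	/* The new mapping supersedes anything cached for its range. */
	fresh.valid = true;
	cache_invalidate(c, fresh.file_off, fresh.len);
	c->slots[c->next] = fresh;
	c->next = (c->next + 1) % CACHE_SLOTS;
	*out = fresh;
	return 0;
}

/* Toy "filesystem": 4 KiB extents, reads map at 1 MiB and are marked COW,
 * writes get a new location at 2 MiB.  Counts how often it is called. */
struct demo_fs {
	unsigned int calls;
};

int demo_populate(void *fs_priv, uint64_t off, bool for_write,
		  struct vfs_extent *out)
{
	struct demo_fs *fs = fs_priv;

	fs->calls++;
	out->file_off = off & ~UINT64_C(4095);
	out->len = 4096;
	out->disk_off = (for_write ? UINT64_C(2) : UINT64_C(1)) * 1048576 +
			out->file_off;
	out->cow = !for_write;
	out->valid = true;
	return 0;
}
```

Even this toy version has to decide who invalidates what and when, and it
still has no answer for concurrent readers and writers.  Whichever of the
per-file lock, per-extent lock or seqcount options gets picked would have to
wrap both cache_lookup() and cache_invalidate().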

> There are a few problems I think this can solve.  One is efficient
> implementation of NFS READPLUS.  Another is the callback from iomap
> to the filesystem when doing buffered writeback.  A third is having a
> common implementation of FIEMAP.  I've heard rumours that FUSE would like
> something like this, and maybe there are other users that would crop up.
> 

For us, we actually stopped using our in-memory cache for FIEMAP because it
ended up being way slower and kind of a pain to work with, given all the
different ways we update the cache as IO happens.  Our FIEMAP implementation
just reads the extents on disk because it's easier/cleaner to walk through the
btree than the cache.

> Anyway, this is as far as my thinking has got on this topic for now.
> Maybe there's a good idea here, maybe it's all a huge overengineered mess
> waiting to happen.  I'm sure other people know this area of filesystems
> better than I do.

Maybe it's fine for simpler file systems, and it could probably be argued that
btrfs is a bit over-engineered in this case, but I worry it'll turn into one of
those "this seemed like a good idea at the time, but after we added all the
features everybody needed we ended up with something way more complex"
scenarios.  Thanks,

Josef