Re: VFS caching of file extents

Chuck Lever <chuck.lever@xxxxxxxxxx> · Wed, 28 Aug 2024 15:46:34 -0400

On Wed, Aug 28, 2024 at 08:34:00PM +0100, Matthew Wilcox wrote:
> Today it is the responsibility of each filesystem to maintain the mapping
> from file logical addresses to disk blocks (*).  There are various ways
> to query that information, eg calling get_block() or using iomap.
> 
> What if we pull that information up into the VFS?  Filesystems obviously
> _control_ that information, so need to be able to invalidate entries.
> And we wouldn't want to store all extents in the VFS all the time, so
> would need to have a way to call into the filesystem to populate ranges
> of files.  We'd need to decide how to lock/protect that information
> -- a per-file lock?  A per-extent lock?  No locking, just a seqcount?
> We need a COW bit in the extent which tells the user that this extent
> is fine for reading through, but if there's a write to be done then the
> filesystem needs to be asked to create a new extent.
> 
> There are a few problems I think this can solve.  One is efficient
> implementation of NFS READPLUS.

To expand on this, we're talking about the Linux NFS server's
implementation of the NFSv4.2 READ_PLUS operation, which is
specified here:

  https://www.rfc-editor.org/rfc/rfc7862.html#section-15.10

The READ_PLUS operation can return an array of content segments that
include regular data, holes in the file, or data patterns. Knowing
how the filesystem stores a file would help NFSD identify where it
can return a representation of a hole rather than a string of actual
zeroes, for instance.
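
As a rough userspace illustration of the kind of data/hole map that
would be useful here, the sketch below walks a byte range with
lseek(SEEK_DATA) and lseek(SEEK_HOLE) and records alternating
segments. This is not how NFSD does it or would do it; struct segment
and map_segments() are names made up for this example, and the caller
is assumed to clamp "end" to the file size and not to care that the
file offset gets moved.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical segment descriptor, loosely modelled on a data-or-hole reply. */
struct segment {
	off_t offset;
	off_t length;
	bool is_hole;
};

/*
 * Fill segs[] with the data/hole layout of [start, end) for fd.
 * Returns the number of segments stored (at most max), or -1 on error.
 * Assumes end <= file size; anything past the last data is reported as a hole.
 */
static int map_segments(int fd, off_t start, off_t end,
			struct segment *segs, size_t max)
{
	size_t n = 0;
	off_t pos = start;

	while (pos < end && n < max) {
		/* Find the next byte at or after pos that is backed by data. */
		off_t data = lseek(fd, pos, SEEK_DATA);
		if (data < 0) {
			if (errno != ENXIO)
				return -1;
			data = end;	/* no data left: the rest is a hole */
		}
		if (data > end)
			data = end;

		if (data > pos) {
			/* [pos, data) is a hole. */
			segs[n++] = (struct segment){ pos, data - pos, true };
			pos = data;
			continue;
		}

		/* pos is inside data; find where that data run ends. */
		off_t hole = lseek(fd, pos, SEEK_HOLE);
		if (hole < 0)
			return -1;
		if (hole > end)
			hole = end;
		if (hole <= pos)
			return -1;	/* no forward progress; give up */

		segs[n++] = (struct segment){ pos, hole - pos, false };
		pos = hole;
	}
	return (int)n;
}
```

A caller would open the file, fstat() it to get the size, and pass
[0, st_size) with a reasonably sized segs[] array. On a filesystem
without hole support, the whole range simply comes back as one data
segment.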

> Another is the callback from iomap
> to the filesystem when doing buffered writeback.  A third is having a
> common implementation of FIEMAP.  I've heard rumours that FUSE would like
> something like this, and maybe there are other users that would crop up.
> 
> Anyway, this is as far as my thinking has got on this topic for now.
> Maybe there's a good idea here, maybe it's all a huge overengineered mess
> waiting to happen.  I'm sure other people know this area of filesystems
> better than I do.
> 
> (*) For block device filesystems.  Obviously network filesystems and
> synthetic filesystems don't care and can stop reading now.  Umm, unless
> maybe they _want_ to use it, eg maybe there's a sharded thing going on and
> the fs wants to store information about each shard in the extent cache?
> 

-- 
Chuck Lever