On Wed, Aug 28, 2024 at 08:34:00PM +0100, Matthew Wilcox wrote:
> Today it is the responsibility of each filesystem to maintain the mapping
> from file logical addresses to disk blocks (*).  There are various ways
> to query that information, eg calling get_block() or using iomap.
> 
> What if we pull that information up into the VFS?

We explicitly pulled that information out of the VFS by moving away
from per-page bufferheads that stored the disk mapping for the cached
data to the on-demand, query-based iomap infrastructure.

> Filesystems obviously
> _control_ that information, so need to be able to invalidate entries.

Which is one of the reasons for keeping it out of the VFS...

Besides, which set of mapping information held by the filesystem are
we talking about here?

FYI: XFS has *three* sets of mapping information per inode - the data
fork, the xattr fork and the COW fork. The data fork and the COW fork
both reference file data mappings, and they can overlap whilst there
are COW operations ongoing. Regular files can also have xattr fork
mappings.

Further, directories and symlinks have both data and xattr fork based
mappings, and they do not use the VFS for caching metadata - that is
all internal to the filesystem. Hence if we move to caching mapping
information in the VFS, we have to expose the VFS inode all the way
down into the lowest layers of the XFS metadata subsystem, where there
is currently nothing that is Linux/VFS specific. IOWs, if we don't
cache mapping information natively in the filesystem, we are forcing
filesystems to drill VFS structures deep into their internal metadata
implementations.

Hence if you're thinking purely about caching file data mappings at
the VFS, then what you're asking filesystems to support is multiple
extent map caching schemes instead of just one.

And I'm largely ignoring the transactional change requirements for
extent maps, and how a VFS cache would place the cache locking both
above and below transaction boundaries. And then there's the
inevitable shrinker interface for reclaim of cached VFS extent maps,
and the placement of that locking both above and below memory
allocation. That's a recipe for lockdep false positives all over the
place...

> And we wouldn't want to store all extents in the VFS all the time, so
> would need to have a way to call into the filesystem to populate ranges
> of files.

This would require substantial modification to filesystems like XFS
that assume the mapping cache is always fully populated before a
lookup or modification is done. It's not just a case of "make sure
this range is populated"; it's also a change to the entire locking
model for extent map access, because cache population under a shared
lock is inherently racy.

> We'd need to decide how to lock/protect that information
> -- a per-file lock?  A per-extent lock?  No locking, just a seqcount?

Right now XFS uses a private per-inode metadata rwsem for exclusion,
and we generally don't have terrible contention problems with that
strategy. Other filesystems use private rwsems, too, but often they
only protect mapping operations, not all of the metadata in the
inode. Still other filesystems use per-extent locking. As such, I'm
not sure there is a "one size fits all" model here...

> We need a COW bit in the extent which tells the user that this extent
> is fine for reading through, but if there's a write to be done then the
> filesystem needs to be asked to create a new extent.

It's more than that - we need somewhere to hold the COW extent
mappings that we've allocated and that overlap existing data mappings.
We do delayed allocation and/or preallocation with allocate-around for
COW to minimise fragmentation. Hence we have concurrent mappings for
the same file range: the existing data, and where the dirty cached
data is going to be placed when it is finally written. Then, on IO
completion, we do the transactional update to punch out the old data
extent and swap in the new data extent from the COW fork where we just
wrote the new data.

IOWs, managing COW mappings is much more complex than a simple flag
that says "this range needs allocation on writeback". Yes, we can do
unwritten extents like that (i.e. a simple flag in the extent to say
"do unwritten extent conversion on IO completion"), but COW is much,
much more complex...

> There are a few problems I think this can solve.  One is efficient
> implementation of NFS READPLUS.

How "inefficient" is an iomap implementation? It iterates one extent
at a time, and a readplus iterator can simply encode data and holes as
it queries the range one extent at a time, right?
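As a very rough, untested sketch of what I mean - the encode_hole()
and encode_data() helpers are made-up stand-ins for whatever XDR
encoding nfsd actually does, and dirty page cache state over
unwritten/delalloc extents is ignored for clarity:

#include <linux/fs.h>
#include <linux/iomap.h>

/* Hypothetical nfsd encoding helpers - not real functions. */
static int encode_hole(loff_t pos, u64 len);
static int encode_data(struct inode *inode, loff_t pos, u64 len);

/*
 * Walk [pos, pos + count) one extent at a time and encode each range
 * as either a hole or data. No extent cache above the filesystem is
 * needed - the fs ->iomap_begin method is queried for each extent as
 * we go.
 */
static int read_plus_encode_range(struct inode *inode, loff_t pos,
		u64 count, const struct iomap_ops *ops)
{
	struct iomap_iter iter = {
		.inode	= inode,
		.pos	= pos,
		.len	= count,
		.flags	= IOMAP_REPORT,
	};
	int ret;

	while ((ret = iomap_iter(&iter, ops)) > 0) {
		u64 length = iomap_length(&iter);

		/*
		 * Treat unwritten extents as holes - this ignores dirty
		 * page cache over unwritten extents for simplicity.
		 */
		if (iter.iomap.type == IOMAP_HOLE ||
		    iter.iomap.type == IOMAP_UNWRITTEN)
			ret = encode_hole(iter.pos, length);
		else
			ret = encode_data(inode, iter.pos, length);

		/* Advance to the next extent, or propagate the error. */
		iter.processed = ret < 0 ? ret : length;
	}
	return ret;
}

i.e. the filesystem remains the sole authority for the mapping, and we
just query it one extent at a time as we encode.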
> Another is the callback from iomap
> to the filesystem when doing buffered writeback.

Filesystems need to do COW setup work or delayed allocation here, so
we have to call into the filesystem regardless of whether there is a
VFS mapping cache or not. In that case the callout requires exclusive
locking, but if it's an overwrite the callout only needs shared
locking. However, until we call into the filesystem we don't know
which operation we have to perform or which type of locks we have to
take, because the extent map can change until we hold the internal
extent map lock...

Fundamentally, I don't want operations like truncate, hole punch, etc.
to have to grow *another* lock. We currently have to take the inode
lock, the invalidate lock and internal metadata locks to lock
everything out. With an independent mapping cache, we're also going to
have to take that cache's lock, especially if things like writeback
only use the mapping cache lock.

> A third is having a
> common implementation of FIEMAP.

We've already got that with iomap.

> I've heard rumours that FUSE would like
> something like this, and maybe there are other users that would crop up.
> 
> Anyway, this is as far as my thinking has got on this topic for now.
> Maybe there's a good idea here, maybe it's all a huge overengineered mess
> waiting to happen.  I'm sure other people know this area of filesystems
> better than I do.

Caching mapping state in the VFS has proven to be less than ideal in
the past for reasons of coherency and resource usage. We've explicitly
moved away from that model to an extent-query model with iomap, and
right now I'm not seeing any advantages or additional functionality
that caching extent maps in the VFS would bring over the existing
iomap model...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx