On Wed, Aug 28, 2024 at 08:50:47PM +0100, Matthew Wilcox wrote:
> On Wed, Aug 28, 2024 at 03:46:34PM -0400, Chuck Lever wrote:
> > On Wed, Aug 28, 2024 at 08:34:00PM +0100, Matthew Wilcox wrote:
> > > There are a few problems I think this can solve. One is
> > > efficient implementation of NFS READPLUS.
> >
> > To expand on this, we're talking about the Linux NFS server's
> > implementation of the NFSv4.2 READ_PLUS operation, which is
> > specified here:
> >
> >   https://www.rfc-editor.org/rfc/rfc7862.html#section-15.10
> >
> > The READ_PLUS operation can return an array of content segments
> > that include regular data, holes in the file, or data patterns.
> > Knowing how the filesystem stores a file would help NFSD identify
> > where it can return a representation of a hole rather than a
> > string of actual zeroes, for instance.
>
> Thanks for the reference; I went looking for it and found only the
> draft.
>
> Another thing this could help with is reducing page cache usage for
> very sparse files. Today if we attempt to read() or page fault on a
> file hole, we allocate a fresh page of memory and ask the filesystem
> to fill it. The filesystem notices that it's a hole and calls
> memset(). If the VFS knew that the extent was a hole, it could use
> the shared zero page instead. Don't know how much of a performance
> win this would be, but it might be useful.

Ah. OK. Maybe I see the reason you are asking this question now.

Buffered reads are still based on the old page-cache-first IO
mechanisms, so doing smart stuff with "filesystem things" is
difficult. i.e. readahead allocates folios for the readahead range
before it asks the filesystem to do the readahead IO, so it is
unaware of how the file is laid out. Hence it can't do smart things
with holes. And it paints the filesystems into a corner, too,
because they can't modify the set of folios they need to fill with
data. Hence the filesystem can't do smart things with holes or
unwritten extents, either.

To solve this, the proposal being made is to lift the filesystem
mapping information up into "the VFS" so that the existing buffered
read code has awareness of the file mapping. That allows the page
cache code to do smarter things, e.g. special-case folio
instantiation w.r.t. sparse files (amongst other things).

Have I got that right?

If so, then we've been here before, and we've solved these problems
by inverting the IO path operations. i.e. we do filesystem mapping
operations first, then populate the page cache based on the mapping
being returned. This is how the iomap buffered write path works,
and that's the reason it supports all the modern filesystem goodies
relatively easily.

The exception to this model in iomap is buffered reads (i.e.
readahead). We still just do what the page cache asks us to do, and
clearly that is now starting to hurt us in the same way the page
cache centric IO model was hurting us for buffered writes a decade
ago.

So, can we invert readahead like we did with buffered writes? That
is, we hand the readahead window that needs to be filled (i.e. a
{mapping, pos, len} tuple) to the filesystem (iomap), which can then
iterate mappings over the readahead range. iomap_iter_readahead()
can then populate the page cache with appropriately sized folios and
do the IO, or use the zero page when over a hole or unwritten
extent...
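To make that concrete, here's the rough shape I'm thinking of. To be
clear, this is a sketch, not an implementation: iomap_iter_readahead()
doesn't exist, and the two range helpers it calls are imaginary names
for illustration. Only the iomap_iter() iterator machinery is real:

#include <linux/iomap.h>
#include <linux/pagemap.h>

/* Imaginary helpers, named purely for illustration. */
static s64 iomap_readahead_zero_range(struct iomap_iter *iter);
static s64 iomap_readahead_fill_range(struct iomap_iter *iter);

static void iomap_iter_readahead(struct readahead_control *rac,
                                 const struct iomap_ops *ops)
{
        struct iomap_iter iter = {
                .inode  = rac->mapping->host,
                .pos    = readahead_pos(rac),
                .len    = readahead_length(rac),
        };

        while (iomap_iter(&iter, ops) > 0) {
                switch (iter.iomap.type) {
                case IOMAP_HOLE:
                case IOMAP_UNWRITTEN:
                        /*
                         * No data to read for this extent: point the
                         * page cache at the shared zero page instead
                         * of allocating and zeroing new folios.
                         */
                        iter.processed = iomap_readahead_zero_range(&iter);
                        break;
                default:
                        /*
                         * Allocate appropriately sized folios over
                         * this extent and submit the read IO.
                         */
                        iter.processed = iomap_readahead_fill_range(&iter);
                        break;
                }
        }
        /* Readahead IO errors are ignored, same as today. */
}

The important part is the inversion: the extent map drives folio
instantiation and IO submission, instead of a pre-instantiated set
of folios dictating what the filesystem can do. Mapping the shared
zero page over holes is the hand-wavy bit, and it doesn't come for
free.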
Note that optimisations like zero-page-over-holes also need write
path changes. We'd need to change iomap_get_folio() to tell
__filemap_get_folio() to replace zero pages with newly allocated
writeable folios during write operations...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx