On Wed, Aug 28, 2024 at 08:50:47PM +0100, Matthew Wilcox wrote:
> On Wed, Aug 28, 2024 at 03:46:34PM -0400, Chuck Lever wrote:
> > On Wed, Aug 28, 2024 at 08:34:00PM +0100, Matthew Wilcox wrote:
> > > There are a few problems I think this can solve. One is
> > > efficient implementation of NFS READPLUS.
> >
> > To expand on this, we're talking about the Linux NFS server's
> > implementation of the NFSv4.2 READ_PLUS operation, which is
> > specified here:
> >
> >   https://www.rfc-editor.org/rfc/rfc7862.html#section-15.10
> >
> > The READ_PLUS operation can return an array of content segments
> > that include regular data, holes in the file, or data patterns.
> > Knowing how the filesystem stores a file would help NFSD identify
> > where it can return a representation of a hole rather than a
> > string of actual zeroes, for instance.
>
> Thanks for the reference; I went looking for it and found only the
> draft.
>
> Another thing this could help with is reducing page cache usage for
> very sparse files. Today if we attempt to read() or page fault on a
> file hole, we allocate a fresh page of memory and ask the filesystem
> to fill it. The filesystem notices that it's a hole and calls
> memset(). If the VFS knew that the extent was a hole, it could use
> the shared zero page instead. Don't know how much of a performance
> win this would be, but it might be useful.

Ah. OK. Maybe I see the reason you are asking this question now.

Buffered reads are still based on the old page-cache-first IO
mechanisms, so doing smart stuff with "filesystem things" is
difficult. i.e. readahead allocates folios for the readahead range
before it asks the filesystem to do the readahead IO, so it is
unaware of how the file is laid out. Hence it can't do smart things
with holes. And it paints the filesystems into a corner, too,
because they can't modify the set of folios they need to fill with
data. Hence the filesystem can't do smart things with holes or
unwritten extents, either.

To solve this, the proposal being made is to lift the filesystem
mapping information up into "the VFS" so that the existing buffered
read code has awareness of the file mapping. That allows the page
cache code to do smarter things, e.g. special-case folio
instantiation w.r.t. sparse files (amongst other things).

Have I got that right?

If so, then we've been here before, and we've solved these problems
by inverting the IO path operations. i.e. we do filesystem mapping
operations first, then populate the page cache based on the mapping
being returned. This is how the iomap buffered write path works,
and that's the reason it supports all the modern filesystem goodies
relatively easily.

The exception to this model in iomap is buffered reads (i.e.
readahead). We still just do what the page cache asks us to do, and
clearly that is now starting to hurt us in the same way the page
cache centric IO model was hurting us for buffered writes a decade
ago.

So, can we invert readahead like we did with buffered writes? That
is, we hand the readahead window that needs to be filled (i.e. a
{mapping, pos, len} tuple) to the filesystem (iomap), which can then
iterate mappings over the readahead range. iomap_iter_readahead()
can then populate the page cache with appropriately sized folios and
do the IO, or use the zero page when over a hole or unwritten
extent...
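To make that concrete, here's the rough shape I'm thinking of. To be
clear, this is a sketch, not an implementation: iomap_iter_readahead()
doesn't exist, and the two range helpers it calls are imaginary names
for illustration. Only the iomap_iter() iterator machinery is real:

#include <linux/iomap.h>
#include <linux/pagemap.h>

/* Imaginary helpers, named purely for illustration. */
static s64 iomap_readahead_zero_range(struct iomap_iter *iter);
static s64 iomap_readahead_fill_range(struct iomap_iter *iter);

static void iomap_iter_readahead(struct readahead_control *rac,
                                 const struct iomap_ops *ops)
{
        struct iomap_iter iter = {
                .inode  = rac->mapping->host,
                .pos    = readahead_pos(rac),
                .len    = readahead_length(rac),
        };

        while (iomap_iter(&iter, ops) > 0) {
                switch (iter.iomap.type) {
                case IOMAP_HOLE:
                case IOMAP_UNWRITTEN:
                        /*
                         * No data to read for this extent: point the
                         * page cache at the shared zero page instead
                         * of allocating and zeroing new folios.
                         */
                        iter.processed = iomap_readahead_zero_range(&iter);
                        break;
                default:
                        /*
                         * Allocate appropriately sized folios over
                         * this extent and submit the read IO.
                         */
                        iter.processed = iomap_readahead_fill_range(&iter);
                        break;
                }
        }
        /* Readahead IO errors are ignored, same as today. */
}

The important part is the inversion: the extent map drives folio
instantiation and IO submission, instead of a pre-instantiated set
of folios dictating what the filesystem can do. Mapping the shared
zero page over holes is the hand-wavy bit, and it doesn't come for
free.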
Note that optimisations like zero-page-over-holes also need write
path changes. We'd need to change iomap_get_folio() to tell
__filemap_get_folio() to replace zero pages with newly allocated
writeable folios during write operations...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx