Re: RFC: Approaches to resolve netfs API interface to NFS multiple completions problem

On Thu, Apr 01, 2021 at 02:51:06PM +0100, David Howells wrote:
>  (1) The way cachefiles reads data from the cache is very hacky (calling
>      readpage on the backing filesystem and then installing an interceptor on
>      the waitqueue for the PG_locked page flag on that page, then memcpying
>      the page in a worker thread) - but it was the only way to do it at the
>      time.  Unfortunately, it's fragile and it seems just occasionally the
>      wake event is missed.
> 
>      Since then, kiocb has come along.  I really want to switch to using
>      this to read/write the cache.  It's a lot more robust and also allows
>      async DIO to be performed, cutting out the memcpy.
> 
>      Changing the fscache IO part of the API would make this easier.

I agree with this.  The current way that fscache works is grotesque.
It knows far too much about the inner workings of, well, everything.
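
For anyone following along, here's roughly what a kiocb-based cache read
could look like.  This is a hedged sketch, not the actual cachefiles code:
all example_* names are made up, and it uses vfs_iocb_iter_read() with the
three-argument ->ki_complete() as found in current kernels.

#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/uio.h>

struct example_cache_read {
	struct kiocb	iocb;		/* embedded kiocb for the backing file */
	void		(*done)(void *priv, ssize_t transferred);
	void		*priv;
};

static void example_read_complete(struct kiocb *iocb, long ret, long ret2)
{
	struct example_cache_read *req =
		container_of(iocb, struct example_cache_read, iocb);

	req->done(req->priv, ret);
	kfree(req);
}

static ssize_t example_cache_read(struct file *cache_file, loff_t pos,
				  struct iov_iter *iter,
				  void (*done)(void *, ssize_t), void *priv)
{
	struct example_cache_read *req;
	ssize_t ret;

	req = kzalloc(sizeof(*req), GFP_KERNEL);
	if (!req)
		return -ENOMEM;

	req->iocb.ki_filp     = cache_file;
	req->iocb.ki_pos      = pos;
	req->iocb.ki_flags    = IOCB_DIRECT;	/* async DIO: no memcpy */
	req->iocb.ki_complete = example_read_complete;
	req->done             = done;
	req->priv             = priv;

	ret = vfs_iocb_iter_read(cache_file, &req->iocb, iter);
	if (ret != -EIOCBQUEUED) {
		/* Completed (or failed) synchronously; report inline. */
		done(priv, ret);
		kfree(req);
	}
	return ret;
}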

>  (3) VM changes are coming that affect the filesystem address space
>      operations.  THP is already here, though not rolled out into all
>      filesystems yet.  Folios are (probably) on their way.  Both manage
>      memory by aggregating pages.  There's a new readahead interface
>      function too.
> 
>      This means, however, that you might get an aggregate page that is
>      partially cached.  In addition, the current fscache IO API cannot deal
>      with these.  I think only 9p, afs, ceph, cifs and nfs, plus orangefs,
>      don't support THPs yet.  Willy has held off on the first five because
>      fscache is a complication and there's an opportunity to make a single
>      solution that fits all five.

This isn't quite up to date:

 - The new readahead interface went into Linux in June 2020.
   All filesystems were converted from readpages to readahead except
   for the five above that use fscache.  It would be nice to remove the
   readpages operation, but I'm trying not to get in anyone's way here,
   and the fscache->netfs transition was already underway.  (There's a
   sketch of the new interface after this list.)
 - THPs in Linux today are available to precisely one filesystem --
   shmem/tmpfs.  There are problems all over the place with using THPs
   for non-in-memory filesystems.
 - My THP work reached proof-of-concept status.  I got some of the
   prerequisite bits in, but have now stopped working on it (see the
   next point).  It works pretty darned well, but only on XFS.  I did
   some work towards enabling it on NFS, but never tested it.  There's
   a per-filesystem enable bit (sketched after this list), so in theory
   NFS never needs to be converted.  In practice, you're going to want
   to convert it for the performance boost.
 - I'm taking the lessons learned from the THP work (it's confusing
   when a struct page may refer to part of a large memory allocation
   or all of it) and introducing a new data type, struct folio, to
   refer to chunks of memory in the page cache (see the completion
   sketch after this list).  All filesystems are going to have to be
   converted to the new API, so the fewer places filesystems actually
   deal with struct page, the easier the transition becomes.
 - I don't understand how a folio gets to be partially cached.  Cached
   should be tracked on a per-folio basis (like dirty or uptodate), not
   on a per-page basis.  The point of the folio work is that managing
   memory in page-sized chunks is now too small for good performance.
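
Since the readahead conversion came up above, here's a minimal sketch of
the ->readahead() address_space op that replaced ->readpages().  The
example_* helper is hypothetical; the readahead_page()/put_page() loop
is the real pattern.

#include <linux/pagemap.h>

/* Hypothetical: starts an async read and unlocks @page on completion. */
void example_start_async_read(struct file *file, struct page *page,
			      loff_t pos);

static void example_readahead(struct readahead_control *rac)
{
	struct page *page;

	/* Pages arrive locked with an elevated refcount; we start the
	 * I/O (which unlocks each page when it completes) and drop our
	 * reference.
	 */
	while ((page = readahead_page(rac))) {
		example_start_async_read(rac->file, page,
					 page_offset(page));
		put_page(page);
	}
}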
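
The per-filesystem enable bit, in the form the API eventually settled on
(the THP patchset spelled it differently, so treat the name as an
assumption): each filesystem opts in per inode.

static void example_fs_setup_inode(struct inode *inode)
{
	/* Opt this inode's page cache into multi-page folios; an
	 * unaudited filesystem simply never calls this.
	 */
	mapping_set_large_folios(inode->i_mapping);
}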
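
And to make the per-folio state point concrete: read completion in the
folio model looks something like this (example_read_done is made up, the
folio calls are the real API).  One flag covers the whole folio, however
many pages it spans.

static void example_read_done(struct folio *folio, int err)
{
	/* One uptodate bit for the whole folio; no per-page tracking. */
	if (!err)
		folio_mark_uptodate(folio);
	folio_unlock(folio);
}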

> So with the above, there is an opportunity to abstract handling of the VM I/O
> ops for network filesystems - 9p, afs, ceph, cifs and nfs - into a common
> library that handles VM I/O ops and translates them to RPC calls, cache reads
> and cache writes.  The thought is that we should be able to push the
> aggregation of pages into RPC calls there, handle rsize/wsize, and allow
> requests to be sliced up and distributed to multiple servers (this works
> for ceph), so that all five filesystems get the same benefits in one go.

If NFS wants to do its own handling of rsize/wsize, could it?  That is,
if the VM passes it a 2MB page and says "read it", and the server has
an rsize of 256kB, could NFS split it up and send its own stream of 8
requests, or does it have to use fscache to do that?
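
Whichever layer ends up doing the slicing, the mechanics are simple.  A
hedged sketch, with made-up example_* names standing in for whatever the
netfs library or NFS itself would provide:

#include <linux/minmax.h>
#include <linux/types.h>

struct example_io;
void example_send_read_rpc(struct example_io *io, loff_t pos, size_t len);

static void example_split_read(struct example_io *io, loff_t pos,
			       size_t len, size_t rsize)
{
	/* Cover [pos, pos + len) with a stream of rsize-limited RPCs. */
	while (len) {
		size_t chunk = min(len, rsize);

		example_send_read_rpc(io, pos, chunk);
		pos += chunk;
		len -= chunk;
	}
}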



