Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:

> > I've been working on getting NFS converted to dhowells new fscache and
> > netfs APIs and running into a problem with how NFS is designed and it
> > involves the NFS pagelist.c / pgio API.  I'd appreciate it if you could
> > review and give your thoughts on possible approaches.  I've tried to
> > outline some of the possibilities below.  I tried coding option #3 and
> > ran into some problems, and it has a serialization limitation.  At this
> > point I'm leaning towards option 2, so I'll probably try that approach
> > if you don't have time for review or have strong thoughts on it.
>
> I am not going through another redesign of the NFS code in order to
> accommodate another cachefs design.  If netfs needs a refactoring or
> redesign of the I/O code then it will be immediately NACKed.
>
> Why does netfs need to know these details about the NFS code anyway?

There are some issues we have to deal with in fscache - and some
opportunities.

 (1) The way cachefiles reads data from the cache is very hacky (calling
     readpage on the backing filesystem and then installing an interceptor
     on the waitqueue for the PG_locked page flag on that page, then
     memcpying the page in a worker thread) - but it was the only way to do
     it at the time.

     Unfortunately, it's fragile and it seems just occasionally the wake
     event is missed.

     Since then, kiocb has come along.  I really want to switch to using
     this to read/write the cache.  It's a lot more robust and also allows
     async DIO to be performed, also cutting out the memcpy.

     Changing the fscache IO part of the API would make this easier.

 (2) The way cachefiles finds out whether data is present (using bmap) is
     not viable on ext4 or xfs and has to be changed.  This means I have to
     keep track of the presence of data myself, separately from the backing
     filesystem's own metadata.

     To reduce the amount of metadata I need to keep track of and store, I
     want to increase the cache granularity - that is, I will only store,
     say, blocks of 256K.  But that needs to feed back up to the filesystem
     so that it can ask the VM to expand the readahead.

 (3) VM changes are coming that affect the filesystem address space
     operations.  THP is already here, though not rolled out into all
     filesystems yet.  Folios are (probably) on their way.  These manage
     with page aggregation.  There's a new readahead interface function too.

     This means, however, that you might get an aggregate page that is
     partially cached.  In addition, the current fscache IO API cannot deal
     with these.

     I think only 9p, afs, ceph, cifs, nfs plus orangefs don't support THPs
     yet.  The first five Willy has held off on because fscache is a
     complication and there's an opportunity to make a single solution that
     fits all five.

     Also to this end, I'm trying to make it so that fscache doesn't retain
     any pointers back into the network filesystem structures, beyond the
     info provided to perform a cache op - and that is only required on a
     transient basis.

 (4) I'd like to be able to encrypt the data stored in the local cache, and
     Jeff Layton is adding support for fscrypt to ceph.  It would be nice if
     we could share the solution with all of the aforementioned bunch of
     five filesystems by putting it into the common library.

So with the above, there is an opportunity to abstract the handling of the
VM I/O ops for network filesystems - 9p, afs, ceph, cifs and nfs - into a
common library that handles the VM I/O ops and translates them to RPC calls,
cache reads and cache writes.
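
To give a rough idea of the shape I'm thinking of for that library - and to
be clear, all the names below are invented for illustration, not an actual
API - the network filesystem would hand the library a small ops table and
the library would drive the address_space ops, deciding for each chunk of a
request whether it can be read from the cache or needs the filesystem to
issue an RPC, and writing freshly fetched data back into the cache:

/* Purely illustrative - none of these names exist.  The filesystem fills
 * in a small ops table; the library does the rest.
 */
struct netfs_lib_read_request;
struct netfs_lib_subrequest;

struct netfs_lib_ops {
	/* Let the fs/cache expand the proposed read (e.g. out to rsize or
	 * cache granule boundaries) before it gets sliced up. */
	void (*expand_readahead)(struct netfs_lib_read_request *rreq);

	/* Clamp one slice of the request, e.g. to the server's rsize. */
	bool (*clamp_length)(struct netfs_lib_subrequest *subreq);

	/* Fire off an RPC to fill one slice from the server. */
	void (*issue_read)(struct netfs_lib_subrequest *subreq);
};

The kiocb-based cache reads and writes from (1) would then live inside the
library too, so none of the five filesystems would have to care how the
cache does its I/O.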
The thought is that we should be able to push the aggregation of pages into
RPC calls there, handle rsize/wsize and allow requests to be sliced up so
that they can be distributed to multiple servers (works for ceph), so that
all five filesystems can get the same benefits in one go.

Btw, I'm also looking at changing the way indexing works, though that should
only very minorly alter the nfs code and doesn't require any restructuring.
I've simplified things a lot and I'm hoping to remove a couple of thousand
lines from fscache and cachefiles.

David
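
P.S. To illustrate the slicing very roughly (again, entirely made-up names,
just to show the shape): the library would carve a read up into subrequests
no larger than rsize (or the cache granule) and point each one at whichever
source - the cache or one of the servers - is appropriate:

struct netfs_lib_read_request {
	unsigned long long start;	/* file offset of the request */
	unsigned long long len;		/* total bytes to read */
};

/* Pick a source for one slice and kick off an async read of it. */
void netfs_lib_issue_slice(struct netfs_lib_read_request *rreq,
			   unsigned long long start,
			   unsigned long long len);

static void netfs_lib_slice_request(struct netfs_lib_read_request *rreq,
				    unsigned long long rsize)
{
	unsigned long long start = rreq->start;
	unsigned long long remaining = rreq->len;

	while (remaining) {
		unsigned long long part =
			remaining < rsize ? remaining : rsize;

		netfs_lib_issue_slice(rreq, start, part);

		start += part;
		remaining -= part;
	}
}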