On Wed, Mar 31, 2021 at 2:51 PM Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>
> On Wed, 2021-03-31 at 14:34 -0400, David Wysochanski wrote:
> > On Wed, Mar 31, 2021 at 2:04 PM Trond Myklebust <
> > trondmy@xxxxxxxxxxxxxxx> wrote:
> > >
> > > On Wed, 2021-03-31 at 13:49 -0400, David Wysochanski wrote:
> > > > Trond,
> > > >
> > > > I've been working on getting NFS converted to dhowells' new
> > > > fscache and netfs APIs, and I'm running into a problem with how
> > > > NFS is designed that involves the NFS pagelist.c / pgio API.
> > > > I'd appreciate it if you could review and give your thoughts on
> > > > possible approaches.  I've tried to outline some of the
> > > > possibilities below.  I tried coding option #3 and ran into some
> > > > problems, and it has a serialization limitation.  At this point
> > > > I'm leaning towards option 2, so I'll probably try that approach
> > > > if you don't have time for review or have strong thoughts on it.
> > > >
> > >
> > > I am not going through another redesign of the NFS code in order
> > > to accommodate another cachefs design.  If netfs needs a
> > > refactoring or redesign of the I/O code then it will be
> > > immediately NACKed.
> > >
> > I don't think it will require a redesign.  I was thinking more about
> > adding a flag to nfs_pageio_add_request(), for example, that would
> > return a different value if coalescing of the page being added
> > failed.  So we'd have:
> > nfs_pageio_init(): called 1 time
> > nfs_pageio_add_request(): called N times, one for each page, but
> > stop if coalescing fails
> > nfs_pageio_complete(): called 1 time
>
> > > Why does netfs need to know these details about the NFS code
> > > anyway?
> > >
> > We can probably get by without it, but it will be awkward and
> > probably not the best, though I'm not sure.
> >
> > I tried to explain below with a problem statement, but maybe it was
> > unclear.  The basic design of the netfs API, as it pertains to this
> > problem, is:
> > * issue_op(): calls into the specific netfs (NFS) to obtain the
> > data from the server (send one or more RPCs)
> > * netfs_subreq_terminated(): when the RPC(s) are completed, we need
> > to call the netfs API back to say the data is either there or that
> > there was an error
> >
> That's a problem. That means NFS and netfs need intimate knowledge of
> each other's design, and I'm not going to allow us to go there again.
> We did that with cachefs, and here we are 10 years later doing a full
> redesign. That's unacceptable.
>
I don't think it's a full redesign, and my goal all along has been to
be minimally invasive to existing NFS.

I forgot to mention this part of netfs:
* clamp_length(): netfs calls into NFS here, and we can clamp the
length to a specific size (for example 'rsize' for reads); this is
what I'm trying to see whether I can implement fully or not, but it
looks more complicated due to coalescing failing, etc.  If not, then
there are other possible approaches (NETFS_SREQ_SHORT_READ), but I'm
not sure they are ideal.

> If netfs requires these detailed changes to the NFS code, then that's
> a red flag that the whole design is broken and needs to be revised.
>
> > I would note that, assuming we can come up with something acceptable
> > to NFS, it should simplify both nfs_readpage() and
> > nfs_readpages()/nfs_readahead().
> > I hope we can find some common ground so that it's not too invasive
> > to NFS, and maybe there are some similar improvements in NFS that
> > can be done along with this interface.
> >
> >
> > Thanks.
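
To make the clamp_length() point above concrete, the easy part would
look roughly like the sketch below.  This is only a sketch against the
netfs API as currently posted: the exact ->clamp_length() signature
may change, and nfs_netfs_clamp_length() is a name I made up for
illustration.

/*
 * Sketch only: clamp each subrequest to the mount's rsize so that one
 * issue_op() call maps to at most one READ RPC.  This does not handle
 * the other ways the pgio layer can split an I/O (pNFS layout
 * segments, pg_test() failures, etc.), which is the hard part
 * described below.
 */
static bool nfs_netfs_clamp_length(struct netfs_read_subrequest *subreq)
{
	struct nfs_server *server = NFS_SERVER(subreq->rreq->inode);

	subreq->len = min_t(size_t, subreq->len, server->rsize);
	return true;
}

The hard part is that even an rsize-clamped range can still be split
again inside nfs_pageio_add_request(), which is the problem described
below.
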
> > > >
> > > >
> > > > Problem: The NFS pageio interface does not expose a max read
> > > > length that we can easily use inside the netfs clamp_length()
> > > > function.  As a result, when issue_op() is called indicating a
> > > > single netfs subrequest, this can be split into multiple NFS
> > > > subrequests / RPCs inside the guts of the NFS pageio code.
> > > > Multiple NFS subrequests lead to multiple completions, but the
> > > > netfs API expects a 1:1 mapping between issue_op() and
> > > > netfs_subreq_terminated() calls.
> > > >
> > > > Details of the NFS pageio API (see include/linux/nfs_page.h and
> > > > fs/nfs/pagelist.c)
> > > > Details of the netfs API (see include/linux/netfs.h and
> > > > fs/netfs/read_helper.c)
> > > >
> > > > The NFS pageio API's 3 main calls are as follows:
> > > > 1. nfs_pageio_init(): initialize a pageio structure (R/W IO of
> > > > N pages)
> > > > 2. nfs_pageio_add_request(): called for each page to add to an
> > > > IO
> > > > * Calls nfs_pageio_add_request_mirror -> __nfs_pageio_add_request
> > > > * __nfs_pageio_add_request may call nfs_pageio_doio(), which
> > > > actually sends an RPC over the wire if the page cannot be added
> > > > to the request ("coalesced") due to various factors.  For more
> > > > details, see nfs_pageio_do_add_request() and all the underlying
> > > > code it calls, such as nfs_coalesce_size() and subsequent
> > > > pgio->pg_ops->pg_test() calls
> > > > 3. nfs_pageio_complete(): "complete" the pageio
> > > > * Calls nfs_pageio_complete_mirror -> nfs_pageio_doio()
> > > >
> > > > The NFS pageio API thus may generate multiple over-the-wire
> > > > RPCs, and thus multiple completions, even though at the high
> > > > level only one call to nfs_pageio_complete() is made.
> > > >
> > > > Option 1: Just use the NFS pageio API as is, and deal with
> > > > possible multiple completions.
> > > > - Inconsistent with netfs design intent
> > > > - Need to keep track of the RPC completion status, and, for
> > > > example, if one completes with success and one with an error,
> > > > probably call netfs_subreq_terminated() with the error.
> > > > - There's no way for the caller of the NFS pageio API to know
> > > > how many RPCs, and thus completions, may occur.  Thus, it's
> > > > unclear how one would distinguish between a READ that resulted
> > > > in a single RPC over the wire that completed as a short read,
> > > > and a READ that resulted in multiple RPCs that would each
> > > > complete separately but would eventually complete.
> > > >
> > > > Option 2: Create a more complex clamp_length() function for
> > > > NFS, taking into account all the ways the NFS / pNFS code can
> > > > split a read.
> > > > + Consistent with netfs design intent
> > > > + Multiple "split" requests would be issued in parallel (see
> > > > the loop inside netfs_readahead(), which repeatedly calls
> > > > netfs_rreq_submit_slice)
> > > > - Looks impossible without refactoring of the NFS pgio API.  We
> > > > need to prevent nfs_pageio_add_request() from calling
> > > > nfs_pageio_doio(), and return some indication that coalescing
> > > > failed.  In addition, it may run into problems with fallback
> > > > from DS to MDS, for example (see commit
> > > > d9156f9f364897e93bdd98b4ad22138de18f7c24).
> > > >
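
Since I'm leaning towards option 2, here is a rough sketch of the
caller side of the nfs_pageio_add_request() change I floated above.
Illustration only: nfs_pageio_add_request_stop() and
nfs_issue_one_rpc() are invented names, the -EAGAIN convention is
arbitrary, and error handling is elided.  The point is just that a
coalescing failure ends the loop instead of triggering
nfs_pageio_doio():

/*
 * Sketch only: issue at most one READ RPC for a batch of pages.
 * Proposed semantics for the new call: add the page if it coalesces,
 * otherwise return -EAGAIN *without* sending an RPC, so the caller
 * controls where the subrequest ends.
 */
static void nfs_issue_one_rpc(struct nfs_open_context *ctx,
			      struct inode *inode,
			      struct page **pages, unsigned int npages)
{
	struct nfs_pageio_descriptor pgio;
	unsigned int i;

	nfs_pageio_init_read(&pgio, inode, false,
			     &nfs_async_read_completion_ops);
	for (i = 0; i < npages; i++) {
		struct nfs_page *req;

		req = nfs_create_request(ctx, pages[i], 0, PAGE_SIZE);
		if (IS_ERR(req))
			break;
		if (nfs_pageio_add_request_stop(&pgio, req) == -EAGAIN) {
			/* Coalescing failed: end the subrequest here. */
			nfs_release_request(req);
			break;
		}
	}
	nfs_pageio_complete(&pgio);	/* at most one wire RPC */
}

That would keep the 1:1 mapping between issue_op() and
netfs_subreq_terminated() without netfs needing to know anything
about pgio internals.
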
> > > > Option 3: Utilize the NETFS_SREQ_SHORT_READ flag as needed.
> > > > + Consistent with netfs design intent
> > > > - Multiple "split" requests would be serialized (see the code
> > > > paths inside netfs_subreq_terminated() that check for this
> > > > flag).
> > > > - Looks impossible without some refactoring of the NFS pgio
> > > > API.
> > > > * Notes: Terminate the NFS pageio page-based loop at the first
> > > > call to nfs_pageio_doio().  When a READ completes, NFS calls
> > > > netfs_subreq_terminated() with NETFS_SREQ_SHORT_READ set and is
> > > > prepared to have the rest of the subrequest be resubmitted.
> > > > We'd need to somehow fail early, or avoid entirely, subsequent
> > > > calls to nfs_pageio_doio() for the original request, though,
> > > > and handle error status only from the first RPC.
> > > >
> > > > Option 4: Add some final completion routine to be called near
> > > > the bottom of nfs_pageio_complete(); it would pass in at least
> > > > the netfs_read_subrequest, possibly the nfs_pageio_descriptor.
> > > > - Inconsistent with netfs design intent
> > > > - Would be a new NFS API or call on top of everything
> > > > - Need to handle the "multiple completions with different
> > > > status" problem (see option 1).
> > > >
> > >
> > > --
> > > Trond Myklebust
> > > Linux NFS client maintainer, Hammerspace
> > > trond.myklebust@xxxxxxxxxxxxxxx
> > >
> > >
>
> --
> Trond Myklebust
> CTO, Hammerspace Inc
> 4984 El Camino Real, Suite 208
> Los Altos, CA 94022
>
> www.hammer.space
>
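
For completeness, the completion side I have in mind looks roughly
like the sketch below, whichever of options 2 or 3 we pick.  Again,
illustration only: nfs_netfs_read_done() is an invented name,
hdr->netfs is an invented backpointer that would have to be stashed
when the RPC is set up, and the netfs_subreq_terminated() signature is
taken from the currently posted series, so it may change.

/*
 * Sketch only: one READ RPC completes, so terminate the one netfs
 * subrequest it maps to.
 */
static void nfs_netfs_read_done(struct nfs_pgio_header *hdr)
{
	struct netfs_read_subrequest *subreq = hdr->netfs;

	if (hdr->error) {
		netfs_subreq_terminated(subreq, hdr->error, false);
		return;
	}
	/*
	 * If pgio split the I/O and this RPC covered only part of the
	 * subrequest, reporting the partial byte count should send
	 * netfs down its short-read path (NETFS_SREQ_SHORT_READ), so
	 * the remainder gets resubmitted, at the cost of the
	 * serialization noted under option 3.
	 */
	netfs_subreq_terminated(subreq, hdr->res.count, false);
}
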