Re: [PATCH v2 2/4] NFSD: Add READ_PLUS support for data segments

On Fri, Feb 06, 2015 at 03:07:08PM -0500, Chuck Lever wrote:
> 
> On Feb 6, 2015, at 2:35 PM, J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
> 
> > On Fri, Feb 06, 2015 at 01:44:15PM -0500, Chuck Lever wrote:
> >> 
> >> On Feb 6, 2015, at 12:59 PM, J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
> >> 
> >>> On Fri, Feb 06, 2015 at 12:04:13PM -0500, Chuck Lever wrote:
> >>>> 
> >>>> On Feb 6, 2015, at 11:46 AM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
> >>>> 
> >>>>> 
> >>>>> On Feb 6, 2015, at 11:08 AM, J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
> >>>>> 
> >>>>>> On Fri, Feb 06, 2015 at 03:54:56AM -0800, Christoph Hellwig wrote:
> >>>>>>> On Thu, Feb 05, 2015 at 11:43:46AM -0500, Anna Schumaker wrote:
> >>>>>>>>> The problem is that the typical case of all data won't use splice
> >>>>>>>>> ever with your patches, as the 4.2 client will always send a READ_PLUS.
> >>>>>>>>> 
> >>>>>>>>> So we'll have to find a way to use it where it helps.  While we might be
> >>>>>>>>> able to add some hacks to only use splice for the first segment I guess
> >>>>>>>>> we just need to make the splice support generic enough in the long run.
> >>>>>>>>> 
> >>>>>>>> 
> >>>>>>>> I should be able to detect easily enough when we're only returning a single DATA segment, and use splice in that case.
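
A minimal sketch of the check described just above, assuming a kernel
context; the helper name and its use of vfs_llseek()/SEEK_HOLE are
illustrative only, not code from this patch series:

#include <linux/types.h>
#include <linux/fs.h>

/*
 * Sketch only: decide whether a READ_PLUS request covers nothing but
 * data, so the reply would be a single DATA segment and the existing
 * splice read path could be reused.
 */
static bool read_plus_is_one_data_segment(struct file *file,
					  loff_t offset, unsigned long count)
{
	loff_t next_hole;

	/* Find the first hole at or after the requested offset. */
	next_hole = vfs_llseek(file, offset, SEEK_HOLE);
	if (next_hole < 0)
		return false;	/* on error, fall back to segment encoding */

	/* All data iff the next hole begins at or past the end of the range. */
	return next_hole >= offset + (loff_t)count;
}
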
> >>>>>>> 
> >>>>>>> You could also elect to never return more than one data segment as a
> >>>>>>> start:
> >>>>>>> 
> >>>>>>> In all situations, the
> >>>>>>> server may choose to return fewer bytes than specified by the client.
> >>>>>>> The client needs to check for this condition and handle the
> >>>>>>> condition appropriately.
> >>>>>> 
> >>>>>> Yeah, I think that was more-or-less what Anna's first attempt did and I
> >>>>>> said "what if that means more round trips"?  The client can't anticipate
> >>>>>> the short reads so it can't make up for this with parallelism.
> >>>>>> 
> >>>>>>> But doing any of these for a call that's really just an optimization
> >>>>>>> sounds odd.  I'd really like to see an evaluation of the READ_PLUS
> >>>>>>> impact on various workloads before offering it.
> >>>>>> 
> >>>>>> Yes, unfortunately I don't see a way to make this just an obvious win.
> >>>>> 
> >>>>> I don’t think a “win” is necessary. It simply needs to be no worse than
> >>>>> READ for current use cases.
> >>>>> 
> >>>>> READ_PLUS should be a win for the particular use cases it was
> >>>>> designed for (large sparsely-populated datasets). Without a
> >>>>> demonstrated benefit I think there’s no point in keeping it.
> >>>>> 
> >>>>>> (Is there any way we could make it so with better protocol?  Maybe RDMA
> >>>>>> could help get the alignment right in multiple-segment cases?  But then
> >>>>>> I think there needs to be some sort of language about RDMA, or else
> >>>>>> we're stuck with:
> >>>>>> 
> >>>>>> 	https://tools.ietf.org/html/rfc5667#section-5
> >>>>>> 
> >>>>>> which I think forces us to return READ_PLUS data inline, another
> >>>>>> possible READ_PLUS regression.)
> >>>> 
> >>>> Btw, if I understand this correctly:
> >>>> 
> >>>> Without a spec update, a large NFS READ_PLUS reply would be returned
> >>>> in a reply list, which is moved via RDMA WRITE, just like READ
> >>>> replies.
> >>>> 
> >>>> The difference is NFS READ payload is placed directly into the
> >>>> client’s page cache by the adapter. With a reply list, the client
> >>>> transport would need to copy the returned data into the page cache.
> >>>> And a large reply buffer would be needed.
> >>>> 
> >>>> So, slower, yes. But not inline.
> >>> 
> >>> I'm not very good at this, bear with me, but: the above-referenced
> >>> section doesn't talk about "reply lists", only "write lists", and only
> >>> explains how to use write lists for READ and READLINK data, and
> >>> seems to expect everything else to be sent inline.
> >> 
> >> I may have some details wrong, but this is my understanding.
> >> 
> >> Small replies are sent inline. There is a size maximum for inline
> >> messages, however. I guess 5667 section 5 assumes this context, which
> >> appears throughout RFC 5666.
> >> 
> >> If an expected reply exceeds the inline size, then a client will
> >> set up a reply list for the server. A memory region on the client is
> >> registered as a target for RDMA WRITE operations, and the co-ordinates
> >> of that region are sent to the server in the RPC call.
> >> 
> >> If the server finds the reply will indeed be larger than the inline
> >> maximum, it plants the reply in the client memory region described by
> >> the request’s reply list, and repeats the co-ordinates of that region
> >> back to the client in the RPC reply.
> >> 
> >> A server may also choose to send a small reply inline, even if the
> >> client provided a reply list. In that case, the server does not
> >> repeat the reply list in the reply, and the full reply appears
> >> inline.
> >> 
> >> Linux registers part of the RPC reply buffer for the reply list. After
> >> it is received on the client, the reply payload is copied by the client
> >> CPU to its final destination.
> >> 
> >> Inline and reply list are the mechanisms used when the upper layer
> >> has some processing to do to the incoming data (eg READDIR). When
> >> a request just needs raw data to be simply dropped off in the client’s
> >> memory, then the write list is preferred. A write list is basically a
> >> zero-copy I/O.
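
The decision described above can be modelled roughly as follows; this
is a standalone illustration with invented names, not svcrdma code:

#include <stdbool.h>
#include <stddef.h>

enum reply_mech {
	REPLY_INLINE,		/* whole reply carried in the RDMA SEND payload */
	REPLY_WRITE_LIST,	/* bulk data RDMA WRITEs into client-provided chunks */
	REPLY_CHUNK,		/* entire large reply RDMA WRITEs into the reply buffer */
};

/* Sketch: which mechanism a server might pick for an RPC reply. */
static enum reply_mech choose_reply_mech(size_t reply_len, size_t inline_max,
					 bool bulk_data_payload,
					 bool client_sent_write_list,
					 bool client_sent_reply_chunk)
{
	if (reply_len <= inline_max)
		return REPLY_INLINE;		/* small reply: send it inline */
	if (bulk_data_payload && client_sent_write_list)
		return REPLY_WRITE_LIST;	/* zero-copy placement of the payload */
	if (client_sent_reply_chunk)
		return REPLY_CHUNK;		/* client CPU copies to final destination */
	return REPLY_INLINE;			/* no chunks offered: reply must fit inline */
}
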
> > 
> > The term "reply list" doesn't appear in either RFC.  I believe you mean
> > "client-posted write list" in most of the above, except this last
> > paragraph, which should have started with "Inline and server-posted read list...”  ?
> 
> No, I meant “reply list.” Definitely not read list.
> 
> The terms used in the RFCs and the implementations vary,

OK.  Would you mind defining the term "reply list" for me?  Google's not
helping.

--b.

> unfortunately, and only the read list is an actual list. The write and
> reply lists are actually two separate counted arrays that are both
> expressed using xdr_write_list.
> 
> Have a look at RFC 5666, section 5.2, where it is referred to as
> either a “long reply” or a “reply chunk.”
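
For reference, these on-the-wire forms are declared roughly like this
in the Linux tree (paraphrased from include/linux/sunrpc/rpc_rdma.h;
the header in the tree is authoritative). The same counted-array
encoding serves both the write list and the reply chunk, which is
where the terminology gets blurry:

#include <linux/types.h>

struct rpcrdma_segment {
	__be32 rs_handle;	/* registered memory handle (R_key) */
	__be32 rs_length;	/* length of the segment in bytes */
	__be64 rs_offset;	/* segment virtual address or offset */
};

struct rpcrdma_read_chunk {	/* read list: a true linked list on the wire */
	__be32 rc_discrim;	/* 1 indicates another entry follows */
	__be32 rc_position;	/* position in the XDR stream */
	struct rpcrdma_segment rc_target;
};

struct rpcrdma_write_chunk {
	struct rpcrdma_segment wc_target;
};

struct rpcrdma_write_array {	/* used for both the write list and the reply chunk */
	__be32 wc_discrim;	/* 1 indicates the array is present */
	__be32 wc_nchunks;	/* count of chunks in the array */
	struct rpcrdma_write_chunk wc_array[];
};
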
> 
> >> But these choices are fixed by the specified RPC/RDMA binding of the
> >> upper layer protocol (that’s what RFC 5667 is). NFS READ and READLINK
> >> are the only NFS operations allowed to use a write list. (NFSv4
> >> compounds are somewhat ambiguous, and that too needs to be addressed).
> >> 
> >> As READ_PLUS conveys both kinds of data (zero-copy and data that
> >> might require some processing) IMO RFC 5667 does not provide adequate
> >> guidance about how to convey READ_PLUS. It will need to be added
> >> somewhere.
> > 
> > OK, good.  I wonder how it would do this.  The best the client could do,
> > I guess, is provide the same write list it would for a READ of the same
> > extent.  Could the server then write just the pieces of that extent it
> > needs to, send the hole information inline, and leave it to the client
> > to do any necessary zeroing?  (And is any of this worth it?)
> 
> Conveying large data payloads using zero-copy techniques should be
> beneficial.
> 
> Since hole information could appear in a reply list if it were large,
> and thus would not be inline, technically speaking, the best we can
> say is that hole information wouldn’t be eligible for the write list.
> 
> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
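
A hypothetical sketch of the client side of the scheme discussed above,
in which the server RDMA-writes only the data pieces and returns hole
information inline, leaving any zeroing to the client. The types and
names are invented for illustration and are not sunrpc code:

#include <string.h>
#include <stdint.h>
#include <stddef.h>

struct hole_segment {
	uint64_t offset;	/* byte offset within the requested range */
	uint64_t length;	/* length of the hole in bytes */
};

/* Zero-fill the hole segments of a READ_PLUS reply buffer. */
static void zero_fill_holes(char *buf, size_t buflen,
			    const struct hole_segment *holes, size_t nholes)
{
	for (size_t i = 0; i < nholes; i++) {
		uint64_t off = holes[i].offset;
		uint64_t len = holes[i].length;

		if (off >= buflen)
			continue;		/* ignore segments past the buffer */
		if (len > buflen - off)
			len = buflen - off;	/* clamp to the buffer */
		memset(buf + off, 0, len);	/* a hole reads back as zeroes */
	}
}
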