Re: [PATCH v2 2/4] NFSD: Add READ_PLUS support for data segments

On Feb 6, 2015, at 2:35 PM, J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:

> On Fri, Feb 06, 2015 at 01:44:15PM -0500, Chuck Lever wrote:
>> 
>> On Feb 6, 2015, at 12:59 PM, J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
>> 
>>> On Fri, Feb 06, 2015 at 12:04:13PM -0500, Chuck Lever wrote:
>>>> 
>>>> On Feb 6, 2015, at 11:46 AM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>>>> 
>>>>> 
>>>>> On Feb 6, 2015, at 11:08 AM, J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
>>>>> 
>>>>>> On Fri, Feb 06, 2015 at 03:54:56AM -0800, Christoph Hellwig wrote:
>>>>>>> On Thu, Feb 05, 2015 at 11:43:46AM -0500, Anna Schumaker wrote:
>>>>>>>>> The problem is that the typical case of all data won't use splice
>>>>>>>>> ever with your patches as the 4.2 client will always send a READ_PLUS.
>>>>>>>>> 
>>>>>>>>> So we'll have to find a way to use it where it helps.  While we might be
>>>>>>>>> able to add some hacks to only use splice for the first segment I guess
>>>>>>>>> we just need to make the splice support generic enough in the long run.
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> I should be able to use splice if I detect that we're only returning a single DATA segment easily enough.
>>>>>>> 
>>>>>>> You could also elect to never return more than one data segment as a
>>>>>>> start:
>>>>>>> 
>>>>>>> In all situations, the
>>>>>>> server may choose to return fewer bytes than specified by the client.
>>>>>>> The client needs to check for this condition and handle the
>>>>>>> condition appropriately.
>>>>>> 
>>>>>> Yeah, I think that was more-or-less what Anna's first attempt did and I
>>>>>> said "what if that means more round trips"?  The client can't anticipate
>>>>>> the short reads so it can't make up for this with parallelism.
>>>>>> 
>>>>>>> But doing any of these for a call that's really just an optimization
>>>>>>> sounds odd.  I'd really like to see an evaluation of the READ_PLUS
>>>>>>> impact on various workloads before offering it.
>>>>>> 
>>>>>> Yes, unfortunately I don't see a way to make this just an obvious win.
>>>>> 
>>>>> I don’t think a “win” is necessary. It simply needs to be no worse than
>>>>> READ for current use cases.
>>>>> 
>>>>> READ_PLUS should be a win for the particular use cases it was
>>>>> designed for (large sparsely-populated datasets). Without a
>>>>> demonstrated benefit I think there’s no point in keeping it.
>>>>> 
>>>>>> (Is there any way we could make it so with better protocol?  Maybe RDMA
>>>>>> could help get the alignment right in multiple-segment cases?  But then
>>>>>> I think there needs to be some sort of language about RDMA, or else
>>>>>> we're stuck with:
>>>>>> 
>>>>>> 	https://tools.ietf.org/html/rfc5667#section-5
>>>>>> 
>>>>>> which I think forces us to return READ_PLUS data inline, another
>>>>>> possible READ_PLUS regression.)
>>>> 
>>>> Btw, if I understand this correctly:
>>>> 
>>>> Without a spec update, a large NFS READ_PLUS reply would be returned
>>>> in a reply list, which is moved via RDMA WRITE, just like READ
>>>> replies.
>>>> 
>>>> The difference is that the NFS READ payload is placed directly into the
>>>> client’s page cache by the adapter. With a reply list, the client
>>>> transport would need to copy the returned data into the page cache.
>>>> And a large reply buffer would be needed.
>>>> 
>>>> So, slower, yes. But not inline.
>>> 
>>> I'm not very good at this, bear with me, but: the above-referenced
>>> section doesn't talk about "reply lists", only "write lists", and only
>>> explains how to use write lists for READ and READLINK data, and
>>> seems to expect everything else to be sent inline.
>> 
>> I may have some details wrong, but this is my understanding.
>> 
>> Small replies are sent inline. There is a size maximum for inline
>> messages, however. I guess 5667 section 5 assumes this context, which
>> appears throughout RFC 5666.
>> 
>> If an expected reply exceeds the inline size, then a client will
>> set up a reply list for the server. A memory region on the client is
>> registered as a target for RDMA WRITE operations, and the co-ordinates
>> of that region are sent to the server in the RPC call.
>> 
>> If the server finds the reply will indeed be larger than the inline
>> maximum, it plants the reply in the client memory region described by
>> the request’s reply list, and repeats the co-ordinates of that region
>> back to the client in the RPC reply.
>> 
>> A server may also choose to send a small reply inline, even if the
>> client provided a reply list. In that case, the server does not
>> repeat the reply list in the reply, and the full reply appears
>> inline.
>> 
>> Linux registers part of the RPC reply buffer for the reply list. After
>> it is received on the client, the reply payload is copied by the client
>> CPU to its final destination.
>> 
>> Inline and reply list are the mechanisms used when the upper layer
>> has some processing to do on the incoming data (e.g., READDIR). When
>> a request just needs raw data to be simply dropped off in the client’s
>> memory, then the write list is preferred. A write list is basically a
>> zero-copy I/O.
> 
> The term "reply list" doesn't appear in either RFC.  I believe you mean
> "client-posted write list" in most of the above, except this last
> paragraph, which should have started with "Inline and server-posted
> read list..."?

No, I meant “reply list.” Definitely not read list.

The terms used in the RFCs and the implementations vary,
unfortunately, and only the read list is an actual list. The write and
reply lists are actually two separate counted arrays that are both
expressed using xdr_write_list.

Have a look at RFC 5666, section 5.2, where it is referred to as
either a “long reply” or a “reply chunk.”
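
To keep the shapes straight, here is a rough C-style sketch of the
three chunk types. The caveat is that the names are mine, not the
RFC’s; see the XDR in RFC 5666 for the real definitions.

#include <stdint.h>

/* One RDMA segment: the co-ordinates of a registered memory region. */
struct rdma_segment {
        uint32_t handle;        /* STag / R_key of the registered region */
        uint32_t length;        /* length of the segment in bytes */
        uint64_t offset;        /* offset or virtual address in the region */
};

/* Read list entry: the read list is a true linked list of these. */
struct read_chunk {
        uint32_t position;              /* offset into the XDR stream */
        struct rdma_segment target;
};

/* Write chunk: a counted array of segments. */
struct write_chunk {
        uint32_t nsegments;
        struct rdma_segment segment[];  /* nsegments entries */
};

/*
 * Conceptually, each RPC/RDMA header then carries:
 *   - a read list:    zero or more read chunks, chained together
 *   - a write list:   zero or more write chunks, as a counted array
 *   - a reply chunk:  zero or one write chunk
 */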

>> But these choices are fixed by the specified RPC/RDMA binding of the
>> upper layer protocol (that’s what RFC 5667 is). NFS READ and READLINK
>> are the only NFS operations allowed to use a write list. (NFSv4
>> compounds are somewhat ambiguous, and that too needs to be addressed).
>> 
>> As READ_PLUS conveys both kinds of data (zero-copy and data that
>> might require some processing), IMO RFC 5667 does not provide adequate
>> guidance about how to convey READ_PLUS. It will need to be added
>> somewhere.
> 
> OK, good.  I wonder how it would do this.  The best the client could do,
> I guess, is provide the same write list it would for a READ of the same
> extent.  Could the server then write just the pieces of that extent it
> needs to, send the hole information inline, and leave it to the client
> to do any necessary zeroing?  (And is any of this worth it?)

Conveying large data payloads using zero-copy techniques should be
beneficial.

Technically speaking, hole information could be returned via the reply
list rather than inline if it were large, so the best we can say is
that hole information wouldn’t be eligible for the write list.
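
To illustrate the split, here is a purely hypothetical sketch; nothing
like this exists in Anna’s patches or in any spec, and the types and
numbers are made up. DATA segments would be candidates for the
client-posted write list, while hole descriptors are tiny, stay
inline, and leave the zeroing to the client.

#include <stdint.h>
#include <stdio.h>

enum segment_type { SEG_DATA, SEG_HOLE };

struct read_plus_segment {
        enum segment_type type;
        uint64_t offset;        /* file offset of the segment */
        uint64_t length;        /* segment length in bytes */
};

int main(void)
{
        /* Example READ_PLUS result: data, a large hole, more data. */
        const struct read_plus_segment reply[] = {
                { SEG_DATA, 0,        131072 },
                { SEG_HOLE, 131072,  4194304 },
                { SEG_DATA, 4325376,  131072 },
        };
        uint64_t write_list_bytes = 0;
        unsigned int inline_holes = 0;

        for (size_t i = 0; i < sizeof(reply) / sizeof(reply[0]); i++) {
                if (reply[i].type == SEG_DATA)
                        write_list_bytes += reply[i].length; /* RDMA Write */
                else
                        inline_holes++;                      /* stays inline */
        }

        printf("write list payload: %llu bytes, hole descriptors inline: %u\n",
               (unsigned long long)write_list_bytes, inline_holes);
        return 0;
}

Almost all of the bytes would move by RDMA Write; only the small hole
descriptors would need XDR decoding on the client.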

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com






