Re: [PATCH v2 2/4] NFSD: Add READ_PLUS support for data segments

On Feb 6, 2015, at 12:59 PM, J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:

> On Fri, Feb 06, 2015 at 12:04:13PM -0500, Chuck Lever wrote:
>> 
>> On Feb 6, 2015, at 11:46 AM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>> 
>>> 
>>> On Feb 6, 2015, at 11:08 AM, J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
>>> 
>>>> On Fri, Feb 06, 2015 at 03:54:56AM -0800, Christoph Hellwig wrote:
>>>>> On Thu, Feb 05, 2015 at 11:43:46AM -0500, Anna Schumaker wrote:
>>>>>>> The problem is that the typical case of all data won't use splice
>>>>>>> ever with your patches, as the 4.2 client will always send a READ_PLUS.
>>>>>>> 
>>>>>>> So we'll have to find a way to use it where it helps.  While we might be
>>>>>>> able to add some hacks to only use splice for the first segment I guess
>>>>>>> we just need to make the splice support generic enough in the long run.
>>>>>>> 
>>>>>> 
>>>>>> I should be able to use splice easily enough if I detect that we're only returning a single DATA segment.
>>>>> 
>>>>> You could also elect to never return more than one data segment as a
>>>>> start:
>>>>> 
>>>>> In all situations, the
>>>>> server may choose to return fewer bytes than specified by the client.
>>>>> The client needs to check for this condition and handle the
>>>>> condition appropriately.
>>>> 
>>>> Yeah, I think that was more-or-less what Anna's first attempt did and I
>>>> said "what if that means more round trips"?  The client can't anticipate
>>>> the short reads so it can't make up for this with parallelism.
>>>> 
>>>>> But doing any of these for a call that's really just an optimization
>>>>> sounds odd.  I'd really like to see an evaluation of the READ_PLUS
>>>>> impact on various workloads before offering it.
>>>> 
>>>> Yes, unfortunately I don't see a way to make this just an obvious win.
>>> 
>>> I don’t think a “win” is necessary. It simply needs to be no worse than
>>> READ for current use cases.
>>> 
>>> READ_PLUS should be a win for the particular use cases it was
>>> designed for (large sparsely-populated datasets). Without a
>>> demonstrated benefit I think there’s no point in keeping it.
>>> 
>>>> (Is there any way we could make it so with better protocol?  Maybe RDMA
>>>> could help get the alignment right in multiple-segment cases?  But then
>>>> I think there needs to be some sort of language about RDMA, or else
>>>> we're stuck with:
>>>> 
>>>> 	https://tools.ietf.org/html/rfc5667#section-5
>>>> 
>>>> which I think forces us to return READ_PLUS data inline, another
>>>> possible READ_PLUS regression.)
>> 
>> Btw, if I understand this correctly:
>> 
>> Without a spec update, a large NFS READ_PLUS reply would be returned
>> in a reply list, which is moved via RDMA WRITE, just like READ
>> replies.
>> 
>> The difference is NFS READ payload is placed directly into the
>> client’s page cache by the adapter. With a reply list, the client
>> transport would need to copy the returned data into the page cache.
>> And a large reply buffer would be needed.
>> 
>> So, slower, yes. But not inline.
> 
> I'm not very good at this, bear with me, but: the above-referenced
> section doesn't talk about "reply lists", only "write lists", and only
> explains how to use write lists for READ and READLINK data, and seems to expect everything else to be sent inline.

I may have some details wrong, but this is my understanding.

Small replies are sent inline. There is a size maximum for inline
messages, however. I guess RFC 5667 section 5 assumes this context,
which is established throughout RFC 5666.

If an expected reply exceeds the inline size, then a client will
set up a reply list for the server. A memory region on the client is
registered as a target for RDMA WRITE operations, and the co-ordinates
of that region are sent to the server in the RPC call.

If the server finds the reply will indeed be larger than the inline
maximum, it plants the reply in the client memory region described by
the request’s reply list, and repeats the co-ordinates of that region
back to the client in the RPC reply.
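
To make those co-ordinates concrete, here is a rough C rendering of
an RPC/RDMA chunk segment and a reply chunk, based on the XDR in RFC
5666. The struct and field names are illustrative only, not taken
from the Linux sources:

/* Rough C rendering of an RPC/RDMA chunk segment (see the XDR in
 * RFC 5666).  Names are illustrative, not from the Linux sources.
 */
#include <stdint.h>

struct rdma_segment {
	uint32_t handle;	/* R_key/STag of the registered region */
	uint32_t length;	/* size of the region in bytes */
	uint64_t offset;	/* remote virtual address of the region */
};

/*
 * A reply chunk is a counted array of such segments carried in the
 * transport header of the RPC call.  The server RDMA WRITEs the
 * reply into the described region(s) and echoes the segments, with
 * the lengths actually written, in the transport header of the reply.
 */
struct rdma_reply_chunk {
	uint32_t num_segments;
	struct rdma_segment segments[];
};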

A server may also choose to send a small reply inline, even if the
client provided a reply list. In that case, the server does not
repeat the reply list in the reply, and the full reply appears
inline.

Linux registers part of the RPC reply buffer for the reply list. After
the reply is received on the client, the payload is copied by the
client CPU to its final destination.
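
Purely as an illustration (this is not the actual xprtrdma code), that
pull-up amounts to something like the sketch below; the function name
and calling convention are mine:

/* Illustrative sketch only, not the actual xprtrdma receive path:
 * the reply chunk lands in a flat, pre-registered reply buffer, and
 * the client CPU then copies the payload into the XDR buffer's
 * pages.  This is the extra data copy that a write list would avoid.
 * The caller supplies enough pages to hold payload_len bytes.
 */
#include <linux/highmem.h>
#include <linux/kernel.h>
#include <linux/string.h>

static void copy_reply_into_pages(struct page **pages,
				  const char *reply_buf, size_t payload_len)
{
	size_t copied = 0;

	while (copied < payload_len) {
		size_t chunk = min_t(size_t, payload_len - copied, PAGE_SIZE);
		char *dst = kmap_atomic(pages[copied >> PAGE_SHIFT]);

		memcpy(dst, reply_buf + copied, chunk);
		kunmap_atomic(dst);
		copied += chunk;
	}
}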

Inline and reply list are the mechanisms used when the upper layer
has some processing to do on the incoming data (e.g. READDIR). When
a request simply needs raw data dropped off in the client’s memory,
the write list is preferred. A write list is basically a zero-copy
I/O.

But these choices are fixed by the specified RPC/RDMA binding of the
upper layer protocol (that’s what RFC 5667 is). NFS READ and READLINK
are the only NFS operations allowed to use a write list. (NFSv4
compounds are somewhat ambiguous, and that too needs to be addressed).
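
In other words, RFC 5667’s binding rule boils down to something like
this hypothetical check (the helper is mine, purely for illustration;
OP_READ and OP_READLINK are the NFSv4 opcode constants):

/* Hypothetical helper, only to illustrate the RFC 5667 binding rule:
 * only results that are opaque bulk data the client accepts as-is
 * (READ, READLINK) may be moved via the write list; everything else
 * is returned inline or via the reply list.
 */
#include <linux/nfs4.h>
#include <linux/types.h>

static bool result_may_use_write_list(u32 opnum)
{
	switch (opnum) {
	case OP_READ:		/* raw file data, placed directly in the
				 * client's page cache by the adapter */
	case OP_READLINK:	/* symlink target, also opaque bytes */
		return true;
	default:
		return false;	/* inline or reply list */
	}
}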

As READ_PLUS conveys both kinds of data (zero-copy payload and data
that might require some processing), IMO RFC 5667 does not provide
adequate guidance about how to convey READ_PLUS. That guidance will
need to be added somewhere.
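
For reference, the READ_PLUS result in the NFSv4.2 draft is shaped
roughly like the C below (simplified; the names only approximate the
draft’s XDR, and content types other than DATA and HOLE are omitted).
The point is that a single reply can interleave segments the adapter
could place directly with segments the client has to process itself,
and RFC 5667 says nothing about how to split that between a write
list and the inline/reply-list path.

/* Rough, simplified C rendering of a READ_PLUS result per the
 * NFSv4.2 draft; names approximate the draft's XDR, and content
 * types other than DATA and HOLE are omitted.
 */
#include <stdbool.h>
#include <stdint.h>

enum read_plus_content_type {
	CONTENT_DATA = 0,	/* offset plus opaque bytes: a zero-copy
				 * candidate, like a plain READ payload */
	CONTENT_HOLE = 1,	/* offset plus length only: the client
				 * must materialize the zeroes itself */
};

struct read_plus_segment {
	enum read_plus_content_type type;
	uint64_t offset;
	uint64_t length;
	const void *data;	/* valid only for CONTENT_DATA */
};

struct read_plus_result {
	bool eof;
	uint32_t segment_count;
	struct read_plus_segment *segments;	/* mixed segment types */
};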

>>> NFSv4.2 currently does not have a binding to RPC/RDMA.
>> 
>> Right, this means a spec update is needed. I agree with you, and
>> it’s on our list.
> 
> OK, so that would go in some kind of update to 5667 rather than in the
> minor version 2 spec?

The WG has to decide whether an update to 5667 or a new document will
be the ultimate vehicle.

> Discussing this in the READ_PLUS description would also seem helpful to
> me, but OK I don't really have a strong opinion.

If there is a precedent, it’s probably that the RPC/RDMA binding is
specified in a separate document. I suspect there won’t be much
appetite for holding up NFSv4.2 for an RPC/RDMA binding.


> --b.
> 
>> 
>>> It’s hard to
>>> say at this point what a READ_PLUS on RPC/RDMA might look like.
>>> 
>>> RDMA clearly provides no advantage for moving a pattern that a
>>> client must re-inflate into data itself. I can guess that only the
>>> CONTENT_DATA case is interesting for RDMA bulk transfers.
>>> 
>>> But don’t forget that NFSv4.1 and later don’t yet work over RDMA,
>>> thanks to missing support for bi-directional RPC/RDMA. I wouldn’t
>>> worry about special cases for it at this point.
>>> 
>>> --
>>> Chuck Lever
>>> chuck[dot]lever[at]oracle[dot]com
>>> 
>>> 
>>> 
>> 
>> --
>> Chuck Lever
>> chuck[dot]lever[at]oracle[dot]com
>> 
>> 

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com


