On Feb 6, 2015, at 11:46 AM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:

> On Feb 6, 2015, at 11:08 AM, J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
> 
>> On Fri, Feb 06, 2015 at 03:54:56AM -0800, Christoph Hellwig wrote:
>>> On Thu, Feb 05, 2015 at 11:43:46AM -0500, Anna Schumaker wrote:
>>>>> The problem is that the typical case of all data won't use splice
>>>>> ever with your patches as the 4.2 client will always send a READ_PLUS.
>>>>>
>>>>> So we'll have to find a way to use it where it helps. While we might be
>>>>> able to add some hacks to only use splice for the first segment I guess
>>>>> we just need to make the splice support generic enough in the long run.
>>>>
>>>> I should be able to use splice if I detect that we're only returning a
>>>> single DATA segment easily enough.
>>>
>>> You could also elect to never return more than one data segment as a
>>> start:
>>>
>>>    In all situations, the server may choose to return fewer bytes than
>>>    specified by the client. The client needs to check for this
>>>    condition and handle the condition appropriately.
>>
>> Yeah, I think that was more-or-less what Anna's first attempt did and I
>> said "what if that means more round trips"? The client can't anticipate
>> the short reads so it can't make up for this with parallelism.
>>
>>> But doing any of these for a call that's really just an optimization
>>> sounds odd. I'd really like to see an evaluation of the READ_PLUS
>>> impact on various workloads before offering it.
>>
>> Yes, unfortunately I don't see a way to make this just an obvious win.
> 
> I don’t think a “win” is necessary. It simply needs to be no worse than
> READ for current use cases.
> 
> READ_PLUS should be a win for the particular use cases it was designed
> for (large sparsely-populated datasets). Without a demonstrated benefit
> I think there’s no point in keeping it.
> 
>> (Is there any way we could make it so with better protocol? Maybe RDMA
>> could help get the alignment right in multiple-segment cases? But then
>> I think there needs to be some sort of language about RDMA, or else
>> we're stuck with:
>>
>> https://tools.ietf.org/html/rfc5667#section-5
>>
>> which I think forces us to return READ_PLUS data inline, another
>> possible READ_PLUS regression.)

Btw, if I understand this correctly: without a spec update, a large NFS
READ_PLUS reply would be returned in a reply list, which is moved via
RDMA WRITE, just like READ replies. The difference is that an NFS READ
payload is placed directly into the client’s page cache by the adapter.
With a reply list, the client transport would need to copy the returned
data into the page cache, and a large reply buffer would be needed. So,
slower, yes. But not inline.

> NFSv4.2 currently does not have a binding to RPC/RDMA.

Right, this means a spec update is needed. I agree with you, and it’s on
our list.

> It’s hard to say at this point what a READ_PLUS on RPC/RDMA might look
> like.
> 
> RDMA clearly provides no advantage for moving a pattern that a client
> must re-inflate into data itself. I can guess that only the
> CONTENT_DATA case is interesting for RDMA bulk transfers.
> 
> But don’t forget that NFSv4.1 and later don’t yet work over RDMA,
> thanks to missing support for bi-directional RPC/RDMA. I wouldn’t worry
> about special cases for it at this point.
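
Just to make the single-segment point above concrete, here is a rough
user-space sketch (not nfsd code; the helper name and structure are made
up for illustration) of how a server might decide that a requested range
is one contiguous data segment and therefore still eligible for the
splice path. It assumes the segment boundaries would come from
SEEK_DATA/SEEK_HOLE:

#define _GNU_SOURCE
#include <unistd.h>

/*
 * Return 1 if [offset, offset + len) of fd contains data only, with no
 * holes, i.e. a READ_PLUS reply for this range would carry a single
 * CONTENT_DATA segment and could go through the splice path.
 */
static int range_is_single_data_segment(int fd, off_t offset, off_t len)
{
	off_t data, hole;

	data = lseek(fd, offset, SEEK_DATA);
	if (data == (off_t)-1)
		return 0;		/* ENXIO: nothing but hole/EOF here */
	if (data != offset)
		return 0;		/* the range starts inside a hole */

	hole = lseek(fd, offset, SEEK_HOLE);
	if (hole == (off_t)-1)
		return 0;

	return hole >= offset + len;	/* no hole before the end of the range */
}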
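
And the "re-inflate into data" step on the client side amounts to
something like the following. This is only an illustration (the struct
and the names are invented here and do not match the Linux client's
actual XDR decoding), but it shows why a hole pattern is not interesting
for RDMA bulk transfer: the client materializes the zeroes itself, and
only CONTENT_DATA carries bytes worth moving:

#include <stdint.h>
#include <string.h>

enum seg_type { CONTENT_DATA, CONTENT_HOLE };

/* One decoded READ_PLUS segment (hypothetical layout). */
struct read_plus_seg {
	enum seg_type	 type;
	uint64_t	 offset;	/* file offset covered by this segment */
	uint64_t	 length;	/* bytes covered by this segment */
	const void	*data;		/* valid only for CONTENT_DATA */
};

/* Flatten the decoded segments into the caller's buffer, which starts
 * at file offset buf_offset. Data segments are copied; hole segments
 * are expanded to zeroes by the client itself. */
static void inflate_segments(char *buf, uint64_t buf_offset,
			     const struct read_plus_seg *segs, int nsegs)
{
	int i;

	for (i = 0; i < nsegs; i++) {
		char *dst = buf + (segs[i].offset - buf_offset);

		if (segs[i].type == CONTENT_DATA)
			memcpy(dst, segs[i].data, segs[i].length);
		else
			memset(dst, 0, segs[i].length);
	}
}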

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com