Re: why does nfsd write not use splice

"Myklebust, Trond" <Trond.Myklebust@xxxxxxxxxx> · Mon, 17 Jun 2013 11:48:18 +0000

On Jun 17, 2013, at 7:01 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:

> On Sat, 15 Jun 2013 05:09:55 +0000
> "Myklebust, Trond" <Trond.Myklebust@xxxxxxxxxx> wrote:
> 
>>> -----Original Message-----
>>> From: linux-nfs-owner@xxxxxxxxxxxxxxx [mailto:linux-nfs-owner@xxxxxxxxxxxxxxx] On Behalf Of Jeff Layton
>>> Sent: Friday, June 14, 2013 3:22 PM
>>> To: Sandeep Joshi
>>> Cc: J. Bruce Fields; linux-nfs@xxxxxxxxxxxxxxx
>>> Subject: Re: why does nfsd write not use splice
>>> 
>>> On Fri, 14 Jun 2013 17:39:12 +0530
>>> Sandeep Joshi <sanjos100@xxxxxxxxx> wrote:
>>> 
>>>> On Wed, Jun 12, 2013 at 10:16 PM, J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
>>>>> 
>>>>> On Wed, Jun 12, 2013 at 09:51:09PM +0530, Sandeep Joshi wrote:
>>>>>> Splice can be implemented independent of RDMA.  It is supposed to
>>>>>> transfer pages between two file descriptors.  I found some
>>>>>> postings on lkml from
>>>>>> 2006 where Linus says it is quite possible to splice from a socket
>>>>>> to a file.
>>>>>> 
>>>>>> See the paragraph:
>>>>>> " For filesystems, splice support tends to be really easy (both
>>>>>> read and write). For other things, it depends a bit. But unlike
>>>>>> sendfile(), it really is quite possible to splice _from_ a socket
>>>>>> too, not just _to_ a socket. But no, that case hasn't been written yet."
>>>>>> http://yarchive.net/comp/linux/splice.html
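For readers who want to see what splicing *from* a socket looks like in practice, here is a minimal user-space sketch, assuming a Linux kernel whose stream sockets implement splice reads. splice(2) needs a pipe on one side of every call, so the data travels socket -> pipe -> file without passing through a user-space buffer. The helper name splice_sock_to_file() is hypothetical, and since nfsd runs inside the kernel this only illustrates the idea, not how nfsd would do it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: move up to `len` bytes from socket `sock` into file
 * `fd` (at the file's current offset) without copying through user space.
 * The pipe is the in-kernel staging buffer that splice(2) requires.
 * Stops early if the peer closes. Returns bytes written, or -1 on error. */
static ssize_t splice_sock_to_file(int sock, int fd, size_t len)
{
    int p[2];
    size_t total = 0;
    ssize_t ret = 0;

    assert(sock >= 0 && fd >= 0);
    if (pipe(p) < 0)
        return -1;

    while (total < len) {
        /* socket -> pipe */
        ssize_t n = splice(sock, NULL, p[1], NULL, len - total, SPLICE_F_MOVE);
        if (n < 0) {
            ret = -1;
            goto out;
        }
        if (n == 0)
            break;                      /* peer closed the connection */

        /* pipe -> file: drain everything just queued so the pipe never fills */
        while (n > 0) {
            ssize_t m = splice(p[0], NULL, fd, NULL, (size_t)n, SPLICE_F_MOVE);
            if (m <= 0) {
                ret = -1;
                goto out;
            }
            n -= m;
            total += (size_t)m;
        }
    }
    ret = (ssize_t)total;
out:
    close(p[0]);
    close(p[1]);
    return ret;
}
```

Draining the pipe after every socket read keeps a single thread from blocking on a full pipe; a real server would more likely size the pipe with fcntl(F_SETPIPE_SZ) and batch.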
>>>>>> 
>>>>>> Larry McVoy's 1997 proposal for adding splice support to the
>>>>>> kernel can be read at
>>>>>> http://ftp.tux.org/pub/sites/ftp.bitmover.com/pub/splice.ps.gz
>>>>>> 
>>>>>> Perhaps I should have opened this thread on lkml to determine if
>>>>>> splice from socket to file is still feasible.
>>>>> 
>>>>> Right, the thing is, nfsd reads the rpc request from the socket into
>>>>> its own buffers before it parses it.  If you want to move the data
>>>>> directly out of the network buffers into the page cache, then you
>>>>> have to know at what point the write data starts in the
>>>>> request--which I believe will mean doing the xdr parsing (and gss
>>>>> decryption if necessary) as the request comes in off the wire.
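Bruce's point about needing to parse before you know where the write data starts can be made concrete. Below is a minimal sketch that walks an ONC RPC call header and NFSv3 WRITE3args (the RFC 5531 and RFC 1813 layouts) to find the payload offset. Assumptions: the TCP record marker has already been stripped, the arguments are not wrapped by RPCSEC_GSS integrity or privacy, and nfs3_write_data_offset() is a hypothetical name. The offset depends on three variable-length fields (credential, verifier, file handle), so it cannot be known until those bytes have arrived.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* XDR encodes everything in 4-byte units; opaque bodies are zero-padded. */
static size_t xdr_pad4(size_t n) { return (n + 3) & ~(size_t)3; }

/* Read a big-endian 32-bit XDR word. */
static uint32_t xdr_u32(const unsigned char *p)
{
    return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
           (uint32_t)p[2] << 8 | (uint32_t)p[3];
}

/* Hypothetical: given one RPC call message carrying an NFSv3 WRITE, return
 * the byte offset of the write payload and store its length in *data_len.
 * Returns -1 if the buffer is too short to contain it. */
static long nfs3_write_data_offset(const unsigned char *buf, size_t len,
                                   uint32_t *data_len)
{
    size_t off = 6 * 4;      /* xid, msg_type, rpcvers, prog, vers, proc */

    assert(buf != NULL && data_len != NULL);
    for (int i = 0; i < 2; i++) {           /* credential, then verifier */
        if (off + 8 > len)
            return -1;
        /* flavor (4) + body length (4) + padded body */
        off += 8 + xdr_pad4(xdr_u32(buf + off + 4));
    }
    /* WRITE3args.file: nfs_fh3, a variable-length opaque */
    if (off + 4 > len)
        return -1;
    off += 4 + xdr_pad4(xdr_u32(buf + off));
    off += 8 + 4 + 4;        /* offset (uint64), count, stable_how */
    if (off + 4 > len)
        return -1;
    *data_len = xdr_u32(buf + off);         /* opaque data<> length */
    off += 4;
    if (off + *data_len > len)
        return -1;
    return (long)off;
}
```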
>>>>> 
>>>>> That sounds like a lot of work and even if you have someone willing
>>>>> to do the work they'd also need to justify that it's worth it.
>>>>> 
>>>>> RDMA may have some protocol support that simplifies this, I don't know.
>>>>> 
>>>>> --b.
>>>> 
>>>> Hi Bruce,
>>>> 
>>>>> nfsd reads the rpc request from the socket into its own buffers before it
>>>>> parses it.
>>>> 
>>>> I am not intimate with the gss code but do you think the
>>>> svc_rqst->rq_pages[] can be spliced?
>>>> 
>>> 
>>> Probably not in its current form. The problem is one of alignment. You need
>>> to know where the write data actually starts before doing the receive off the
>>> socket, so you can make sure that it ends up in the correct spot in the pages
>>> you're going to splice in.
>>> 
>>> There's also the problem of what to do about WRITE requests that contain
>>> data that isn't page aligned or that's shorter than a page...
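To illustrate the alignment problem, here is a minimal sketch that splits a WRITE of `count` bytes at file offset `off` into a leading partial page, whole pages, and a trailing partial page, assuming a 4096-byte page. PAGE_SZ, struct write_split and split_write() are illustrative names, not nfsd code. Only the whole pages are candidates for being moved into the page cache untouched; the partial pieces would typically have to be merged with the existing page contents.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SZ 4096u   /* assumed page size for illustration */

struct write_split {
    uint64_t head;      /* bytes before the first page boundary */
    uint64_t full;      /* number of whole pages covered */
    uint64_t tail;      /* bytes after the last page boundary */
};

static struct write_split split_write(uint64_t off, uint64_t count)
{
    struct write_split s = {0, 0, 0};
    uint64_t in_page = off % PAGE_SZ;

    assert(off <= UINT64_MAX - count);      /* end offset must not overflow */
    if (in_page != 0) {
        s.head = PAGE_SZ - in_page;
        if (s.head > count)                 /* write ends inside first page */
            s.head = count;
    }
    count -= s.head;
    s.full = count / PAGE_SZ;
    s.tail = count % PAGE_SZ;
    return s;
}
```

For example, split_write(4000, 10000) gives head = 96, full = 2, tail = 1712: only 8192 of the 10000 bytes land on whole pages.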
>> 
>> Finally, there is the minor problem that the data that is actually received by the socket may be encrypted, or may need to be checksummed (krb5i) _before_ you can apply it to the file. That is not a particularly good fit for splice().
>> 
> 
> Encryption certainly can be a problem, but integrity isn't necessarily
> one.
> 
> Basically the idea would be to receive the data off the socket into a
> set of pages and then splice those into the correct spot in the local
> file. In both the privacy and integrity cases, you just have an extra
> step in between. Privacy *may* mean an extra copy too (though some of
> the crypto routines can decrypt data in place), but handling integrity
> shouldn't.
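Jeff's distinction between integrity and privacy can be sketched as follows: an integrity check is a read-only pass over the received pages, so the bytes never move and the same pages could afterwards be handed on as-is. This sketch uses Adler-32 purely as a stand-in; krb5i actually protects the arguments with a keyed GSS-API checksum (a MIC). struct page_frag and verify_frags_in_place() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One contiguous piece of received data, e.g. part of a receive page. */
struct page_frag {
    const unsigned char *base;
    size_t len;
};

/* Running Adler-32 (stand-in checksum); start with adler = 1. */
static uint32_t adler32_update(uint32_t adler, const unsigned char *p, size_t n)
{
    uint32_t a = adler & 0xffff, b = adler >> 16;

    assert(p != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        a = (a + p[i]) % 65521;
        b = (b + a) % 65521;
    }
    return (b << 16) | a;
}

/* Checksum the payload where it already sits; nothing is copied or moved.
 * Returns 1 if it matches `expected`. */
static int verify_frags_in_place(const struct page_frag *f, size_t nfrags,
                                 uint32_t expected)
{
    uint32_t sum = 1;

    for (size_t i = 0; i < nfrags; i++)
        sum = adler32_update(sum, f[i].base, f[i].len);
    return sum == expected;
}
```

Decryption, by contrast, has to write new bytes somewhere, which is where the possible extra copy for privacy comes from.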
> 
> The tricky parts (I think) are determining how to lay out the received
> data into the pages you eventually want to splice into the file before
> you receive that data in, and how to deal with it when the WRITE
> doesn't cover an entire page.
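A minimal user-space sketch of the layout step: assuming the RPC header length has already been determined and the target file offset is known, the payload can be read directly into page-aligned memory starting at file_off % PAGE_SZ, so each buffer page lines up with a page of the file. recv_write_aligned(), read_full() and PAGE_SZ are hypothetical names; the zero-filled bytes around the payload stand in for data a real implementation would have to take from the existing file pages.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SZ 4096u   /* assumed page size for illustration */

/* Read exactly n bytes or fail. */
static int read_full(int fd, void *buf, size_t n)
{
    unsigned char *p = buf;

    while (n > 0) {
        ssize_t r = read(fd, p, n);
        if (r <= 0)
            return -1;
        p += r;
        n -= (size_t)r;
    }
    return 0;
}

/* Receive hdr_len header bytes into hdr, then data_len payload bytes into
 * new page-aligned memory at the in-page offset the data will have in the
 * file. Returns the buffer (caller frees) and sets *first_byte, or NULL. */
static unsigned char *recv_write_aligned(int fd, unsigned char *hdr,
                                         size_t hdr_len, uint64_t file_off,
                                         size_t data_len, size_t *first_byte)
{
    size_t lead = (size_t)(file_off % PAGE_SZ);
    size_t npages = (lead + data_len + PAGE_SZ - 1) / PAGE_SZ;
    void *pages = NULL;

    assert(data_len > 0 && first_byte != NULL);
    if (read_full(fd, hdr, hdr_len) < 0)
        return NULL;
    if (posix_memalign(&pages, PAGE_SZ, npages * PAGE_SZ) != 0)
        return NULL;
    memset(pages, 0, npages * PAGE_SZ);     /* placeholder for old file data */
    if (read_full(fd, (unsigned char *)pages + lead, data_len) < 0) {
        free(pages);
        return NULL;
    }
    *first_byte = lead;
    return pages;
}
```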

Once you've copied the data one time, most of the advantage of splice() is gone, since a copy will then exist in processor cache memory and can be duplicated quickly.

Cheers
  Trond
