On Jun 17, 2013, at 7:01 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:

> On Sat, 15 Jun 2013 05:09:55 +0000
> "Myklebust, Trond" <Trond.Myklebust@xxxxxxxxxx> wrote:
>
>>> -----Original Message-----
>>> From: linux-nfs-owner@xxxxxxxxxxxxxxx
>>> [mailto:linux-nfs-owner@xxxxxxxxxxxxxxx] On Behalf Of Jeff Layton
>>> Sent: Friday, June 14, 2013 3:22 PM
>>> To: Sandeep Joshi
>>> Cc: J. Bruce Fields; linux-nfs@xxxxxxxxxxxxxxx
>>> Subject: Re: why does nfsd write not use splice
>>>
>>> On Fri, 14 Jun 2013 17:39:12 +0530
>>> Sandeep Joshi <sanjos100@xxxxxxxxx> wrote:
>>>
>>>> On Wed, Jun 12, 2013 at 10:16 PM, J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
>>>>>
>>>>> On Wed, Jun 12, 2013 at 09:51:09PM +0530, Sandeep Joshi wrote:
>>>>>> Splice can be implemented independent of RDMA. It is supposed to
>>>>>> transfer pages between two file descriptors. I found some
>>>>>> postings on lkml from 2006 where Linus says it is quite possible
>>>>>> to splice from a socket to a file.
>>>>>>
>>>>>> See the paragraph:
>>>>>> "For filesystems, splice support tends to be really easy (both
>>>>>> read and write). For other things, it depends a bit. But unlike
>>>>>> sendfile(), it really is quite possible to splice _from_ a socket
>>>>>> too, not just _to_ a socket. But no, that case hasn't been written yet."
>>>>>> http://yarchive.net/comp/linux/splice.html
>>>>>>
>>>>>> Larry McVoy's 1997 proposal for adding splice support to the
>>>>>> kernel can be read at
>>>>>> <http://ftp.tux.org/pub/sites/ftp.bitmover.com/pub/splice.ps.gz>
>>>>>>
>>>>>> Perhaps I should have opened this thread on lkml to determine if
>>>>>> splice from socket to file is still feasible..
>>>>>
>>>>> Right, the thing is, nfsd reads the rpc request from the socket into
>>>>> its own buffers before it parses it. If you want to move the data
>>>>> directly out of the network buffers into the page cache, then you
>>>>> have to know at what point the write data starts in the
>>>>> request--which I believe will mean doing the xdr parsing (and gss
>>>>> decryption if necessary) as the request comes in off the wire.
>>>>>
>>>>> That sounds like a lot of work and even if you have someone willing
>>>>> to do the work they'd also need to justify that it's worth it.
>>>>>
>>>>> RDMA may have some protocol support that simplifies this, I don't know.
>>>>>
>>>>> --b.
>>>>
>>>> Hi Bruce,
>>>>
>>>>> nfsd reads the rpc request from the socket into its own buffers
>>>>> before it parses it.
>>>>
>>>> I am not intimate with the gss code but do you think the
>>>> svc_rqst->rq_pages[] can be spliced ?
>>>>
>>>
>>> Probably not in its current form. The problem is one of alignment. You need
>>> to know where the write data actually starts before doing the receive off the
>>> socket, so you can make sure that it ends up in the correct spot in the pages
>>> you're going to splice in.
>>>
>>> There's also the problem of what to do about WRITE requests that contain
>>> data that isn't page aligned or that's shorter than a page...
>>
>> Finally, there is the minor problem that the data that is actually received
>> by the socket may be encrypted, or may need to be checksummed (krb5i)
>> _before_ you can apply it to the file. That is not a particularly good fit
>> for splice().
>>
>
> Encryption certainly can be a problem, but integrity isn't necessarily
> one.
>
> Basically the idea would be to receive the data off the socket into a
> set of pages and then splice those into the correct spot in the local
> file. In both the privacy and integrity cases, you just have an extra
> step in between. Privacy *may* mean an extra copy too (though some of
> the crypto routines can decrypt data in place), but handling integrity
> shouldn't.
>
> The tricky parts (I think) are determining how to lay out the received
> data into the pages you eventually want to splice into the file before
> you receive that data in, and how to deal with it when the WRITE
> doesn't cover an entire page.

Once you've copied the data one time, most of the advantage of splice()
is gone, since a copy will then exist in processor cache memory and can
be duplicated quickly.

Cheers
  Trond
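
To make the alignment and partial-page concerns raised in this thread concrete, here is a rough userspace sketch of the arithmetic involved. It is not nfsd code: the function names, the `header_len` parameter, and the 4096-byte page size are all assumptions made for illustration. It splits a WRITE's byte range into the whole pages that could in principle be moved into the file and the partial head/tail ranges that would still have to be copied, and it checks whether the payload would land at the right in-page offset in the receive buffer.

```python
PAGE_SIZE = 4096  # assumed page size; real systems may use a different value


def split_write(offset, length, page_size=PAGE_SIZE):
    """Split the file range [offset, offset + length) into (head, full, tail).

    full is the run of whole pages covered by the WRITE (candidates for
    moving pages rather than copying); head and tail are the partial-page
    ranges at either end. Each is a (start, end) tuple or None.
    If no whole page is covered, the entire range is returned as head.
    """
    end = offset + length
    first_full = -(-offset // page_size) * page_size  # round offset up
    last_full = (end // page_size) * page_size        # round end down
    if first_full >= last_full:
        # The WRITE does not cover any complete page.
        return (offset, end), None, None
    head = (offset, first_full) if offset < first_full else None
    tail = (last_full, end) if last_full < end else None
    return head, (first_full, last_full), tail


def payload_lands_aligned(header_len, offset, page_size=PAGE_SIZE):
    """True if payload starting header_len bytes into a page-aligned
    receive buffer sits at the same in-page offset as its file offset.

    header_len is a hypothetical stand-in for everything that precedes
    the WRITE data in the request; its value is only known after parsing.
    """
    return header_len % page_size == offset % page_size


if __name__ == "__main__":
    print(split_write(100, 10000))
    print(payload_lands_aligned(100, 0))
```

The second function is the crux of the objection above: the header length is only known once the request has been parsed, which is after the data has already been received into the buffer.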