On Wed, Oct 19, 2016 at 02:28:40PM -0400, Chuck Lever wrote:
> 
> > On Oct 19, 2016, at 1:26 PM, bfields@xxxxxxxxxxxx wrote:
> > 
> > On Wed, Oct 12, 2016 at 01:42:26PM -0400, Chuck Lever wrote:
> >> I'm studying the way that the ->recvfrom and ->sendto calls work
> >> for RPC-over-RDMA.
> >> 
> >> The ->sendto path moves pages out of the svc_rqst before posting
> >> I/O (RDMA Write and Send). Once the work is posted, ->sendto
> >> returns, and it looks like the svc_rqst is released at that point.
> >> The subsequent completion of the Send then releases those moved
> >> pages.
> >> 
> >> I'm wondering if the transport can be simplified: instead of
> >> moving pages around, ->sendto could just wait until the Write and
> >> Send activity is complete, then return. The upper layer then
> >> releases everything.
> > 
> > I don't understand what problem you're trying to fix. Is "moving
> > pages around" really especially complicated or expensive?
> 
> Moving pages is complicated and undocumented, and adds host CPU and
> memory costs (loads and stores). There is also a whole lot of page
> allocator traffic in this code, and that will be enabling and
> disabling interrupts and bottom-halves with some frequency.

I thought "moving pages around" here is basically just this, from
svc_rdma_sendto.c:send_reply():

	pages = rqstp->rq_next_page - rqstp->rq_respages;
	for (page_no = 0; page_no < pages; page_no++) {
		ctxt->pages[page_no+1] = rqstp->rq_respages[page_no];
		...
		rqstp->rq_respages[page_no] = NULL;
		...
	}

So we're just copying an array of page pointers from one place to
another, and zeroing out the source array.

For a short reply that could be 1 page or even none. In the worst case
(a 1MB read result) that could be 256 8-byte pointers, so 2K. (A small
standalone sketch of this handoff pattern is appended at the end of
this message.)

Am I missing something? Has that up-to-2K operation been a problem?

--b.

> 
> The less involvement by the host CPU the better. The current RDMA
> transport code documents the RPC-over-RDMA protocol to some extent,
> but the need for moving the pages is left to the reader's
> imagination.
> 
> From my code audit:
> 
> The svc_rqst is essentially a static piece of memory, and the calling
> NFSD kthread will re-use it immediately when svc_sendto() returns. It
> cannot be manipulated asynchronously. The recvfrom calls and the
> sendto call for one RPC request can each use a different svc_rqst,
> for example. So if the transport needs to use pages for longer than
> the length of one svc_sendto call (for example, to wait for RDMA
> Write to complete) then it has to move those pages out of the
> svc_rqst before it returns. (Let me know if I've got this wrong.)
> 
> I'm falling back to looking at ways to clean up and document the
> movement of these pages so it is better understood why it is done.
> There could also be some opportunity to share page movement helpers
> between the RDMA read and write paths (and perhaps the TCP transport
> too).
> 
> HTH.
> 
> 
> > --b.
> > 
> >> 
> >> Another option would be for ->sendto to return a value that means
> >> the transport will release the svc_rqst and pages.
> >> 
> >> Or, the svc_rqst could be reference counted.
> >> 
> >> Anyone have thoughts about this?
> 
> 
> --
> Chuck Lever
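
For what it's worth, here is a minimal, self-contained userspace sketch
of the pointer-handoff pattern discussed above. The names (fake_rqst,
fake_ctxt, hand_off_pages) are made up for illustration and are not the
kernel code; the point is just that the "move" is a copy of up to 256
page pointers into a context that outlives the call, plus zeroing of
the source array so the caller can reuse its structure immediately:

	/* Illustrative only: hypothetical names, not svc_rdma code. */
	#include <stdio.h>
	#include <stdlib.h>

	#define MAX_PAGES 256	/* worst case: 1MB reply with 4KB pages */

	struct fake_page { char data[4096]; };

	struct fake_rqst {	/* stands in for svc_rqst */
		struct fake_page *respages[MAX_PAGES];
		int nr_respages;
	};

	struct fake_ctxt {	/* stands in for the send context */
		/* pages[0] left free, mimicking the slot the real code
		 * reserves for the reply header page */
		struct fake_page *pages[MAX_PAGES + 1];
		int count;
	};

	/* Copy the page pointers into the ctxt and zero the source, so
	 * the rqst can be reused at once while the ctxt keeps the pages
	 * alive until "send completion". */
	static void hand_off_pages(struct fake_rqst *rqstp,
				   struct fake_ctxt *ctxt)
	{
		int page_no;

		for (page_no = 0; page_no < rqstp->nr_respages; page_no++) {
			ctxt->pages[page_no + 1] = rqstp->respages[page_no];
			rqstp->respages[page_no] = NULL;
		}
		ctxt->count = rqstp->nr_respages + 1;
		rqstp->nr_respages = 0;
	}

	int main(void)
	{
		struct fake_rqst rqst = { .nr_respages = 2 };
		struct fake_ctxt ctxt = { .count = 0 };
		int i;

		rqst.respages[0] = malloc(sizeof(struct fake_page));
		rqst.respages[1] = malloc(sizeof(struct fake_page));

		hand_off_pages(&rqst, &ctxt);

		/* rqst could be reused here; later, on the simulated
		 * completion, the ctxt releases the pages */
		for (i = 1; i < ctxt.count; i++)
			free(ctxt.pages[i]);

		printf("handed off %d page pointers (%zu bytes copied)\n",
		       ctxt.count - 1,
		       (ctxt.count - 1) * sizeof(struct fake_page *));
		return 0;
	}

At MAX_PAGES entries that copy is 256 * 8 bytes = 2KB of pointer data,
which is the worst-case figure mentioned above.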