On Jul 24, 2015, at 12:26 PM, Jason Gunthorpe <jgunthorpe@xxxxxxxxxxxxxxxxxxxx> wrote:

> On Fri, Jul 24, 2015 at 10:36:07AM -0400, Chuck Lever wrote:
>
>> Unfinished, but operational:
>>
>> http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=shortlog;h=refs/heads/nfs-rdma-future
>
> Nice..
>
> Can you spend some time and reflect on how some of this could be
> lowered into the core code?

The point of the prototype is to start thinking about this with
actual data. :-) So I’m with you.

> The FMR and FRWR side have many
> similarities now..
>
>> FRWR is seeing a 10-15% throughput reduction with 8-thread dbench,
>> but a 5% improvement with 16-thread fio IOPS. 4K and 8K direct
>> read and write are negatively impacted.
>
> I'm not surprised since invalidate is sync. I believe you need to
> incorporate SEND WITH INVALIDATE to substantially recover this
> overhead.

I tried to find another kernel ULP using SEND WITH INVALIDATE, but I
didn’t see one. I assume you mean the NFS server would use this WR
when replying, to knock down the RPC’s client MRs remotely?

> It would be neat if the RQ could continue to advance while waiting for
> the invalidate.. That looks almost doable..

The new reply handling work queue is not restricted to serial reply
processing. Unlike the tasklet model, multiple RPC replies can be
processed at once, and they can run across all CPUs. The tasklet was
global, shared across all RPC/RDMA receive queues on that client.
AFAICT there is very little else that is shared between RPC replies.

I think using a work queue instead may be a tiny bit slower for each
RPC (perhaps due to additional context switching), but it will allow
much better scaling with the number of transports and mount points
the client creates.

I may not have understood your comment.

>> I converted the RPC reply handler tasklet to a work queue context
>> to allow sleeping.
>> A new .ro_unmap_sync method is invoked after
>> the RPC/RDMA header is parsed but before xprt_complete_rqst()
>> wakes up the waiting RPC.
>
> .. so the issue is the RPC must be substantially parsed to learn which
> MR it is associated with to schedule the invalidate?

Only the RPC/RDMA header has to be parsed, but yes. The needed
parsing is handled in rpcrdma_reply_handler right before the
.ro_unmap_sync call. Parsing of the RPC reply results is then done
by the upper layer once xprt_complete_rqst() has run.

>> This is actually much more efficient than the current logic,
>> which serially does an ib_unmap_fmr() for each MR the RPC owns.
>> So FMR overall performs better with this change.
>
> Interesting..
>
>> Because the next RPC cannot awaken until the last send completes,
>> send queue accounting is based on RPC/RDMA credit flow control.
>
> So for FRWR the sync invalidate effectively guarantees all SQEs
> related to this RPC are flushed. That seems reasonable; if the number
> of SQEs and CQEs are properly sized in relation to the RPC slot count
> it should be workable..

Yes, both queues are sized in rpcrdma_ep_create() according to the
slot count / credit limit.

> How does FMR and PHYS synchronize?

We still rely on timing there. The RPC's send buffer may be
re-allocated by the next RPC if that RPC wants to send a bigger
request than this one did. Thus there is still a tiny but non-zero
risk that the HCA may not be done with that send buffer. Closing
that hole is still on my to-do list.

>> I’m sure there are some details here that still need to be
>> addressed, but this fixes the big problem with FRWR send queue
>> accounting, which was that LOCAL_INV WRs would continue to
>> consume SQEs while another RPC was allowed to start.
>
> Did you test without that artificial limit you mentioned before?

Yes. No problems now; the limit is removed in the last patch in
that series.
> I'm also wondering about this:
>
>> During some other testing I found that when a completion upcall
>> returns to the provider leaving CQEs still on the completion queue,
>> there is a non-zero probability that a completion will be lost.
>
> What does lost mean?

Lost means a WC in the CQ is skipped by ib_poll_cq(). In other words,
I expected that during the next upcall, ib_poll_cq() would return WCs
that were not processed, starting with the last one on the CQ when my
upcall handler returned.

I found this by intentionally having the completion handler process
only one or two WCs and then return.

> The CQ is edge triggered, so if you don't drain it you might not get
> another timely CQ callback (which is bad), but CQEs themselves should
> not be lost.

I’m not sure I fully understand this problem; it might even be my
misunderstanding of how ib_poll_cq() works. But forcing the completion
upcall handler to completely drain the CQ during each upcall prevents
the issue.

(Note: I don’t think fixing this is a pre-requisite for the
synchronous invalidate work; it just happened to be in the patch
queue.)

--
Chuck Lever