On Jul 23, 2015, at 2:53 PM, Jason Gunthorpe <jgunthorpe@xxxxxxxxxxxxxxxxxxxx> wrote:

> On Thu, Jul 23, 2015 at 07:59:48PM +0300, Sagi Grimberg wrote:
>> I don't mean to be negative about your ideas, I just don't think that
>> doing all the work in the drivers is going to get us to a better place.
>
> No worries, I'm hoping someone can put the pieces together and figure
> out how to code share all the duplication we seem to have in the ULPs.
>
> The more I've looked at them, the more it seems like they get basic
> things wrong, like SQE accounting in NFS, DMA flush ordering in NFS,

I have a work-in-progress prototype that addresses both of these
issues. Unfinished, but operational:

http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=shortlog;h=refs/heads/nfs-rdma-future

Having this should give us time to analyze the performance impact of
these changes, and to dial in an approach that aligns with the unified
APIs that you and Sagi have been discussing.

FRWR is seeing a 10-15% throughput reduction with 8-thread dbench, but
a 5% improvement with 16-thread fio IOPS. 4K and 8K direct read and
write are negatively impacted. I don't see any significant change in
client CPU utilization, but I have not yet examined changes in
interrupt workload, nor have I done any spin lock or CPU bus traffic
analysis. But none of this is as bad as I feared it could be, and
there are plenty of other areas that can recoup some or all of this
loss eventually.

I converted the RPC reply handler tasklet to a work queue context to
allow sleeping. A new .ro_unmap_sync method is invoked after the
RPC/RDMA header is parsed but before xprt_complete_rqst() wakes up the
waiting RPC.

.ro_unmap_sync is 100% synchronous: it does not return to the reply
handler until the MRs are invalid and unmapped.

For FMR, .ro_unmap_sync makes a list of the RPC's MRs and passes that
list to a single ib_unmap_fmr() call, then performs DMA unmap and
releases the MRs.
This is actually much more efficient than the current logic, which
serially does an ib_unmap_fmr() for each MR the RPC owns. So FMR
overall performs better with this change.

For FRWR, .ro_unmap_sync builds a chain of LOCAL_INV WRs for the
RPC's MRs and posts that chain with a single ib_post_send(). The
final WR in the chain is signaled, and a kernel completion is used to
wait for the LINV chain to complete; then the MRs are DMA unmapped
and released.

This lengthens per-RPC latency for FRWR, because the LINVs are now
fully accounted for in the RPC round trip rather than being done
asynchronously after the RPC completes. So here FRWR performance is
closer to FMR's, but is still better by a substantial margin.

Because the next RPC cannot awaken until the last send completes,
send queue accounting is based on RPC/RDMA credit flow control. I'm
sure there are some details here that still need to be addressed, but
this fixes the big problem with FRWR send queue accounting, which was
that LOCAL_INV WRs would continue to consume SQEs while another RPC
was allowed to start.

I think switching to use s/g lists will be straightforward and could
simplify the overall approach somewhat.

> rkey security in SRP/iSER..
>
> Sharing code means we can fix those problems for good.

--
Chuck Lever