On Wed, Nov 04, 2020 at 05:31:35PM +0100, Christoph Hellwig wrote: > On Wed, Nov 04, 2020 at 11:52:55AM -0400, Jason Gunthorpe wrote: > > It could work, I think a resonable ULP API would be to have some > > > > rdma_fill_ib_sge_from_sgl() > > rdma_map_sge_single() > > etc etc > > > > ie instead of wrappering the DMA API as-is we have a new API that > > directly builds the ib_sge. It always fills the local_dma_lkey from > > the pd, so it knows it is doing DMA from local kernel memory. > > Yeah. > > > Logically SW devices then have a local_dma_lkey MR that has an IOVA of > > the CPU physical address space, not the DMA address space as HW > > devices have. The ib_sge builders can know this detail and fill in > > addr from either a cpu phyical or a dma map. > > I don't think the builders are the right place to do it - it really > should to be in the low-level drivers for a bunch of reasons: At this point we have little choice, the ULP is responsible for map/unmap because the ULP owns the CQ and (batch) unmap is triggered by some CQE. Reworking all drivers to somehow keep track of unmaps a CQEs triggers feels so hard at this point as to be impossible. It is why the rdma_rw_ctx basically exists. So we have to keep the current arrangment, when the ib_sge is built the dma map must be conditionally done. > 1) this avoids doing the dma_map when no DMA is performed, e.g. for > mlx5 when send data is in the extended WQE At least in the kernel, the ULP has to be involved today in IB_SEND_INLINE. Userspace does an auto-copy, but that is because it doesn't have dma map/unmap. Without unmap tracking as above the caller must supply a specially formed ib_sge when using IB_SEND_INLINE that specifies the CPU vaddr so memcpy works. But this all looks like dead code, no ULP sets IB_SEND_INLINE ?? > 2) to deal with the fact that dma mapping reduces the number of SGEs. > When the system uses a modern IOMMU we'll always end up with a > single IOVA range no matter how many pages were mapped originally. > This means any MR process can actually be consolidated to use > a single SGE with the local lkey. Any place like rdma_rw_ctx_init() that decides dynamically between SGE and MR becomes a mess. It would be manageable if rdma_rw_ctx_init() was the only place doing this.. I haven't looked lately, but I wonder if it is feasible that all the MR users would use this API? Jason