On Wed, Mar 06, 2024 at 08:00:36PM -0400, Jason Gunthorpe wrote:
> >
> > I don't think you can do without dma_addr_t storage.  In most cases
> > you can just store the dma_addr_t in the LE/BE encoded hardware
> > SGL, so no extra storage should be needed though.
>
> RDMA (and often DRM too) generally doesn't work like that, the driver
> copies the page table into the device and then the only reason to have
> a dma_addr_t storage is to pass that to the dma unmap API. Optionally
> eliminating long term dma_addr_t storage would be a worthwhile memory
> savings for large long lived user space memory registrations.

It's just kinda hard to do.  For aligned IOMMU mapping you'd only
have one dma_addr_t mapping (or maybe a few if P2P regions are
involved), so this probably doesn't matter.  For direct mappings
you'd have a few, but maybe the better answer is to use THP
more aggressively and reduce the number of segments.

> I wrote the list as from a single IO operation perspective, so all but
> 5 need to store a single IOVA range that could be stored in some
> simple non-dynamic memory along with whatever HW SGLs/etc are needed.
>
> The point of 5 being different is because the driver has to provide a
> dynamically sized list of dma_addr_t's as storage until unmap. 5 is
> the only case that requires that full list.

No, all cases need to store one or more ranges.

> > > So are you thinking something more like a driver flow of:
> > >
> > >   .. extent IO and get # aligned pages and know if there is P2P ..
> > >   dma_init_io(state, num_pages, p2p_flag)
> > >   if (dma_io_single_range(state)) {
> > >        // #2, #4
> > >        for each io()
> > >             dma_link_aligned_pages(state, io range)
> > >        hw_sgl = (state->iova, state->len)
> > >   } else {
> >
> > I think what you have as dma_io_single_range should come before
> > the dma_init_io.  If we know we can't coalesce it really just is a
> > dma_map_{single,page,bvec} loop, no need for any extra state.
>
> I imagine dma_io_single_range() to just check a flag in state.
>
> I still want to call dma_init_io() for the non-coalescing cases
> because all the flows, regardless of composition, should be about as
> fast as dma_map_sg is today.

If all flows include multiple non-coalesced regions that just makes
things very complicated, and that's exactly what I'd want to avoid.

> That means we need to always pre-allocate the IOVA in any case where
> the IOMMU might be active - even on a non-coalescing flow.
>
> IOW, dma_init_io() always pre-allocates IOVA if the iommu is going to
> be used and we can't just call today's dma_map_page() in a loop on the
> non-coalescing side and pay the overhead of Nx IOVA allocations.
>
> In large part this is for RDMA, where a single P2P page in a large
> multi-gigabyte user memory registration shouldn't drastically harm the
> registration performance by falling down to doing dma_map_page, and an
> IOVA allocation, on a 4k page by page basis.

But that P2P page needs to be handled very differently, as with it
we can't actually use a single iova range.  So I'm not sure how that
is even supposed to work.  If you have

 +-------+-----+-------+
 | local | P2P | local |
 +-------+-----+-------+

you need at least 3 hw SGL entries, as the IOVA won't be contiguous.

> The other thing that got hand waved here is how does dma_init_io()
> know which of the 6 states we are looking at? I imagine we probably
> want to do something like:
>
>    struct dma_io_summarize summary = {};
>    for each io()
>         dma_io_summarize_range(&summary, io range)
>    dma_init_io(dev, &state, &summary);
>    if (state->single_range) {
>    } else {
>    }
>    dma_io_done_mapping(&state); <-- flush IOTLB once

That's why I really just want 2 cases.  If the caller guarantees the
range is coalescable and there is an IOMMU use the iommu-API like
API, else just iterate over map_single/page.

> Enhancing the single sgl case is not a big change, I think.
> It does seem simplifying for the driver to not have to coalesce SGLs
> to detect the single-SGL fast-path.
>
> > > This is not quite what you said, we split the driver flow based on
> > > needing 1 HW SGL vs need many HW SGL.
> >
> > That's at least what I intended to say, and I'm a little curious as
> > to how it came across.
>
> Ok, I was reading the discussion as more about alignment than single
> HW SGL, I think you meant alignment as implying coalescing behavior
> implying single HW SGL..

Yes.
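
To make the local/P2P/local example concrete, here is a rough userspace
sketch (not kernel code, and not any proposed API). It assumes that each
maximal run of same-kind segments can be coalesced into one IOVA range and
that every local/P2P boundary breaks IOVA contiguity. struct io_seg,
min_hw_sgl_entries() and io_is_coalescable() are hypothetical names made up
for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one piece of an IO; not a kernel structure. */
struct io_seg {
	bool is_p2p;	/* true if the segment targets peer device memory */
	size_t len;	/* length in bytes (not used for counting) */
};

/*
 * Count the minimum number of HW SGL entries, ASSUMING that a run of
 * same-kind segments maps to one contiguous IOVA range and that a
 * switch between local and P2P memory always starts a new range.
 */
static size_t min_hw_sgl_entries(const struct io_seg *segs, size_t n)
{
	size_t entries = 0;
	bool have_prev = false;
	bool prev_p2p = false;

	assert(segs != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		/* A new entry starts at the first segment and at each kind change. */
		if (!have_prev || segs[i].is_p2p != prev_p2p)
			entries++;
		have_prev = true;
		prev_p2p = segs[i].is_p2p;
	}
	return entries;
}

/* The "coalescable" case in this model: the whole IO fits one range. */
static bool io_is_coalescable(const struct io_seg *segs, size_t n)
{
	return min_hw_sgl_entries(segs, n) == 1;
}
```

Under these assumptions a local, P2P, local sequence counts as three
entries, while an all-local sequence counts as one, which is the split
between the two cases argued for above.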