Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

Jason Gunthorpe <jgg@xxxxxxxx> · Thu, 7 Mar 2024 17:01:16 -0400

On Thu, Mar 07, 2024 at 04:05:05PM +0100, Christoph Hellwig wrote:
> On Wed, Mar 06, 2024 at 08:00:36PM -0400, Jason Gunthorpe wrote:
> > > 
> > > I don't think you can do without dma_addr_t storage.  In most cases
> > > your can just store the dma_addr_t in the LE/BE encoded hardware
> > > SGL, so no extra storage should be needed though.
> > 
> > RDMA (and often DRM too) generally doesn't work like that, the driver
> > copies the page table into the device and then the only reason to have
> > a dma_addr_t storage is to pass that to the dma unmap API. Optionally
> > eliminating long term dma_addr_t storage would be a worthwhile memory
> > savings for large long lived user space memory registrations.
> 
> It's just kinda hard to do.  For aligned IOMMU mapping you'd only
> have one dma_addr_t mappings (or maybe a few if P2P regions are
> involved), so this probably doesn't matter.  For direct mappings
> you'd have a few, but maybe the better answer is to use THP
> more aggressively and reduce the number of segments.

Right, those things have all been done. 100GB of huge pages is still
using a fair amount of memory for storing dma_addr_t's.

It is hard to do perfectly, but I think it is not so bad if we focus
on the direct only case and simple systems that can exclude swiotlb
early on.

> > > > So are you thinking something more like a driver flow of:
> > > > 
> > > >   .. extent IO and get # aligned pages and know if there is P2P ..
> > > >   dma_init_io(state, num_pages, p2p_flag)
> > > >   if (dma_io_single_range(state)) {
> > > >        // #2, #4
> > > >        for each io()
> > > > 	    dma_link_aligned_pages(state, io range)
> > > >        hw_sgl = (state->iova, state->len)
> > > >   } else {
> > > 
> > > I think what you have a dma_io_single_range should become before
> > > the dma_init_io.  If we know we can't coalesce it really just is a
> > > dma_map_{single,page,bvec} loop, no need for any extra state.
> > 
> > I imagine dma_io_single_range() to just check a flag in state.
> > 
> > I still want to call dma_init_io() for the non-coalescing cases
> > because all the flows, regardless of composition, should be about as
> > fast as dma_map_sg is today.
> 
> If all flows includes multiple non-coalesced regions that just makes
> things very complicated, and that's exactly what I'd want to avoid.

I don't see how to avoid it unless we say RDMA shouldn't use this API,
which is kind of the whole point from my perspective..

I want an API that can handle all the same complexity as dma_map_sg()
without forcing the use of scatterlist. Instead "bring your own
datastructure". This is the essence of what we discussed.

An API that is inferior to dma_map_sg() is really problematic to use
with RDMA.

> > That means we need to always pre-allocate the IOVA in any case where
> > the IOMMU might be active - even on a non-coalescing flow.
> > 
> > IOW, dma_init_io() always pre-allocates IOVA if the iommu is going to
> > be used and we can't just call today's dma_map_page() in a loop on the
> > non-coalescing side and pay the overhead of Nx IOVA allocations.
> > 
> > In large part this is for RDMA, were a single P2P page in a large
> > multi-gigabyte user memory registration shouldn't drastically harm the
> > registration performance by falling down to doing dma_map_page, and an
> > IOVA allocation, on a 4k page by page basis.
> 
> But that P2P page needs to be handled very differently, as with it
> we can't actually use a single iova range.  So I'm not sure how that
> is even supposed to work.  If you have
> 
>  +-------+-----+-------+
>  | local | P2P | local |
>  +-------+-----+-------+
> 
> you need at least 3 hw SGL entries, as the IOVA won't be contigous.

Sure, 3 SGL entries is fine, that isn't what I'm pointing at

I'm saying that today if you give such a scatterlist to dma_map_sg()
it scans it and computes the IOVA space need, allocates one IOVA
space, then subdivides that single space up into the 3 HW SGLs you
show.

If you don't preserve that then we are calling, 4k at a time, a
dma_map_page() which is not anywhere close to the same outcome as what
dma_map_sg did. I may not get contiguous IOVA, I may not get 3 SGLs,
and we call into the IOVA allocator a huge number of times.

It needs to work following the same basic structure of dma_map_sg,
unfolding that logic into helpers so that the driver can provide
the data structure:

 - Scan the io ranges and figure out how much IOVA needed
   (dma_io_summarize_range)
 - Allocate the IOVA (dma_init_io)
 - Scan the io ranges again generate the final HW SGL
   (dma_io_link_page)
 - Finish the iommu batch (dma_io_done_mapping)

And you can make that pattern work for all the other cases too.

So I don't see this as particularly worse, calling some other API
instead of dma_map_page is not really a complexity on the
driver. Calling dma_init_io every time is also not a complexity. The
DMA API side is a bit more, but not substantively different logic from
what dma_map_sg already does.

Otherwise what is the alternative? How do I keep these complex things
working in RDMA and remove scatterlist?

> > The other thing that got hand waved here is how does dma_init_io()
> > know which of the 6 states we are looking at? I imagine we probably
> > want to do something like:
> > 
> >    struct dma_io_summarize summary = {};
> >    for each io()
> >         dma_io_summarize_range(&summary, io range)
> >    dma_init_io(dev, &state, &summary);
> >    if (state->single_range) {
> >    } else {
> >    }
> >    dma_io_done_mapping(&state); <-- flush IOTLB once
> 
> That's why I really just want 2 cases.  If the caller guarantees the
> range is coalescable and there is an IOMMU use the iommu-API like
> API, else just iter over map_single/page.

But how does the caller even know if it is coalescable? Other than the
trivial case of a single CPU range, that is a complicated detail based
on what pages are inside the range combined with the capability of the
device doing DMA. I don't see a simple way for the caller to figure
this out. You need to sweep every page and collect some information on
it. The above is to abstract that detail.

It was simpler before the confidential compute stuff :(

Jason