Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

Christoph Hellwig <hch@xxxxxx> · Wed, 6 Mar 2024 23:14:00 +0100

On Wed, Mar 06, 2024 at 01:44:56PM -0400, Jason Gunthorpe wrote:
> There is a list of interesting cases this has to cover:
> 
>  1. Direct map. No dma_addr_t at unmap, multiple HW SGLs
>  2. IOMMU aligned map, no P2P. Only IOVA range at unmap, single HW SGLs
>  3. IOMMU aligned map, P2P. Only IOVA range at unmap, multiple HW SGLs
>  4. swiotlb single range. Only IOVA range at unmap, single HW SGL
>  5. swiotlb multi-range. All dma_addr_t's at unmap, multiple HW SGLs.
>  6. Unaligned IOMMU. Only IOVA range at unmap, multiple HW SGLs
> 
> I think we agree that 1 and 2 should be optimized highly as they are
> the common case. That mainly means no dma_addr_t storage in either

I don't think you can do without dma_addr_t storage.  In most cases
you can just store the dma_addr_t in the LE/BE encoded hardware
SGL, though, so no extra storage should be needed.
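
To illustrate the point about reusing the hardware SGL as the storage, here
is a minimal userspace sketch.  The entry layout and all of the helper names
(hw_sgl_entry, hw_sgl_set, hw_sgl_addr, hw_sgl_len) are made up for the
example; a real driver would use its device's descriptor format with __le64
fields and cpu_to_le64()/le64_to_cpu() instead of the byte-wise helpers below.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's dma_addr_t (assumption: 64-bit). */
typedef uint64_t dma_addr_t;

/* Hypothetical little-endian hardware SGL entry, not a real device format. */
struct hw_sgl_entry {
	uint8_t addr_le[8];	/* DMA address, little endian */
	uint8_t len_le[4];	/* length in bytes, little endian */
};

/* Store a value as little-endian bytes regardless of host endianness. */
static void put_le(uint8_t *p, uint64_t v, size_t bytes)
{
	for (size_t i = 0; i < bytes; i++)
		p[i] = (uint8_t)(v >> (8 * i));
}

/* Read a little-endian value back into host byte order. */
static uint64_t get_le(const uint8_t *p, size_t bytes)
{
	uint64_t v = 0;

	for (size_t i = 0; i < bytes; i++)
		v |= (uint64_t)p[i] << (8 * i);
	return v;
}

/* Map time: the only copy of the address lives in the hardware entry. */
static void hw_sgl_set(struct hw_sgl_entry *e, dma_addr_t addr, uint32_t len)
{
	put_le(e->addr_le, addr, sizeof(e->addr_le));
	put_le(e->len_le, len, sizeof(e->len_le));
}

/* Unmap time: recover the address and length from the hardware entry. */
static dma_addr_t hw_sgl_addr(const struct hw_sgl_entry *e)
{
	return get_le(e->addr_le, sizeof(e->addr_le));
}

static uint32_t hw_sgl_len(const struct hw_sgl_entry *e)
{
	return (uint32_t)get_le(e->len_le, sizeof(e->len_le));
}
```

At unmap time the driver would walk its own descriptors and pass
hw_sgl_addr()/hw_sgl_len() to the unmap call, so no separate dma_addr_t
array is kept.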

> 3 is quite similar to 1, but it has the IOVA range at unmap.

Can you explain what P2P case you mean?  The switch one with the
bus address is indeed basically the same, just with potentially a
different offset, while the through host bridge case is the same
as a normal iommu map.

> 
> 4 is basically the same as 2 from the driver's viewpoint

I'd actually treat it the same as 1.

> 5 is the slowest and has the most overhead.

And 5 could be broken into multiple 4s, at least for now.  Or do you
have a different definition of range here?

> So are you thinking something more like a driver flow of:
> 
>   .. extent IO and get # aligned pages and know if there is P2P ..
>   dma_init_io(state, num_pages, p2p_flag)
>   if (dma_io_single_range(state)) {
>        // #2, #4
>        for each io()
> 	    dma_link_aligned_pages(state, io range)
>        hw_sgl = (state->iova, state->len)
>   } else {

I think what you have as dma_io_single_range should come before
the dma_init_io.  If we know we can't coalesce, it really is just a
dma_map_{single,page,bvec} loop, with no need for any extra state.
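
A rough userspace sketch of that ordering, to make the split concrete.
Everything here is hypothetical: struct seg, can_coalesce(), mock_map_one(),
map_io(), the 4k granule and the fake IOVA base are illustration only, and
the coalescing test (every interior segment boundary falls on a granule
boundary) is my assumption of what "can coalesce" means with an IOMMU.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t dma_addr_t;		/* userspace stand-in */

struct seg { uint64_t phys; uint32_t len; };	/* one piece of the I/O */
struct hw_sgl { dma_addr_t addr; uint32_t len; };

#define GRANULE		4096u		/* assumed IOMMU granule */
#define MOCK_IOVA_BASE	0x100000ull	/* fake IOVA allocation */

/*
 * Decided up front, before any state is set up: the I/O fits one IOVA
 * range only if no segment boundary falls inside a granule.
 */
static bool can_coalesce(const struct seg *s, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (i > 0 && (s[i].phys % GRANULE))
			return false;
		if (i + 1 < n && ((s[i].phys + s[i].len) % GRANULE))
			return false;
	}
	return true;
}

/* Stand-in for a dma_map_{single,page,bvec} style call (identity map). */
static dma_addr_t mock_map_one(const struct seg *s)
{
	return s->phys;
}

/* Returns the number of hardware SGL entries written to out[]. */
static size_t map_io(const struct seg *s, size_t n, struct hw_sgl *out)
{
	if (n == 0)
		return 0;

	if (can_coalesce(s, n)) {
		/* Single range: one HW SGL entry covering the whole I/O. */
		uint32_t total = 0;

		for (size_t i = 0; i < n; i++)
			total += s[i].len;
		out[0].addr = MOCK_IOVA_BASE + s[0].phys % GRANULE;
		out[0].len = total;
		return 1;
	}

	/* Can't coalesce: plain per-segment loop, no extra state. */
	for (size_t i = 0; i < n; i++) {
		out[i].addr = mock_map_one(&s[i]);
		out[i].len = s[i].len;
	}
	return n;
}
```

The only point of the sketch is that the coalescing decision happens before
any mapping state exists, so the non-coalescing path never allocates or
initializes it.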

And we're back to roughly the proposal I sent out years ago.

> This is not quite what you said, we split the driver flow based on
> needing 1 HW SGL vs need many HW SGL.

That's at least what I intended to say, and I'm a little curious as to
how it came across.