On Thu, Oct 31, 2024 at 09:17:45PM +0000, Robin Murphy wrote:
> The hilarious amount of work that iommu_dma_map_sg() does is pretty much
> entirely for the benefit of v4l2 and dma-buf importers who *depend* on
> being able to linearise a scatterlist in DMA address space. TBH I doubt
> there are many actual scatter-gather-capable devices with significant
> enough limitations to meaningfully benefit from DMA segment combining these
> days - I've often thought that by now it might be a good idea to turn that
> behaviour off by default and add an attribute for callers to explicitly
> request it.

Even when devices are not limited they often perform significantly
better when IOVA space is not completely fragmented.  While the
dma_map_sg code is a bit gross because it has to deal with unaligned
segments, the coalescing itself often is a big win.

Note that dma_map_sg also has two other very useful features: batching
of the iotlb flushing, and support for P2P, which to be efficient also
requires batching the lookups.

>> This uniqueness has been a long standing pain point as the scatterlist API
>> is mandatory, but expensive to use.
>
> Huh? When and where has anything ever called it mandatory? Nobody's getting
> sent to DMA jail for open-coding:

You don't get sent to jail.  But you do not get batched iotlb sync, you
don't get properly working P2P, and you don't get IOVA coalescing.

>> Several approaches have been explored to expand the DMA API with additional
>> scatterlist-like structures (BIO, rlist), instead split up the DMA API
>> to allow callers to bring their own data structure.
>
> And this line of reasoning is still "2 + 2 = Thursday" - what is to say
> those two notions in any way related? We literally already have one generic
> DMA operation which doesn't operate on struct page, yet needed nothing
> "split up" to be possible.

Yeah, I don't really get the struct page argument.  In fact if we look
at the nitty-gritty details of dma_map_page it doesn't really need a
page at all.  I've been looking at cleaning some of this up and
providing a dma_map_phys/paddr which would be quite handy in a few
places.  Not because we don't have a struct page for the memory, but
because converting to/from it all the time is not very efficient.

>> 2. VFIO PCI live migration code is building a very large "page list"
>> for the device. Instead of allocating a scatter list entry per allocated
>> page it can just allocate an array of 'struct page *', saving a large
>> amount of memory.
>
> VFIO already assumes a coherent device with (realistically) an IOMMU which
> it explicitly manages - why is it even pretending to need a generic DMA
> API?

AFAIK that isn't really vfio as we know it but the control device for
live migration.  But Leon or Jason might fill in more.  The point is
that quite a few devices have these page list based APIs (RDMA where
mlx5 comes from, NVMe with PRPs, AHCI, GPUs).

>
>> 3. NVMe PCI demonstrates how a BIO can be converted to a HW scatter
>> list without having to allocate then populate an intermediate SG table.
>
> As above, given that a bio_vec still deals in struct pages, that could
> seemingly already be done by just mapping the pages, so how is it proving
> any benefit of a fragile new interface?

Because we only need to preallocate the tiny constant-sized
dma_iova_state as part of the request instead of an additional
scatterlist that requires sizeof(struct page *) + sizeof(dma_addr_t) +
3 * sizeof(unsigned int) per segment, including a memory allocation per
I/O for that.
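
To put rough numbers on that (illustrative only - the struct names below
are made up, they just mirror struct scatterlist and the dma_iova_state
from this series as I read it, and the sizes assume a common 64-bit
config with CONFIG_NEED_SG_DMA_LENGTH=y):

#include <linux/types.h>

/* per-segment cost of the scatterlist path (mirrors struct scatterlist) */
struct sg_entry_cost {
	unsigned long	page_link;	/* 8: struct page * plus chain/end bits */
	unsigned int	offset;		/* 4 */
	unsigned int	length;		/* 4 */
	dma_addr_t	dma_address;	/* 8 */
	unsigned int	dma_length;	/* 4 */
};	/* 28 bytes, padded to 32, times nr_segments, kmalloc'ed per I/O */

/* constant per-request cost of the new path (mirrors dma_iova_state) */
struct iova_state_cost {
	dma_addr_t	addr;		/* 8 */
	u64		size;		/* 8 */
};	/* 16 bytes total, embedded in the preallocated request, no allocation */

So for a 128-segment I/O that is about 4k of scatterlist allocated and
filled at submission time versus 16 bytes that simply sit in the
request.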
> My big concern here is that a thin and vaguely-defined wrapper around the
> IOMMU API is itself a step which smells strongly of "abuse and design
> mistake", given that the basic notion of allocating DMA addresses in
> advance clearly cannot generalise. Thus it really demands some considered
> justification beyond "We must do something; This is something; Therefore we
> must do this." to be convincing.

At least for the block code we have a nice little core wrapper that is
very easy to use, and provides a great reduction of memory use and
allocations.  The HMM use case I'll let others talk about.
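
To make the driver side a bit more concrete, this is roughly the shape
of a consumer of the two-step API as I read the current posting - the
function, its calling convention and the error handling are made up for
illustration, so treat it as a sketch rather than the actual block code:

#include <linux/dma-mapping.h>
#include <linux/mm.h>

/*
 * state is the small, constant-sized dma_iova_state preallocated as
 * part of the request.  Try to grab one contiguous IOVA range up
 * front, link each physical chunk into it and do a single sync.  If
 * there is no IOMMU to coalesce through, fall back to one
 * dma_map_page() (and one device-visible address) per page.
 */
static int map_request_pages(struct device *dev, struct dma_iova_state *state,
		struct page **pages, int nr, dma_addr_t *addrs,
		enum dma_data_direction dir)
{
	size_t len = (size_t)nr * PAGE_SIZE;
	int i, ret;

	if (dma_iova_try_alloc(dev, state, page_to_phys(pages[0]), len)) {
		for (i = 0; i < nr; i++) {
			ret = dma_iova_link(dev, state,
					page_to_phys(pages[i]),
					i * PAGE_SIZE, PAGE_SIZE, dir, 0);
			if (ret)
				return ret;	/* real code: dma_iova_destroy() */
		}
		/* device sees a single contiguous DMA range */
		return dma_iova_sync(dev, state, 0, len);
	}

	/* no IOVA coalescing available: one mapping per page */
	for (i = 0; i < nr; i++) {
		addrs[i] = dma_map_page(dev, pages[i], 0, PAGE_SIZE, dir);
		if (dma_mapping_error(dev, addrs[i]))
			return -ENOMEM;		/* real code: unwind earlier maps */
	}
	return 0;
}

The point being that all the per-segment bookkeeping lives in the
caller's own data structure (bio_vec, page array, whatever), not in a
scatterlist the core forces on it.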