On Thu, Oct 31, 2024 at 09:17:45PM +0000, Robin Murphy wrote:
> The hilarious amount of work that iommu_dma_map_sg() does is pretty much
> entirely for the benefit of v4l2 and dma-buf importers who *depend* on
> being able to linearise a scatterlist in DMA address space. TBH I doubt
> there are many actual scatter-gather-capable devices with significant
> enough limitations to meaningfully benefit from DMA segment combining these
> days - I've often thought that by now it might be a good idea to turn that
> behaviour off by default and add an attribute for callers to explicitly
> request it.

Even when devices are not limited they often perform significantly
better when IOVA space is not completely fragmented.  While the
dma_map_sg code is a bit gross because it has to deal with unaligned
segments, the coalescing itself often is a big win.

Note that dma_map_sg also has two other very useful features: batching
of the iotlb flushing, and support for P2P, which to be efficient also
requires batching the lookups.

>> This uniqueness has been a long standing pain point as the scatterlist API
>> is mandatory, but expensive to use.
>
> Huh? When and where has anything ever called it mandatory? Nobody's getting
> sent to DMA jail for open-coding:

You don't get sent to jail.  But you do not get batched iotlb sync, you
don't get properly working P2P, and you don't get IOVA coalescing.

>> Several approaches have been explored to expand the DMA API with additional
>> scatterlist-like structures (BIO, rlist), instead split up the DMA API
>> to allow callers to bring their own data structure.
>
> And this line of reasoning is still "2 + 2 = Thursday" - what is to say
> those two notions in any way related? We literally already have one generic
> DMA operation which doesn't operate on struct page, yet needed nothing
> "split up" to be possible.

Yeah, I don't really get the struct page argument.  In fact if we look
at the nitty-gritty details of dma_map_page it doesn't really need a
page at all.  I've been looking at cleaning some of this up and
providing a dma_map_phys/paddr which would be quite handy in a few
places.  Not because we don't have a struct page for the memory, but
because converting to/from it all the time is not very efficient.

>> 2. VFIO PCI live migration code is building a very large "page list"
>> for the device. Instead of allocating a scatter list entry per allocated
>> page it can just allocate an array of 'struct page *', saving a large
>> amount of memory.
>
> VFIO already assumes a coherent device with (realistically) an IOMMU which
> it explicitly manages - why is it even pretending to need a generic DMA
> API?

AFAIK that isn't really vfio as we know it but the control device for
live migration.  But Leon or Jason might fill in more.  The point is
that quite a few devices have these page list based APIs (RDMA where
mlx5 comes from, NVMe with PRPs, AHCI, GPUs).

>
>> 3. NVMe PCI demonstrates how a BIO can be converted to a HW scatter
>> list without having to allocate then populate an intermediate SG table.
>
> As above, given that a bio_vec still deals in struct pages, that could
> seemingly already be done by just mapping the pages, so how is it proving
> any benefit of a fragile new interface?

Because we only need to preallocate the tiny constant-sized
dma_iova_state as part of the request instead of an additional
scatterlist that requires sizeof(struct page *) + sizeof(dma_addr_t) +
3 * sizeof(unsigned int) per segment, including a memory allocation per
I/O for that.
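
To put rough numbers on that (illustrative only - the struct names below
are made up, they just mirror struct scatterlist and the dma_iova_state
from this series as I read it, and the sizes assume a common 64-bit
config with CONFIG_NEED_SG_DMA_LENGTH=y):

#include <linux/types.h>

/* per-segment cost of the scatterlist path (mirrors struct scatterlist) */
struct sg_entry_cost {
	unsigned long	page_link;	/* 8: struct page * plus chain/end bits */
	unsigned int	offset;		/* 4 */
	unsigned int	length;		/* 4 */
	dma_addr_t	dma_address;	/* 8 */
	unsigned int	dma_length;	/* 4 */
};	/* 28 bytes, padded to 32, times nr_segments, kmalloc'ed per I/O */

/* constant per-request cost of the new path (mirrors dma_iova_state) */
struct iova_state_cost {
	dma_addr_t	addr;		/* 8 */
	u64		size;		/* 8 */
};	/* 16 bytes total, embedded in the preallocated request, no allocation */

So for a 128-segment I/O that is about 4k of scatterlist allocated and
filled at submission time versus 16 bytes that simply sit in the
request.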
> My big concern here is that a thin and vaguely-defined wrapper around the
> IOMMU API is itself a step which smells strongly of "abuse and design
> mistake", given that the basic notion of allocating DMA addresses in
> advance clearly cannot generalise. Thus it really demands some considered
> justification beyond "We must do something; This is something; Therefore we
> must do this." to be convincing.

At least for the block code we have a nice little core wrapper that is
very easy to use, and provides a great reduction of memory use and
allocations.  The HMM use case I'll let others talk about.
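
To make the driver side a bit more concrete, this is roughly the shape
of a consumer of the two-step API as I read the current posting - the
function, its calling convention and the error handling are made up for
illustration, so treat it as a sketch rather than the actual block code:

#include <linux/dma-mapping.h>
#include <linux/mm.h>

/*
 * state is the small, constant-sized dma_iova_state preallocated as
 * part of the request.  Try to grab one contiguous IOVA range up
 * front, link each physical chunk into it and do a single sync.  If
 * there is no IOMMU to coalesce through, fall back to one
 * dma_map_page() (and one device-visible address) per page.
 */
static int map_request_pages(struct device *dev, struct dma_iova_state *state,
		struct page **pages, int nr, dma_addr_t *addrs,
		enum dma_data_direction dir)
{
	size_t len = (size_t)nr * PAGE_SIZE;
	int i, ret;

	if (dma_iova_try_alloc(dev, state, page_to_phys(pages[0]), len)) {
		for (i = 0; i < nr; i++) {
			ret = dma_iova_link(dev, state,
					page_to_phys(pages[i]),
					i * PAGE_SIZE, PAGE_SIZE, dir, 0);
			if (ret)
				return ret;	/* real code: dma_iova_destroy() */
		}
		/* device sees a single contiguous DMA range */
		return dma_iova_sync(dev, state, 0, len);
	}

	/* no IOVA coalescing available: one mapping per page */
	for (i = 0; i < nr; i++) {
		addrs[i] = dma_map_page(dev, pages[i], 0, PAGE_SIZE, dir);
		if (dma_mapping_error(dev, addrs[i]))
			return -ENOMEM;		/* real code: unwind earlier maps */
	}
	return 0;
}

The point being that all the per-segment bookkeeping lives in the
caller's own data structure (bio_vec, page array, whatever), not in a
scatterlist the core forces on it.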