On 2019-06-27 11:00 a.m., Christoph Hellwig wrote: > It is not. (c) is fundamentally very different as it is not actually > an operation that ever goes out to the wire at all, and which is why the > actual physical address on the wire does not matter at all. > Some interfaces like NVMe have designed it in a way that it the commands > used to do this internal transfer look like (b2), but that is just their > (IMHO very questionable) interface design choice, that produces a whole > chain of problems. >From the mapping device's driver's perspective yes, but from the perspective of a submitting driver they would be the same. >>> I guess it might make sense to just have a block layer flag that (b) or >>> (c) might be contained in a bio. Then we always look up the data >>> structure, but can still fall back to (a) if nothing was found. That >>> even allows free mixing and matching of memory types, at least as long >>> as they are contained to separate bio_vec segments. >> >> IMO these three cases should be reflected in flags in the bio_vec. We'd >> probably still need a queue flag to indicate support for mapping these, >> but a flag on the bio that indicates special cases *might* exist in the >> bio_vec and the driver has to do extra work to somehow distinguish the >> three types doesn't seem useful. bio_vec flags also make it easy to >> support mixing segments from different memory types. > > So I іnitially suggested these flags. But without a pgmap we absolutely > need a lookup operation to find which phys address ranges map to which > device. And once we do that the data structure the only thing we need > is a flag saying that we need that information, and everything else > can be in the data structure returned from that lookup. Yes, you did suggest them. But what I'm trying to suggest is we don't *necessarily* need the lookup. For demonstration purposes only, a submitting driver could very roughly potentially do: struct bio_vec vec; dist = pci_p2pdma_dist(provider_pdev, mapping_pdev); if (dist < 0) { /* use regular memory */ vec.bv_addr = virt_to_phys(kmalloc(...)); vec.bv_flags = 0; } else if (dist & PCI_P2PDMA_THRU_HOST_BRIDGE) { vec.bv_addr = pci_p2pmem_alloc_phys(provider_pdev, ...); vec.bv_flags = BVEC_MAP_RESOURCE; } else { vec.bv_addr = pci_p2pmem_alloc_bus_addr(provider_pdev, ...); vec.bv_flags = BVEC_MAP_BUS_ADDR; } -- And a mapping driver would roughly just do: dma_addr_t dma_addr; if (vec.bv_flags & BVEC_MAP_BUS_ADDR) { if (pci_bus_addr_in_bar(mapping_pdev, vec.bv_addr, &bar, &off)) { /* case (c) */ /* program the DMA engine with bar and off */ } else { /* case (b2) */ dma_addr = vec.bv_addr; } } else if (vec.bv_flags & BVEC_MAP_RESOURCE) { /* case (b1) */ dma_addr = dma_map_resource(mapping_dev, vec.bv_addr, ...); } else { /* case (a) */ dma_addr = dma_map_page(..., phys_to_page(vec.bv_addr), ...); } The real difficulty here is that you'd really want all the above handled by a dma_map_bvec() so it can combine every vector hitting the IOMMU into a single continuous IOVA -- but it's hard to fit case (c) into that equation. So it might be that a dma_map_bvec() handles cases (a), (b1) and (b2) and the mapping driver has to then check each resulting DMA vector for pci_bus_addr_in_bar() while it is programming the DMA engine to deal with case (c). Logan