Re: [RFC PATCH 00/28] Removing struct page from P2PDMA

Logan Gunthorpe <logang@xxxxxxxxxxxx> · Thu, 27 Jun 2019 12:00:35 -0600

On 2019-06-27 11:00 a.m., Christoph Hellwig wrote:
> It is not.  (c) is fundamentally very different as it is not actually
> an operation that ever goes out to the wire at all, and which is why the
> actual physical address on the wire does not matter at all.
> Some interfaces like NVMe have designed it in a way that it the commands
> used to do this internal transfer look like (b2), but that is just their
> (IMHO very questionable) interface design choice, that produces a whole
> chain of problems.

>From the mapping device's driver's perspective yes, but from the
perspective of a submitting driver they would be the same.

>>> I guess it might make sense to just have a block layer flag that (b) or
>>> (c) might be contained in a bio.  Then we always look up the data
>>> structure, but can still fall back to (a) if nothing was found.  That
>>> even allows free mixing and matching of memory types, at least as long
>>> as they are contained to separate bio_vec segments.
>>
>> IMO these three cases should be reflected in flags in the bio_vec. We'd
>> probably still need a queue flag to indicate support for mapping these,
>> but a flag on the bio that indicates special cases *might* exist in the
>> bio_vec and the driver has to do extra work to somehow distinguish the
>> three types doesn't seem useful. bio_vec flags also make it easy to
>> support mixing segments from different memory types.
> 
> So I іnitially suggested these flags.  But without a pgmap we absolutely
> need a lookup operation to find which phys address ranges map to which
> device.  And once we do that the data structure the only thing we need
> is a flag saying that we need that information, and everything else
> can be in the data structure returned from that lookup.

Yes, you did suggest them. But what I'm trying to suggest is we don't
*necessarily* need the lookup. For demonstration purposes only, a
submitting driver could very roughly potentially do:

struct bio_vec vec;
dist = pci_p2pdma_dist(provider_pdev, mapping_pdev);
if (dist < 0) {
     /* use regular memory */
     vec.bv_addr = virt_to_phys(kmalloc(...));
     vec.bv_flags = 0;
} else if (dist & PCI_P2PDMA_THRU_HOST_BRIDGE) {
     vec.bv_addr = pci_p2pmem_alloc_phys(provider_pdev, ...);
     vec.bv_flags = BVEC_MAP_RESOURCE;
} else {
     vec.bv_addr = pci_p2pmem_alloc_bus_addr(provider_pdev, ...);
     vec.bv_flags = BVEC_MAP_BUS_ADDR;
}

-- And a mapping driver would roughly just do:

dma_addr_t dma_addr;
if (vec.bv_flags & BVEC_MAP_BUS_ADDR) {
     if (pci_bus_addr_in_bar(mapping_pdev, vec.bv_addr, &bar, &off))  {
          /* case (c) */
          /* program the DMA engine with bar and off */
     } else {
          /* case (b2) */
          dma_addr = vec.bv_addr;
     }
} else if (vec.bv_flags & BVEC_MAP_RESOURCE) {
     /* case (b1) */
     dma_addr = dma_map_resource(mapping_dev, vec.bv_addr, ...);
} else {
     /* case (a) */
     dma_addr = dma_map_page(..., phys_to_page(vec.bv_addr), ...);
}

The real difficulty here is that you'd really want all the above handled
by a dma_map_bvec() so it can combine every vector hitting the IOMMU
into a single continuous IOVA -- but it's hard to fit case (c) into that
equation. So it might be that a dma_map_bvec() handles cases (a), (b1)
and (b2) and the mapping driver has to then check each resulting DMA
vector for pci_bus_addr_in_bar() while it is programming the DMA engine
to deal with case (c).

Logan