Re: [RFC PATCH 00/28] Removing struct page from P2PDMA

Christoph Hellwig <hch@xxxxxx> · Thu, 27 Jun 2019 19:00:27 +0200

On Thu, Jun 27, 2019 at 10:30:42AM -0600, Logan Gunthorpe wrote:
> >  (a) a range is normal RAM, DMA mapping works as usual
> >  (b) a range is another devices BAR, in which case we need to do a
> >      map_resource equivalent (which really just means don't bother with
> >      cache flush on non-coherent architectures) and apply any needed
> >      offset, fixed or iommu based
> 
> Well I would split this into two cases: (b1) ranges in another device's
> BAR that will pass through the root complex and require a map_resource
> equivalent and (b2) ranges in another device's bar that don't pass
> through the root complex and require applying an offset to the bus
> address. Both require rather different handling and the submitting
> driver should already know ahead of time what type we have.

True.

> 
> >  (c) a range points to a BAR on the acting device. In which case we
> >      don't need to DMA map at all, because no dma is happening but just an
> >      internal transfer.  And depending on the device that might also require
> >      a different addressing mode
> 
> I think (c) is actually just a special case of (b2). Any device that has
> a special protocol for addressing the local BAR can just do a range
> compare on the address to determine if it's local or not. Devices that
> don't have a special protocol for this would handle both (c) and (b2)
> the same.

It is not.  (c) is fundamentally very different as it is not actually
an operation that ever goes out to the wire at all, which is why the
actual physical address on the wire does not matter at all.
Some interfaces like NVMe have designed it in a way that the commands
used to do this internal transfer look like (b2), but that is just their
(IMHO very questionable) interface design choice, that produces a whole
chain of problems.

> > I guess it might make sense to just have a block layer flag that (b) or
> > (c) might be contained in a bio.  Then we always look up the data
> > structure, but can still fall back to (a) if nothing was found.  That
> > even allows free mixing and matching of memory types, at least as long
> > as they are contained to separate bio_vec segments.
> 
> IMO these three cases should be reflected in flags in the bio_vec. We'd
> probably still need a queue flag to indicate support for mapping these,
> but a flag on the bio that indicates special cases *might* exist in the
> bio_vec and the driver has to do extra work to somehow distinguish the
> three types doesn't seem useful. bio_vec flags also make it easy to
> support mixing segments from different memory types.

So I initially suggested these flags.  But without a pgmap we absolutely
need a lookup operation to find which phys address ranges map to which
device.  And once we do that lookup, the only thing we need is a flag
saying that we need that information, and everything else can be in the
data structure returned from that lookup.
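
To make that a bit more concrete, here is a rough user-space sketch of
what such a lookup could look like.  Every name in it (range_kind,
bar_range, lookup_range, classify_segment) is made up for illustration
and is not an existing kernel interface or part of this series.  It also
assumes a flat table and treats "goes through the root complex" as a
property of the BAR, while in reality that depends on the pair of
devices involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical classification matching cases (a), (b1), (b2) and (c). */
enum range_kind {
	RANGE_RAM,	/* (a)  normal RAM, DMA mapping works as usual */
	RANGE_P2P_HOST,	/* (b1) peer BAR, routed through the root complex */
	RANGE_P2P_BUS,	/* (b2) peer BAR, bus address is phys + offset */
	RANGE_LOCAL,	/* (c)  BAR on the acting device, no DMA at all */
};

/* Hypothetical description of one registered BAR. */
struct bar_range {
	uint64_t start;		/* first physical address of the BAR */
	uint64_t len;		/* length in bytes */
	int owner;		/* made-up device id of the BAR's owner */
	int via_root_complex;	/* simplification: really a per-pair property */
	int64_t bus_offset;	/* offset to apply in the (b2) case */
};

/* Everything a driver needs comes back from the lookup itself. */
struct range_info {
	enum range_kind kind;
	const struct bar_range *bar;	/* NULL when the address is RAM */
};

/* Find which registered BAR, if any, contains phys. */
static struct range_info lookup_range(const struct bar_range *table,
				      size_t n, uint64_t phys, int acting_dev)
{
	struct range_info info = { RANGE_RAM, NULL };

	assert(table != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		const struct bar_range *r = &table[i];

		if (phys < r->start || phys - r->start >= r->len)
			continue;

		info.bar = r;
		if (r->owner == acting_dev)
			info.kind = RANGE_LOCAL;
		else if (r->via_root_complex)
			info.kind = RANGE_P2P_HOST;
		else
			info.kind = RANGE_P2P_BUS;
		break;
	}

	/* Nothing found: fall back to (a). */
	return info;
}

/*
 * The single flag only says "this I/O might contain (b) or (c)".
 * Without it we skip the lookup entirely and treat the segment as RAM.
 */
static struct range_info classify_segment(int may_be_special,
					  const struct bar_range *table,
					  size_t n, uint64_t phys,
					  int acting_dev)
{
	struct range_info ram = { RANGE_RAM, NULL };

	if (!may_be_special)
		return ram;
	return lookup_range(table, n, phys, acting_dev);
}
```

The point of the sketch is just that the flag gates the lookup, and the
distinction between the cases lives in what the lookup returns instead
of in per-segment flags.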