On Wed, Sep 07, 2022 at 07:29:58AM -0700, Christoph Hellwig wrote:
> On Wed, Sep 07, 2022 at 09:33:11AM -0300, Jason Gunthorpe wrote:
> > Yes, you said that, and I said that when the AMD driver first merged
> > it - but it went in anyhow and now people are using it in a bunch of
> > places.
>
> drm folks made up their own weird rules, if they internally stick
> to it they have to listen to it given that they ignore review comments,
> but it violates the scatterlist API and has no business anywhere
> else in the kernel. And yes, there probably is a reason or two why
> the drm code is unusually error prone.

That may be, but it is creating problems if DRM gets to do X crazy
thing and nobody else can..

So, we have two issues here:

1) DMABUF abuses the scatter list, but this is very constrained. We
   have this weird special "DMABUF scatterlist" that is only touched
   by DMABUF importers, and the importers signal that they understand
   the format with a flag (a rough importer sketch is at the end of
   this mail).

   This is ugly and would be nice to clean up into a dma mapped
   address list of some sort. I spent a lot of time a few years ago
   removing driver touches of the SGL and preparing the RDMA stack to
   do this kind of change, at least.

2) DMABUF abuses dma_map_resource() for P2P and thus doesn't work in
   certain special cases.

   Rather than jump to ZONE_DEVICE and map_sgl I would like to
   continue to support non-struct page mapping. So, I would suggest
   adding a dma_map_p2p() that can cover off the special cases,
   taking the two struct devices as arguments along with a physical
   PFN/size, and doing the same logic we do under the ZONE_DEVICE
   stuff (see the prototype sketch at the end of this mail). Drivers
   can then choose if they want to pay the memory cost of ZONE_DEVICE
   and get faster dma_map, or get slower dma_map and save memory.

I still think we can address them incrementally - but the
dma_map_p2p() might be small enough to sort out right away, if you
are OK with it.

> > > Why would small BARs be problematic for the pages? The pages are more
> > > a problem for gigantic BARs due to the memory overhead.
> >
> > How do I get a struct page * for a 4k BAR in vfio?
>
> I guess we have different definitions of small then :)
>
> But unless my understanding of the code is out of date,
> memremap_pages just requires the (virtual) start address to be 2MB
> aligned, not the size.

Adding Dan for comments.

Don't we need the virtual start address to equal the physical pfn for
everything to work properly? eg for pfn_to_page?

And we can't over-allocate, because another driver might also want to
use ZONE_DEVICE pages for its BAR, which would then collide.

So, at least as is, the memmap stuff seems unable to support the case
we have with VFIO.

> That being said, what is the point of mapping say a 4k BAR for p2p?
> You're not going to save a measurable amount of CPU overhead if that
> is the only place you transfer to.

For the use case this series is chasing, it is doorbell rings. The
actual data transfer may still bounce through CPU memory (if a CMB is
not available), but the latency reduction of directly signaling the
peer device that the transfer is ready is the key objective.

Bouncing an interrupt through the CPU to cause it to do a writel() is
very time consuming, especially on slow ARM devices, while we have
adequate memory bandwidth for data transfer.

When I look at iommufd, it is for generality and compat. We don't
have knowledge of what the guest will do, so regardless of BAR size
we have to create P2P iommu mappings for every kind of PCI BAR. It is
what vfio is currently doing.

Jason
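
P.S. To illustrate the importer opt-in I mentioned in 1): as far as I
remember the flag lives in the importer's dma_buf_attach_ops, roughly
like the sketch below. Names are from memory, so treat this as an
illustration and check include/linux/dma-buf.h for the real
definitions.

#include <linux/dma-buf.h>

/* A dynamic importer has to be able to re-map after move_notify(). */
static void my_move_notify(struct dma_buf_attachment *attach)
{
	/* Invalidate and later rebuild whatever mapping we hold. */
}

static const struct dma_buf_attach_ops my_attach_ops = {
	/* "I understand the special P2P/MMIO form of the scatterlist" */
	.allow_peer2peer = true,
	.move_notify	 = my_move_notify,
};

static struct dma_buf_attachment *my_attach(struct dma_buf *dmabuf,
					    struct device *dev, void *priv)
{
	return dma_buf_dynamic_attach(dmabuf, dev, &my_attach_ops, priv);
}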
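
And to make the dma_map_p2p() suggestion in 2) concrete, something
along these lines is what I have in mind. Nothing like this exists
today; the name, arguments and the unmap counterpart are purely
illustrative:

#include <linux/device.h>
#include <linux/dma-mapping.h>

/*
 * Hypothetical: map @size bytes of @provider's MMIO space, starting
 * at @pfn, for DMA by @client, without requiring struct pages.
 * Internally this would run the same provider/client distance logic
 * as the ZONE_DEVICE p2pdma path and return either a PCI bus address
 * (traffic stays below a common switch) or an IOVA routed through the
 * host bridge/IOMMU, or an error if P2P between the two devices is
 * impossible.
 */
dma_addr_t dma_map_p2p(struct device *provider, struct device *client,
		       unsigned long pfn, size_t size,
		       enum dma_data_direction dir, unsigned long attrs);

void dma_unmap_p2p(struct device *client, dma_addr_t addr, size_t size,
		   enum dma_data_direction dir, unsigned long attrs);

A VFIO-style exporter with a 4k doorbell BAR would then call this at
attach time instead of dma_map_resource(), and the importer gets back
an address it can program into its device.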