Re: [RFC 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages

On Tue, Jan 28, 2025 at 03:48:54PM +0100, Thomas Hellström wrote:
> On Tue, 2025-01-28 at 09:20 -0400, Jason Gunthorpe wrote:
> > On Tue, Jan 28, 2025 at 09:51:52AM +0100, Thomas Hellström wrote:
> > 
> > > How would the pgmap device know whether P2P is actually possible
> > > without knowing the client device, (like calling
> > > pci_p2pdma_distance)
> > > and also if looking into access control, whether it is allowed?
> > 
> > The DMA API will do this, this happens after this patch is put on top
> > of Leon's DMA API patches. The mapping operation will fail and it
> > will
> > likely be fatal to whatever is going on.
> >  
> > get_dma_pfn_for_device() returns a new PFN, but that is not a DMA
> > mapped address, it is just a PFN that has another struct page under
> > it.
> > 
> > There is an implicit assumption here that P2P will work and we don't
> > need a 3rd case to handle non-working P2P..
> 
> OK. We will have the case where we want pfnmaps with driver-private
> fast interconnects to return "interconnect possible, don't migrate"
> whereas possibly other gpus and other devices would return
> "interconnect unsuitable, do migrate", so (as I understand it)
> something requiring a more flexible interface than this.

I'm not sure this doesn't handle that case?

Here we are talking about having DEVICE_PRIVATE struct page
mappings. On a GPU this should represent GPU local memory that is
non-coherent with the CPU, and not mapped into the CPU.

This series supports three cases (rough sketch below):

 1) pgmap->owner == range->dev_private_owner
    This is "driver private fast interconnect" in this case HMM should
    immediately return the page. The calling driver understands the
    private parts of the pgmap and computes the private interconnect
    address.

    This requires organizing your driver so that all private
    interconnect has the same pgmap->owner.

 2) The page is DEVICE_PRIVATE and get_dma_pfn_for_device() exists.
    The exporting driver has the option to return a P2P struct page
    that can be used for PCI P2P without any migration. In a PCI GPU
    context this means the GPU has mapped its local memory to a PCI
    address. The assumption is that P2P always works and so this
    address can be DMA'd from.

 3) Migrate back to CPU memory - then everything works.

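To make the dispatch concrete, here is a rough sketch of the three cases
above. This is illustrative only, not the series code; the exact
get_dma_pfn_for_device() signature and the page->pgmap access are
assumptions on my part:

#include <linux/hmm.h>
#include <linux/memremap.h>

/* Illustrative sketch of the three-case dispatch, not the actual patch */
static unsigned long hmm_handle_device_private(struct hmm_range *range,
					       struct page *page)
{
	struct dev_pagemap *pgmap = page->pgmap;

	/* Case 1: caller owns the private interconnect, return the page */
	if (pgmap->owner == range->dev_private_owner)
		return page_to_pfn(page);

	/* Case 2: exporter provides a P2P alias PFN of its local memory */
	if (pgmap->ops->get_dma_pfn_for_device)
		return pgmap->ops->get_dma_pfn_for_device(page);

	/* Case 3: neither applies - migrate the data back to CPU memory */
	return 0;
}
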
Is that not enough? Where do you want something different?

> > > but leaves any dma- mapping or pfn mangling to be done after the
> > > call to hmm_range_fault(), since hmm_range_fault() really only
> > > needs
> > > to know whether it has to migrate to system or not.
> > 
> > See above, this is already the case..
> 
> Well what I meant was at hmm_range_fault() time only consider whether
> to migrate or not. Afterwards at dma-mapping time you'd expose the
> alternative pfns that could be used for dma-mapping.

That sounds like you are talking about multipath; we are not really
ready to tackle general multipath yet at the DMA API level, IMHO.

If you are just talking about your private multi-path, then that is
already handled..

> We were actually looking at a solution where the pagemap implements
> something along
> 
> bool devmem_allowed(pagemap, client); //for hmm_range_fault
> 
> plus dma_map() and dma_unmap() methods.

This sounds like dmabuf philosophy, and I don't think we should go in
this direction. The hmm caller should always be responsible for dma
mapping and we need to improve the DMA API to make this work better,
not build side hacks like this.

You can read my feelings and reasoning on this topic within this huge thread:

https://lore.kernel.org/dri-devel/20250108132358.GP5556@xxxxxxxxxx/

> In this way you'd don't need to expose special p2p dma pages and the

Removing the "special p2p dma pages" has to be done by improving the
DMA API to understand how to map physical addresses without struct
page. We are working toward this, slowly.

pgmap->ops->dma_map/unmap() ideas just repeat the DMABUF mistake
of mis-using the DMA API for P2P cases. Today you cannot correctly DMA
map P2P memory without the struct page.

> interface could also handle driver-private interconnects, where
> dma_maps and dma_unmap() methods become trivial.

We already handle private interconnect.

> > > One benefit of using this alternative
> > > approach is that struct hmm_range can be subclassed by the caller
> > > and
> > > for example cache device pairs for which p2p is allowed.
> > 
> > If you want to directly address P2P non-uniformity I'd rather do it
> > directly in the core code than using a per-driver callback. Every
> > driver needs exactly the same logic for such a case.
> 
> Yeah, and that would look something like the above

No, it would look like the core HMM code calling pci_p2pdma_distance()
on the P2P page returned from get_dma_pfn_for_device(), and if P2P is
impossible then proceeding to option #3, fault to CPU.
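
Very roughly, and only as an illustration (how the providing pci_dev is
looked up from the P2P page is an assumption here, not something the
series defines):

#include <linux/pci-p2pdma.h>

/* Sketch of core HMM probing P2P reachability before using the alias */
static bool hmm_p2p_usable(struct pci_dev *provider, struct device *client)
{
	/* A negative distance means the fabric cannot route P2P traffic */
	return pci_p2pdma_distance(provider, client, true) >= 0;
}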

> although initially we intended to keep these methods in drm
> allocator around its pagemaps, but could of course look into doing
> this directly in dev_pagemap ops.   But still would probably need
> some guidance into what's considered acceptable, and I don't think
> the solution proposed in this patch meets our needs.

I'm still not sure what you are actually trying to achieve?

Jason



