On Wed, Jan 15, 2025 at 10:32:34AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 15, 2025 at 09:55:29AM +0100, Simona Vetter wrote:
> > I think for 90% of exporters pfn would fit, but there's some really funny
> > ones where you cannot get a cpu pfn by design. So we need to keep the
> > pfn-less interfaces around. But ideally for the pfn-capable exporters we'd
> > have helpers/common code that just implements all the other interfaces.
>
> There is no way to have dma address without a PFN in Linux right now.
> How would you generate them? That implies you have an IOMMU that can
> generate IOVAs for something that doesn't have a physical address at
> all.
>
> Or do you mean some that don't have pages associated with them, and
> thus have pfn_valid fail on them? They still have a PFN, just not
> one that is valid to use in most of the Linux MM.

He is talking about private interconnect hidden inside clusters of
devices. Ie the system may have many GPUs and those GPUs have their own
private interconnect between them. It is not PCI, and packets don't
transit through the CPU SOC at all, so the IOMMU is not involved.

DMA can happen on that private interconnect, but from a Linux
perspective it is not DMA API DMA, and the addresses used to describe
it are not part of the CPU address space. The initiating device will
have a way to choose which path the DMA goes through when setting up
the DMA.

Effectively, if you look at one of these complex GPU systems you will
have a physical bit of memory, say HBM memory located on the GPU. Then
from an OS perspective we have a whole bunch of different
representations/addresses of that very same memory. A Grace/Hopper
system would have at least three different addresses (ZONE_MOVABLE, a
PCI MMIO aperture, and a global NVLink address). Each different address
effectively represents a different physical interconnect multipath, and
an initiator may have three different routes/addresses available to
reach the same physical target memory.

Part of what DMABUF needs to do is pick which multipath will be used
between exporter/importer.

So, the hack today has the DMABUF exporter GPU driver understand that
the importer is part of the private interconnect and then generate a
scatterlist with a NULL sg_page, but an sg_dma_addr that encodes the
private global address on the hidden interconnect. Somehow the importer
knows this has happened and programs its HW to use the private path.
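Very roughly, the exporter side of that hack has the shape of the sketch
below. The fake_priv_* names and the priv layout are made up purely for
illustration, not any real driver's code; the only point is the NULL
sg_page plus the raw interconnect address stuffed into sg_dma_address():

#include <linux/dma-buf.h>
#include <linux/err.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>

/*
 * Hypothetical exporter-private state; stands in for however the GPU
 * driver actually tracks its HBM allocation.
 */
struct fake_priv_mem {
	dma_addr_t private_global_addr;	/* address on the hidden interconnect */
	unsigned int size;
};

static struct sg_table *
fake_priv_map_dma_buf(struct dma_buf_attachment *attach,
		      enum dma_data_direction dir)
{
	struct fake_priv_mem *mem = attach->dmabuf->priv;
	struct sg_table *sgt;

	sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
	if (!sgt)
		return ERR_PTR(-ENOMEM);

	if (sg_alloc_table(sgt, 1, GFP_KERNEL)) {
		kfree(sgt);
		return ERR_PTR(-ENOMEM);
	}

	/*
	 * No struct page backs this entry, and the "dma address" never
	 * went near the DMA API or an IOMMU; it is only meaningful to a
	 * peer wired to the same private interconnect.
	 */
	sg_set_page(sgt->sgl, NULL, mem->size, 0);
	sg_dma_address(sgt->sgl) = mem->private_global_addr;
	sg_dma_len(sgt->sgl) = mem->size;

	return sgt;
}

The importer then has to know, through some out-of-band agreement with
this particular exporter, that sg_dma_address() here is not a DMA API
address at all and program its HW to route through the private path.

Jason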