On Wed, Apr 03, 2024 at 04:06:11PM +0200, Christian König wrote:

[UGH html emails, try to avoid those, they don't get archived!]

> The problem with that isn't the software but the hardware.
> At least on the AMD GPUs and Intels Xe accelerators we have seen so far
> page faults are not fast enough to actually work with the semantics the
> Linux kernel uses for struct pages.
> That's why for example the SVM implementation really suck with fork(),
> the transparent huge page deamon and NUMA migrations.
> Somebody should probably sit down and write a performance measurement
> tool for page faults so that we can start to compare vendors regarding
> this.

Yes, all these page fault implementations I've seen are really slow.
Even SVA/PRI is really slow. The only way it works usefully today is
for the application/userspace environment to co-operate and avoid
causing faults.

Until someone invents a faster PRI interface this is what we have. It
is limited but still useful.

> The problem is the DMA API currently has no idea of inter device
> connectors like XGMI.
> So it can create P2P mappings for PCIe, but anything which isn't part
> of those interconnects is ignore at the moment as far as I can see.

Speaking broadly - a "multi-path" device is one that has multiple DMA
initiators and thus multiple paths the DMA can travel. The different
paths may have different properties, like avoiding the iommu or what
not. This might be a private hidden bus (XGMI/nvlink/etc) in a GPU
complex, or just two PCI end ports on the same chip, like a socket
direct mlx5 device.

The device HW itself must have a way to select which path each DMA
goes through, because the paths are going to have different address
spaces. A multi-path PCI device will have different PCI RIDs and thus
different iommu_domains/IO pagetables/IOVAs, for instance. A GPU will
alias its internal memory with the PCI IOMMU IOVA.

So, in the case of something like a GPU I expect the private PTE
itself to have bit(s) indicating if the address is PCI, local memory
or internal interconnect. When hmm_range_fault() encounters a
DEVICE_PRIVATE page the GPU driver must make a decision on how to set
that bit.

My advice would be to organize the GPU driver so that
"dev_private_owner" is the same value for all GPUs that share a
private address space. IOW dev_private_owner represents the physical
*address space* that the DEVICE_PRIVATE's hidden address lives in, not
the owning HW. Perhaps we will want to improve on this by adding an
explicit address-space void * private data to the pgmap as well.

When set up like this, hmm_range_fault() will naturally return
DEVICE_PRIVATE pages that map to an address space the requesting GPU
can reach, and the driver can trivially set the PTE bit. Easy. No DMA
API fussing needed.

Otherwise hmm_range_fault() returns the CPU/P2P page. The GPU should
select the PCI path and the DMA API will check the PCI topology and
generate a correct PCI address.

Whether the device driver needs/wants to create driver core buses and
devices to help it model and discover the dev_private_owner groups, I
don't know. Clearly the driver must be able to do this grouping to
make it work, and all this setup is just done when creating the pgmap.

I don't think the DMA API should become involved here. The layering
in a multi-path scenario should have the DMA API caller decide on the
path, and then the DMA API maps for that specific path. The caller
needs to expressly opt into this because there is additional HW - the
multi-path selector - that needs to be programmed, and the DMA API
cannot make that transparent.
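As a rough sketch of the hmm_range_fault() handling described above:
my_gpu, my_gpu_write_pte(), my_gpu_private_addr() and the MY_GPU_PTE_*
bits are invented driver-side names, only hmm_pfn_to_page(),
is_device_private_page() and the DMA API calls are existing kernel
interfaces.

/*
 * Sketch only.  Assumes all GPUs that share one private interconnect
 * registered their pgmaps with the same owner, and that
 * hmm_range_fault() was called with range->dev_private_owner set to
 * that value.
 */
#include <linux/hmm.h>
#include <linux/memremap.h>
#include <linux/dma-mapping.h>

static int my_gpu_map_one(struct my_gpu *gpu, unsigned long va,
                          unsigned long hmm_pfn)
{
        struct page *page = hmm_pfn_to_page(hmm_pfn);
        dma_addr_t dma;

        /*
         * hmm_range_fault() only hands back DEVICE_PRIVATE pages whose
         * pgmap owner matches range->dev_private_owner, so a match
         * here means the page lives in our private address space and
         * we take the internal interconnect path.  No DMA API
         * involvement.
         */
        if (is_device_private_page(page) &&
            page->pgmap->owner == gpu->addr_space_owner) {
                my_gpu_write_pte(gpu, va,
                                 my_gpu_private_addr(gpu, page) |
                                 MY_GPU_PTE_INTERNAL);
                return 0;
        }

        /*
         * Otherwise it is ordinary CPU memory (or, with more care than
         * shown here, a P2P page): select the PCI path and let the DMA
         * API produce the right bus address for this device.
         */
        dma = dma_map_page(gpu->dev, page, 0, PAGE_SIZE,
                           DMA_BIDIRECTIONAL);
        if (dma_mapping_error(gpu->dev, dma))
                return -EIO;
        my_gpu_write_pte(gpu, va, dma | MY_GPU_PTE_PCI);
        return 0;
}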
A similar approach works for P2P pages as well: the driver can inspect
the pgmap owner, and similarly check the pgmap private data to learn
the address space and the internal address, then decide to choose the
non-PCI path.

This scales to a world without P2P struct pages, because we will still
have some kind of 'pgmap'-like structure that holds metadata for a
uniform chunk of MMIO.

Jason
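To make that last point concrete, here is a sketch of what such a
pgmap-like descriptor for a uniform chunk of MMIO could carry; the
structure and all field names below are purely illustrative, nothing
like this exists in the tree today.

/*
 * Hypothetical sketch only.  The idea is that the chunk descriptor
 * carries enough metadata for a peer driver to (a) recognize that the
 * MMIO lives in an address space it can reach over a private
 * interconnect and (b) compute the internal address, falling back to
 * the PCI/DMA API path otherwise.
 */
struct mmio_chunk {
        void *addr_space;       /* shared by every device on one
                                 * XGMI/nvlink fabric; plays the role
                                 * dev_private_owner plays today */
        u64 internal_base;      /* base of the chunk inside that
                                 * private address space */
        phys_addr_t pci_base;   /* CPU/PCI view, for the DMA API path */
        u64 size;
};

/* Peer driver deciding which path to program into its PTE: */
static bool my_gpu_use_internal_path(struct my_gpu *gpu,
                                     const struct mmio_chunk *chunk,
                                     u64 offset, u64 *internal_addr)
{
        if (chunk->addr_space != gpu->addr_space_owner)
                return false;   /* not privately reachable, use PCI */
        *internal_addr = chunk->internal_base + offset;
        return true;
}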