On Wed, Apr 03, 2024 at 04:06:11PM +0200, Christian König wrote:

[UGH html emails, try to avoid those, they don't get archived!]

> The problem with that isn't the software but the hardware.
> At least on the AMD GPUs and Intels Xe accelerators we have seen so far
> page faults are not fast enough to actually work with the semantics the
> Linux kernel uses for struct pages.
> That's why for example the SVM implementation really suck with fork(),
> the transparent huge page deamon and NUMA migrations.
> Somebody should probably sit down and write a performance measurement
> tool for page faults so that we can start to compare vendors regarding
> this.

Yes, all these page fault implementations I've seen are really slow.
Even SVA/PRI is really slow. The only way it works usefully today is
for the application/userspace environment to co-operate and avoid
causing faults.

Until someone invents a faster PRI interface this is what we have. It
is limited but still useful.

> The problem is the DMA API currently has no idea of inter device
> connectors like XGMI.
> So it can create P2P mappings for PCIe, but anything which isn't part
> of those interconnects is ignore at the moment as far as I can see.

Speaking broadly - a "multi-path" device is one that has multiple DMA
initiators and thus multiple paths the DMA can travel. The different
paths may have different properties, like avoiding the iommu or what
not. This might be a private hidden bus (XGMI/nvlink/etc) in a GPU
complex, or just two PCI end ports on the same chip, like a socket
direct mlx5 device.

The device HW itself must have a way to select which path each DMA
goes through, because the paths are going to have different address
spaces. A multi-path PCI device will have different PCI RIDs and thus
different iommu_domains/IO pagetables/IOVAs, for instance. A GPU will
alias its internal memory with the PCI IOMMU IOVA.

So, in the case of something like a GPU I expect the private PTE
itself to have bit(s) indicating if the address is PCI, local memory
or internal interconnect. When hmm_range_fault() encounters a
DEVICE_PRIVATE page the GPU driver must make a decision on how to set
that bit.

My advice would be to organize the GPU driver so that
"dev_private_owner" is the same value for all GPUs that share a
private address space. IOW dev_private_owner represents the physical
*address space* that the DEVICE_PRIVATE's hidden address lives in, not
the owning HW. Perhaps we will want to improve on this by adding an
explicit address-space void * private data to the pgmap as well.

When set up like this, hmm_range_fault() will naturally return
DEVICE_PRIVATE pages that map to an address space the requesting GPU
can reach, and the driver can trivially set the PTE bit. Easy. No DMA
API fussing needed.

Otherwise hmm_range_fault() returns the CPU/P2P page. The GPU should
select the PCI path and the DMA API will check the PCI topology and
generate a correct PCI address.

Whether the device driver needs/wants to create driver core buses and
devices to help it model and discover the dev_private_owner groups, I
don't know. Clearly the driver must be able to do this grouping to
make it work, and all this setup is just done when creating the pgmap.

I don't think the DMA API should become involved here. The layering
in a multi-path scenario should have the DMA API caller decide on the
path, and then the DMA API maps for that specific path. The caller
needs to expressly opt into this because there is additional HW - the
multi-path selector - that needs to be programmed, and the DMA API
cannot make that transparent.
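As a rough sketch of the hmm_range_fault() handling described above:
my_gpu, my_gpu_write_pte(), my_gpu_private_addr() and the MY_GPU_PTE_*
bits are invented driver-side names, only hmm_pfn_to_page(),
is_device_private_page() and the DMA API calls are existing kernel
interfaces.

/*
 * Sketch only.  Assumes all GPUs that share one private interconnect
 * registered their pgmaps with the same owner, and that
 * hmm_range_fault() was called with range->dev_private_owner set to
 * that value.
 */
#include <linux/hmm.h>
#include <linux/memremap.h>
#include <linux/dma-mapping.h>

static int my_gpu_map_one(struct my_gpu *gpu, unsigned long va,
                          unsigned long hmm_pfn)
{
        struct page *page = hmm_pfn_to_page(hmm_pfn);
        dma_addr_t dma;

        /*
         * hmm_range_fault() only hands back DEVICE_PRIVATE pages whose
         * pgmap owner matches range->dev_private_owner, so a match
         * here means the page lives in our private address space and
         * we take the internal interconnect path.  No DMA API
         * involvement.
         */
        if (is_device_private_page(page) &&
            page->pgmap->owner == gpu->addr_space_owner) {
                my_gpu_write_pte(gpu, va,
                                 my_gpu_private_addr(gpu, page) |
                                 MY_GPU_PTE_INTERNAL);
                return 0;
        }

        /*
         * Otherwise it is ordinary CPU memory (or, with more care than
         * shown here, a P2P page): select the PCI path and let the DMA
         * API produce the right bus address for this device.
         */
        dma = dma_map_page(gpu->dev, page, 0, PAGE_SIZE,
                           DMA_BIDIRECTIONAL);
        if (dma_mapping_error(gpu->dev, dma))
                return -EIO;
        my_gpu_write_pte(gpu, va, dma | MY_GPU_PTE_PCI);
        return 0;
}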
A similar approach works for P2P pages as well: the driver can inspect
the pgmap owner, and similarly check the pgmap private data to learn
the address space and the internal address, then decide to choose the
non-PCI path.

This scales to a world without P2P struct pages, because we will still
have some kind of 'pgmap'-like structure that holds metadata for a
uniform chunk of MMIO.

Jason
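To make that last point concrete, here is a sketch of what such a
pgmap-like descriptor for a uniform chunk of MMIO could carry; the
structure and all field names below are purely illustrative, nothing
like this exists in the tree today.

/*
 * Hypothetical sketch only.  The idea is that the chunk descriptor
 * carries enough metadata for a peer driver to (a) recognize that the
 * MMIO lives in an address space it can reach over a private
 * interconnect and (b) compute the internal address, falling back to
 * the PCI/DMA API path otherwise.
 */
struct mmio_chunk {
        void *addr_space;       /* shared by every device on one
                                 * XGMI/nvlink fabric; plays the role
                                 * dev_private_owner plays today */
        u64 internal_base;      /* base of the chunk inside that
                                 * private address space */
        phys_addr_t pci_base;   /* CPU/PCI view, for the DMA API path */
        u64 size;
};

/* Peer driver deciding which path to program into its PTE: */
static bool my_gpu_use_internal_path(struct my_gpu *gpu,
                                     const struct mmio_chunk *chunk,
                                     u64 offset, u64 *internal_addr)
{
        if (chunk->addr_space != gpu->addr_space_owner)
                return false;   /* not privately reachable, use PCI */
        *internal_addr = chunk->internal_base + offset;
        return true;
}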