On Wed, Sep 07, 2022 at 08:32:23AM -0700, Christoph Hellwig wrote:
> On Wed, Sep 07, 2022 at 12:23:28PM -0300, Jason Gunthorpe wrote:
> > 2) DMABUF abuses dma_map_resource() for P2P and thus doesn't work in
> >    certain special cases.
>
> Not just certain special cases, but one of the main use cases.
> Basically P2P can happen in two ways:
>
>  a) through a PCIe switch, or
>  b) through connected root ports

Yes, we tested both, both work.

> The open code version here only supports a), and only supports it when
> there is no offset between the 'physical' address of the BAR as seen by
> the PCIe endpoint and the address Linux uses. x86 usually (always?)
> doesn't have an offset there, but other architectures often do.

The PCI offset is some embedded thing - I've never seen it in a server
platform. Frankly, it is just bad SOC design and there is good reason
why a non-zero offset needs to be avoided. As soon as you create
aliases between the address spaces you invite trouble.

IIRC a SOC I used once put the memory at 0 -> 4G and then put the only
PCI aperture at 4G -> 4G+N. However, this design requires 64 bit PCI
support, which at the time the platform didn't have. So they used the
PCI offset to hackily alias the aperture over the DDR. I don't
remember if they threw out a bit of DDR to resolve the alias, or if
they just didn't support PCI switches.

In any case, it is a complete mess. You either drastically limit your
BAR size, don't support PCI switches, or lose a lot of DDR.

I also seem to remember that the IOMMU and the PCI offset don't play
nicely together - so for the VFIO use case, where the IOMMU is
present, I'm pretty sure we can very safely assume a 0 offset. That
seems confirmed by the fact that VFIO has never handled the PCI offset
in its own P2P path, and P2P works fine in VMs across a wide range of
platforms.

That said, I agree we should really have APIs that support this
properly, and dma_map_resource() is certainly technically wrong.

So, would you be OK with this series if I try to make a dma_map_p2p()
that resolves the offset issue?

> Last but not least I don't really see how the code would even work
> when an IOMMU is used, as dma_map_resource will return an IOVA that
> is only understood by the IOMMU itself, and not the other endpoint.

I don't understand this. __iommu_dma_map() will put the given phys
into the iommu_domain associated with 'dev' and return the IOVA it
picked.

Here 'dev' is the importing device, i.e. the device that will issue
the DMA:

+	dma_addr = dma_map_resource(
+		attachment->dev,
+		pci_resource_start(priv->vdev->pdev, priv->index) +
+			priv->offset,
+		priv->dmabuf->size, dir, DMA_ATTR_SKIP_CPU_SYNC);

e.g. attachment->dev is the PCI device of the RDMA device, not the
VFIO device.

'phys' is the CPU physical address of the PCI BAR page, which with a 0
PCI offset is the right thing to program into the IO page table.

> How was this code even tested?

It was tested on a few platforms. Like I said above, the cases where
it doesn't work are special, largely embedded, and not anything we
have in our labs - AFAIK.

Jason
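
For reference, a minimal sketch of the importer-side map/unmap pairing
being described above, assuming a 0 PCI offset. The my_priv layout and
the my_map_bar()/my_unmap_bar() names are invented for illustration;
dma_map_resource(), dma_unmap_resource(), dma_mapping_error() and
pci_resource_start() are the existing kernel APIs:

#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>
#include <linux/pci.h>

/* Made-up private data describing the exported BAR slice */
struct my_priv {
	struct pci_dev *pdev;	/* exporting (VFIO) PCI device */
	int index;		/* BAR index */
	u64 offset;		/* offset into the BAR */
	size_t size;		/* length of the exported range */
};

static dma_addr_t my_map_bar(struct dma_buf_attachment *attach,
			     struct my_priv *priv,
			     enum dma_data_direction dir)
{
	/* CPU physical address of the exported BAR range */
	phys_addr_t phys = pci_resource_start(priv->pdev, priv->index) +
			   priv->offset;

	/*
	 * attach->dev is the importing device (eg the RDMA HCA), so the
	 * returned IOVA comes out of *its* iommu_domain and is what the
	 * importer programs into its DMA engine.
	 */
	return dma_map_resource(attach->dev, phys, priv->size, dir,
				DMA_ATTR_SKIP_CPU_SYNC);
}

static void my_unmap_bar(struct dma_buf_attachment *attach,
			 struct my_priv *priv, dma_addr_t dma_addr,
			 enum dma_data_direction dir)
{
	/* Only tear down mappings that were actually established */
	if (!dma_mapping_error(attach->dev, dma_addr))
		dma_unmap_resource(attach->dev, dma_addr, priv->size, dir,
				   DMA_ATTR_SKIP_CPU_SYNC);
}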