Bjorn, On 2/1/2023 5:44 AM, Bjorn Helgaas wrote: > On Tue, Jan 31, 2023 at 08:56:13PM +0800, Baolu Lu wrote: >> On 2023/1/31 2:38, Bjorn Helgaas wrote: >>>> PCI: Add translated request only flag for pci_enable_pasid() >>>> >>>> The PCIe fabric routes Memory Requests based on the TLP address, ignoring >>>> the PASID. In order to ensure system integrity, commit 201007ef707a ("PCI: >>>> Enable PASID only when ACS RR & UF enabled on upstream path") requires >>>> some ACS features being supported on device's upstream path when enabling >>>> PCI/PASID. > > Looking up 201007ef707a to see what ensuring system integrity means, > it prevents Memory Requests with PASID, which should always be routed > to the RC, from being mistakenly routed as peer-to-peer requests. > >>>> However, above change causes the Linux kernel boots to black screen on a >>>> system with below graphic device: >>> >>> We need a PCIe concept-level description of the issue first, i.e., in >>> terms of DMA, PASID, ACS, etc. Then we can mention the AMD GPU issue >>> as an instance. >> >> How about below description? > > Thanks, this is exactly the sort of thing I'm looking for. But my > understanding of ATS/PRI/PASID is weak, so I'm still working through > this. Tell me when I say something wrong below... > >> PCIe endpoints can use ATS to request DMA remapping hardware to >> translate an IOVA to its mapped physical address. If the translation is >> missing or the permissions are insufficient, the PRI is used to trigger >> an I/O page fault. The IOMMU driver will fill the mapping with desired >> permissions and return the translated address to the device. > > In PCIe spec language, I think you're saying that a PCIe Function may > contain an ATC. If the ATC Capability Enable bit is set, the Function > can issue Translation Requests. > > The TA (aka IOMMU) will respond with a Translation Completion. If the > Completion is a CplD, it contains the translated address and the > Function can store the entry in its ATC. I assume the I/O page fault > case corresponds to a Cpl (with no data) meaning that the TA could not > translate the address. > > If the TA doesn't have a mapping with the desired permissions, and the > Function's Page Request Capability Enable bit is set, it may issue a > Page Request Message. It's up to the TA/IOMMU to make this message > visible to the OS, which can make the page resident, create an IOMMU > mapping, and enable a PRG Response Message. After the Function > receives the PRG Response Message, it would issue another Translation > Request. > >> The translated address is specified by the IOMMU driver. The IOMMU >> driver ensures that the address is a DMA buffer address instead of any >> P2P address in the PCI fabric. Therefore, any translated memory request >> will eventually be routed to IOMMU regardless of whether there is ACS >> control in the up-streaming path. > > A Memory Request with an address that is not a P2P address, i.e., it > is not contained in any bridge aperture, will *always* be routed > toward the RC, won't it? Isn't that the case regardless of whether > the address is translated or untranslated, and even regardless of ACS? > > IIUC, ACS basically causes peer-to-peer requests to be routed upstream > instead of directly to the peer. > > OK, reading this again, I realize that I just restated exactly what > you had already written, sorry about that. > >> AMD GPU is one of those devices. > > I guess you mean the AMD GPU has ATS, PRI, and PASID Capabilities? > And furthermore, that the GPU *always* uses Translated addresses with > PASID? > > So I guess what's going on here is that if: > > - A device only uses PASID with Translated addresses, and > - those Translated addresses are never P2P addresses, then > - those transactions will always be routed to the RC. > > And this applies even if there is no ACS or ACS doesn't support > PCI_ACS_RR and PCI_ACS_UF. > > The black screen happens because ... ? > > What can we include in the commit log to help people find this fix? I > see these in the bugzilla: > > WARNING: CPU: 0 PID: 477 at drivers/pci/ats.c:251 pci_disable_pri+0x75/0x80 > WARNING: CPU: 0 PID: 477 at drivers/pci/ats.c:419 pci_disable_pasid+0x45/0x50 > > (These look like defects in pdev_pri_ats_enable(), so really just > distractions) Right. We have fixed error handling path in this function. Joerg has queued the fix. > > kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:9874 > kfd kfd: amdgpu: device 1002:9874 NOT added due to errors > BUG: kernel NULL pointer dereference, address: 0000000000000058 > RIP: 0010:report_iommu_fault+0x11/0x90 > > I couldn't figure out the NULL pointer dereference. I expected it to > be from a BUG() or similar in report_iommu_fault(), but I don't see > that. Its coming from below path : - During system boot IOMMU allocates default domain - AMD IOMMU v2 module (iommu_v2) created another domain and tried to attach devices to new domain. - In device attachment path (amd_iommu_attach_device()) it first detaches device from current domain and tries to attach device to new domain. Here device attachment failed as PASID enable check failed. - We didn't recover from above failure (I have proposed fix for this [1]). - So device to domain attachment is not in consistent state. - Device tried to do DMA and hit IO fault. Above NULL pointer derefence is coming from that path as dev to domain setup is not proper. [1] https://lore.kernel.org/linux-iommu/20230113135956.5788-1-vasant.hegde@xxxxxxx/T/#t -Vasant