On Thu, Jun 8, 2023 at 10:52 AM Ashok Raj <ashok.raj@xxxxxxxxx> wrote:
>
> On Thu, Jun 08, 2023 at 10:10:54AM -0700, Alexander Duyck wrote:
> > On Thu, Jun 8, 2023 at 8:40 AM Ashok Raj <ashok_raj@xxxxxxxxxxxxxxx> wrote:
> > >
> > > On Thu, Jun 08, 2023 at 07:33:31AM -0700, Alexander Duyck wrote:
> > > > On Wed, Jun 7, 2023 at 8:05 PM Baolu Lu <baolu.lu@xxxxxxxxxxxxxxx> wrote:
> > > > >
> > > > > On 6/8/23 7:03 AM, Alexander Duyck wrote:
> > > > > > On Wed, Jun 7, 2023 at 3:40 PM Alexander Duyck
> > > > > > <alexander.duyck@xxxxxxxxx> wrote:
> > > > > >>
> > > > > >> I am running into a DMA issue that appears to be a conflict between
> > > > > >> ACS and IOMMU. As per the documentation I can find, the IOMMU is
> > > > > >> supposed to create reserved regions for MSI and the memory window
> > > > > >> behind the root port. However looking at reserved_regions I am not
> > > > > >> seeing that. I only see the reservation for the MSI.
> > > > > >>
> > > > > >> So for example with an enabled NIC and iommu enabled w/o passthru I am seeing:
> > > > > >> # cat /sys/bus/pci/devices/0000\:83\:00.0/iommu_group/reserved_regions
> > > > > >> 0x00000000fee00000 0x00000000feefffff msi
> > > > > >>
> > > > > >> Shouldn't there also be a memory window for the region behind the root
> > > > > >> port to prevent any possible peer-to-peer access?
> > > > > >
> > > > > > Since the iommu portion of the email bounced I figured I would fix
> > > > > > that and provide some additional info.
> > > > > >
> > > > > > I added some instrumentation to the kernel to dump the resources found
> > > > > > in iova_reserve_pci_windows. From what I can tell it is finding the
> > > > > > correct resources for the Memory and Prefetchable regions behind the
> > > > > > root port. It seems to be calling reserve_iova which is successfully
> > > > > > allocating an iova to reserve the region.
> > > > > >
> > > > > > However still no luck on why it isn't showing up in reserved_regions.
> > > > >
> > > > > Perhaps I can ask the opposite question, why it should show up in
> > > > > reserve_regions? Why does the iommu subsystem block any possible peer-
> > > > > to-peer DMA access? Isn't that a decision of the device driver.
> > > > >
> > > > > The iova_reserve_pci_windows() you've seen is for kernel DMA interfaces
> > > > > which is not related to peer-to-peer accesses.
> > > >
> > > > The problem is if the IOVA overlaps with the physical addresses of
> > > > other devices that can be routed to via ACS redirect. As such if ACS
> > > > redirect is enabled a host IOVA could be directed to another device on
> > > > the switch instead. To prevent that we need to reserve those addresses
> > > > to avoid address space collisions.
> >
> > Our test case is just to perform DMA to/from the host on one device on
> > a switch and what we are seeing is that when we hit an IOVA that
> > matches up with the physical address of the neighboring devices BAR0
> > then we are seeing an AER followed by a hot reset.
>
> ACS is always confusing.. Does your NIC have a DTLB?

No. It is using the IOMMU for all address translation. I am also
pushing back on the test being used as well. It is always possible
they have implemented something incorrectly and are overrunning a
buffer going into the reserved IOVA region and the overlap is just a
coincidence.

> If request redirect is set, and the Egress is enabled, then all
> transactions should go upstream to the root-port->IOMMU before being
> served.
>
> In my 6.0 spec its in 6.12.3 ACS Peer-to-Peer Control Interactions?
>
> And maybe lspci would show how things are setup in the switch?

We were setting the Redirect Request only, no Egress. I agree, based
on the config everything should just go upstream. However if we
eliminate the switch or put things in passthrough mode the problem
goes away.
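
For readers following the "Redirect Request only, no Egress" description, here is a minimal sketch of decoding an ACS Control register value into its enabled bits. The bit values follow the PCI_ACS_* definitions in Linux's include/uapi/linux/pci_regs.h; the function names and the sample value 0x0004 are assumptions made for illustration, not values read from the switch discussed in this thread.

```python
# ACS Control register bits, matching the PCI_ACS_* values in Linux's
# include/uapi/linux/pci_regs.h. Short names are the usual abbreviations.
ACS_CTRL_BITS = {
    0x0001: "SV",  # Source Validation
    0x0002: "TB",  # Translation Blocking
    0x0004: "RR",  # P2P Request Redirect
    0x0008: "CR",  # P2P Completion Redirect
    0x0010: "UF",  # Upstream Forwarding
    0x0020: "EC",  # P2P Egress Control
    0x0040: "DT",  # Direct Translated P2P
}


def decode_acs_ctrl(value):
    """Return the short names of the ACS control bits set in value."""
    return [name for bit, name in ACS_CTRL_BITS.items() if value & bit]


def is_request_redirect_only(value):
    """True when Request Redirect is set and Egress Control is clear."""
    return bool(value & 0x0004) and not (value & 0x0020)


# Hypothetical register value: only P2P Request Redirect enabled.
sample = 0x0004
print(decode_acs_ctrl(sample), is_request_redirect_only(sample))
```

Comparing the decoded bits on each downstream port of the switch is one way to confirm the configuration matches what was intended.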
> >
> > > Any untranslated address from a device must be forwarded to the IOMMU when
> > > ACS is enabled correct? I guess if you want true p2p, then you would need
> > > to map so that the hpa turns into the peer address.. but its always a round
> > > trip to IOMMU.
> >
> > This assumes all parts are doing the Request Redirect "correctly". In
> > our case there is a PCIe switch we are trying to debug and we have a
> > few working theories. One concern I have is that the switch may be
> > throwing an ACS violation for us using an address that matches a
> > neighboring device instead of redirecting it to the upstream port. If
> > we pull the switch and just run on the root complex the issue seems to
> > be resolved so I started poking into the code which led me to the
> > documentation pointing out what is supposed to be reserved based on
> > the root complex and MSI regions.
> >
> > As a part of going down that rabbit hole I realized that the
> > reserved_regions seems to only list the MSI reservation. However after
> > digging a bit deeper it seems like there is code to reserve the memory
> > behind the root complex in the IOVA but it doesn't look like that is
> > visible anywhere and is the piece I am currently trying to sort out.
> > What I am working on is trying to figure out if the system that is
> > failing is actually reserving that memory region in the IOVA, or if
> > that is somehow not happening in our test setup.
>
> I suspect with IOMMU, there is no need to pluck holes like we do for the
> MSI. In very early code in IOMMU i vaguely recall we did that, but our
> knowledge on ACS was weak. (not that has improved :-)).

The hole has to do mostly with avoiding any possibility of misrouting
things, or at least that was my understanding after reading it.

> Knowing how the switch and root ports are setup with forwarding may help
> with some clues.
> The easy option is maybe forcibly adding to the reserved
> range may help to see if you don't see the ACS violation.
>
> Baolu might have some better ideas.

I'm working with the team having the issue to try and verify that now.
In theory it should already be reserved so I am working with them to
check that.

Thanks,

- Alex
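
As an aid for the check described above, the following is a minimal sketch of parsing the reserved_regions output quoted earlier in the thread and testing whether an address range overlaps any listed region. The function names and the bridge window addresses are hypothetical; whether the window behind the root port is expected to appear in this file at all is the open question in this thread, so an empty result here does not by itself show that the IOVA range is unreserved.

```python
def parse_reserved_regions(text):
    """Parse lines of the form '<start> <end> <type>' into tuples."""
    regions = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        start, end, kind = fields
        # int(..., 16) accepts the leading '0x' used in the sysfs output.
        regions.append((int(start, 16), int(end, 16), kind))
    return regions


def overlapping(regions, start, end):
    """Return regions that intersect the inclusive range [start, end]."""
    return [r for r in regions if r[0] <= end and start <= r[1]]


# Output quoted in the thread for the NIC's IOMMU group.
sample = "0x00000000fee00000 0x00000000feefffff msi\n"
regions = parse_reserved_regions(sample)

# Hypothetical memory window behind the root port (assumed addresses).
window_start, window_end = 0xC0000000, 0xC0FFFFFF
print(overlapping(regions, window_start, window_end))
```

On a live system the text would come from reading the device's iommu_group/reserved_regions file, and the window bounds from the bridge's memory and prefetchable ranges.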