Hello! You asked me in another email for comments on this email, so I'm
replying directly to it...

On Tuesday 04 January 2022 10:02:18 Stefan Roese wrote:
> Hi,
>
> I'm trying to get the Kernel PCIe AER infrastructure to work on my
> ZynqMP based system. E.g. handle the events (correctable, uncorrectable
> etc). In my current tests, no AER interrupt is generated though. I'm
> currently using the "surprise down error status" in the uncorrectable
> error status register of the connected PCIe switch (PLX / Broadcom
> PEX8718). Here the bit is correctly logged in the PEX switch
> uncorrectable error status register but no interrupt is generated
> to the root-port / system. And hence no AER message(s) reported.
>
> Does any one of you have some ideas on what might be missing? Why are
> these events not reported to the PCIe rootport driver via IRQ? Might
> this be a problem of the missing MSI-X support of the ZynqMP? The AER
> interrupt is connected as legacy IRQ:
>
> cat /proc/interrupts | grep -i aer
> 58: 0 0 0 0 nwl_pcie:legacy 0 Level
> PCIe PME, aerdrv

Error events (correctable, non-fatal and fatal) are reported by PCIe
devices to the Root Complex via PCIe error messages (the Message Code of
the TLP is set to Error Message) and not via interrupts. The Root Port
is then responsible for "converting" these PCIe error messages into an
MSI(X) interrupt and reporting it to the system. According to the PCIe
spec, AER is supported only via MSI(X) interrupts, not legacy INTx.

Via the Bridge Control register (SERR# enable bit) on the Root Port it
is possible to enable / disable reporting of these errors from PCIe
devices on the other end of the PCIe link to the system. Then via the
Command register (SERR# enable bit) and the Device Control register it
is possible to enable / disable reporting of all errors (from the Root
Port and also from devices on the other end of the link). And via the
AER registers on the Root Port it is also possible to disable generating
MSI(X) interrupts when an error is reported.

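
The chain of enable bits on the Root Port can be sketched as a small
check. This is only a minimal Python sketch, not what the kernel AER
driver does: blocked_error_paths() is a hypothetical helper, the
register values are assumed to have been read from the Root Port's
config space by some other means (for example with lspci or setpci), and
the bit positions are the standard ones, matching the PCI_* constants in
Linux's include/uapi/linux/pci_regs.h.

```python
# Standard PCI / PCIe bit positions (same values as the PCI_* constants
# in Linux's include/uapi/linux/pci_regs.h).
PCI_COMMAND_SERR = 0x0100              # Command: SERR# Enable
PCI_BRIDGE_CTL_SERR = 0x0002           # Bridge Control: SERR# Enable
PCI_EXP_DEVCTL_CERE = 0x0001           # Device Control: Correctable Error Reporting Enable
PCI_EXP_DEVCTL_NFERE = 0x0002          # Device Control: Non-Fatal Error Reporting Enable
PCI_EXP_DEVCTL_FERE = 0x0004           # Device Control: Fatal Error Reporting Enable
PCI_ERR_ROOT_CMD_COR_EN = 0x1          # AER Root Error Command: Correctable
PCI_ERR_ROOT_CMD_NONFATAL_EN = 0x2     # AER Root Error Command: Non-Fatal
PCI_ERR_ROOT_CMD_FATAL_EN = 0x4        # AER Root Error Command: Fatal


def blocked_error_paths(command, bridge_ctl, dev_ctl, root_err_cmd):
    """Hypothetical helper: list the Root Port enable bits that are clear.

    Each argument is the raw value of one Root Port register. An empty
    result only means these particular bits are set; it does not prove
    that the hardware really delivers an interrupt.
    """
    checks = [
        ("Bridge Control: SERR# Enable", bridge_ctl, PCI_BRIDGE_CTL_SERR),
        ("Command: SERR# Enable", command, PCI_COMMAND_SERR),
        ("Device Control: Correctable Error Reporting Enable",
         dev_ctl, PCI_EXP_DEVCTL_CERE),
        ("Device Control: Non-Fatal Error Reporting Enable",
         dev_ctl, PCI_EXP_DEVCTL_NFERE),
        ("Device Control: Fatal Error Reporting Enable",
         dev_ctl, PCI_EXP_DEVCTL_FERE),
        ("Root Error Command: Correctable Error Reporting Enable",
         root_err_cmd, PCI_ERR_ROOT_CMD_COR_EN),
        ("Root Error Command: Non-Fatal Error Reporting Enable",
         root_err_cmd, PCI_ERR_ROOT_CMD_NONFATAL_EN),
        ("Root Error Command: Fatal Error Reporting Enable",
         root_err_cmd, PCI_ERR_ROOT_CMD_FATAL_EN),
    ]
    # Keep only the names of the bits that are not set.
    return [name for name, value, bit in checks if not value & bit]


if __name__ == "__main__":
    # Made-up example values, not taken from any real ZynqMP system.
    for reason in blocked_error_paths(command=0x0006, bridge_ctl=0x0002,
                                      dev_ctl=0x0007, root_err_cmd=0x7):
        print("disabled:", reason)
```

Anything this returns is a place where an error message from the switch
could be dropped on the Root Port before it ever becomes an interrupt.
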
And IIRC via PCIe Downstream Port Containment there is also a way to
"mask" reporting of error events. But I do not have PCIe devices with
DPC support, so I have not played with it yet.

So there are many places where an error event can be stopped. But the
important thing is that the kernel AER driver should correctly enable
all required bits to start receiving MSI(X) interrupts for error events.

On other devices I'm seeing the following issues... Root Ports are not
compliant with the PCIe spec and do not implement error reporting at
all. Or they do not implement those enable/disable bits correctly. Or
they do not implement proper support for extended PCIe config space for
the Root Port (AER is in extended space). Or they report error events
via custom proprietary interrupts and not via MSI(X) as required by the
PCIe spec. This is the case for (all?) Marvell PCIe controllers and I
saw here on the linux-pci list that it also applies to PCIe controllers
from some other vendors. Also, drivers for Marvell PCIe controllers
require additional code to access the extended PCIe config space of the
Root Port (accessing the config space of PCIe devices on the other end
of the PCIe link works fine).

So the first suspicious thing is why the kernel AER driver is using a
legacy shared INTx interrupt, as in most cases a Root Port would not
report any error event via INTx.

And the second thing: try to look into the documentation for the PCIe
controller in use, just in case the vendor "invented" some proprietary
and non-compliant way to report error / AER events to the OS...

I have seen more issues with PCIe controllers than with PCIe switches,
so in my opinion the issue would be either in the controller driver or
in the controller hw itself. And if you see the event status logged in
the PCIe switch register, I would expect that the switch correctly sent
a PCIe Error message to the Root Complex.

> BTW: This was tested on v5.10 and recent v5.16-rc6.
>
> Thanks,
> Stefan