[non-standard email quoting below, ">" for both my text and Hinko's, see https://en.wikipedia.org/wiki/Posting_style#Interleaved_style] On Wed, Dec 09, 2020 at 10:02:10AM +0000, Hinko Kocevar wrote: > ________________________________________ > From: Bjorn Helgaas <helgaas@xxxxxxxxxx> > Sent: Friday, December 4, 2020 11:38 PM > To: Hinko Kocevar > Cc: linux-pci@xxxxxxxxxxxxxxx > Subject: Re: Recovering from AER: Uncorrected (Fatal) error > > On Fri, Dec 04, 2020 at 12:52:18PM +0000, Hinko Kocevar wrote: > > Hi, > > > > I'm trying to figure out how to recover from Uncorrected (Fatal) error that is seen by the PCI root on a CPU PCIe controller: > > > > Dec 1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: id=0008 > > Dec 1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0008(Requester ID) > > Dec 1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: device [8086:1901] error status/mask=00004000/00000000 > > Dec 1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: [14] Completion Timeout (First) > > > > This is the complete PCI device tree that I'm working with: > > > > $ sudo /usr/local/bin/pcicrawler -t > > 00:01.0 root_port, "J6B2", slot 1, device present, speed 8GT/s, width x8 > > ├─01:00.0 upstream_port, PLX Technology, Inc. (10b5), device 8725 > > │ ├─02:01.0 downstream_port, slot 1, device present, power: Off, speed 8GT/s, width x4 > > │ │ └─03:00.0 upstream_port, PLX Technology, Inc. (10b5) PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (8748) > > │ │ ├─04:00.0 downstream_port, slot 10, power: Off > > │ │ ├─04:01.0 downstream_port, slot 4, device present, power: Off, speed 8GT/s, width x4 > > │ │ │ └─06:00.0 endpoint, Research Centre Juelich (1796), device 0024 > > │ │ ├─04:02.0 downstream_port, slot 9, power: Off > > │ │ ├─04:03.0 downstream_port, slot 3, device present, power: Off, speed 8GT/s, width x4 > > │ │ │ └─08:00.0 endpoint, Research Centre Juelich (1796), device 0024 > > │ │ ├─04:08.0 downstream_port, slot 5, device present, power: Off, speed 8GT/s, width x4 > > │ │ │ └─09:00.0 endpoint, Research Centre Juelich (1796), device 0024 > > │ │ ├─04:09.0 downstream_port, slot 11, power: Off > > │ │ ├─04:0a.0 downstream_port, slot 6, device present, power: Off, speed 8GT/s, width x4 > > │ │ │ └─0b:00.0 endpoint, Research Centre Juelich (1796), device 0024 > > │ │ ├─04:0b.0 downstream_port, slot 12, power: Off > > │ │ ├─04:10.0 downstream_port, slot 8, power: Off > > │ │ ├─04:11.0 downstream_port, slot 2, device present, power: Off, speed 2.5GT/s, width x1 > > │ │ │ └─0e:00.0 endpoint, Xilinx Corporation (10ee), device 7011 > > │ │ └─04:12.0 downstream_port, slot 7, power: Off > > │ ├─02:02.0 downstream_port, slot 2 > > │ ├─02:08.0 downstream_port, slot 8 > > │ ├─02:09.0 downstream_port, slot 9, power: Off > > │ └─02:0a.0 downstream_port, slot 10 > > ├─01:00.1 endpoint, PLX Technology, Inc. (10b5), device 87d0 > > ├─01:00.2 endpoint, PLX Technology, Inc. (10b5), device 87d0 > > ├─01:00.3 endpoint, PLX Technology, Inc. (10b5), device 87d0 > > └─01:00.4 endpoint, PLX Technology, Inc. (10b5), device 87d0 > > > > 00:01.0 is on a CPU board, The 03:00.0 and everything below that is > > not on a CPU board (working with a micro TCA system here). I'm > > working with FPGA based devices seen as endpoints here. After the > > error all the endpoints in the above list are not able to talk to > > CPU anymore; register reads return 0xFFFFFFFF. At the same time PCI > > config space looks sane and accessible for those devices. > > This could be caused by a reset. We probably do a secondary bus reset > on the Root Port, which resets all the devices below it. After the > reset, config space of those downstream devices would be accessible, > but the PCI_COMMAND register may be cleared which means the device > wouldn't respond to MMIO reads. > > After adding some printk()'s to the PCIe code that deals with the reset, I can see that the pci_bus_error_reset() calls the pci_bus_reset() due to bus slots list check for empty returns true. So the the secondary bus is being reset. > > None of this explains the original problem of the Completion Timeout, > of course. The error source of 0x8 (00:01.0) is the root port, which > makes sense if it issued a request transaction and never got the > completion. The root port *should* log the header of the request and > we should print it, but it looks like we didn't. "lspci -vv" of the > device would show whether it's capable of this logging. > > I'm mostly concerned about the fact that the secondary bus is > rendered useless after the reset/recovery. AIUI, this should not > happen. It should not. The secondary bus should still work after the reset. > After upgrading my pciutils I get a bit more output from lspci -vv for the Root port. It seems it is capable of logging the errors: > > Capabilities: [1c0 v1] Advanced Error Reporting > UESta: DLP- SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- > UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- > UESvrt: DLP+ SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- > CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr- > CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ > AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn- > MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap- > HeaderLog: 00000000 00000001 00000002 00000003 > RootCmd: CERptEn+ NFERptEn+ FERptEn+ > RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd- > FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0 > ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0008 > > The above was provoked with the aer-inject of the same error as seen in initial report, with bogus 'HEADER_LOG 0 1 2 3'. > > Dec 9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: aer_inject: Injecting errors 00000000/00004000 into device 0000:00:01.0 > Dec 9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: 0000:00:01.0 > Dec 9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID) > Dec 9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: device [8086:1901] error status/mask=00004000/00000000 > Dec 9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: [14] CmpltTO > Dec 9 10:25:19 bd-cpu18 kernel: mrf-pci 0000:0e:00.0: [aer_error_detected] called .. state=2 > > > Question: would using different (real?) HEADER_LOG values affect the recovery behaviour? No, it shouldn't. I'm pretty sure we only print those values and they don't influence behavior at all. > > How can I debug this further? I'm getting the "I/O to channel is > > blocked" (pci_channel_io_frozen) state reported to the the endpoint > > devices I provide driver for. Is there any way of telling if the > > PCI switch devices between 00:01.0 ... 06:00.0 have all recovered ; > > links up and running and similar? Is this information provided by > > the Linux kernel somehow? > > > > For reference, I've experimented with AER inject and my tests showed > > that if the same type of error is injected in any other PCI device > > in the path 01:00.0 ... 06:00.0, IOW not into 00:01.0, the link is > > recovered successfully, and I can continue working with the endpoint > > devices. In those cases the "I/O channel is in normal state" > > (pci_channel_io_normal) state was reported; only error injection > > into 00:01.0 reports pci_channel_io_frozen state. Recovery code in > > the endpoint drivers I maintain is just calling the > > pci_cleanup_aer_uncorrect_error_status() in error handler resume() > > callback. > > > > FYI, this is on 3.10.0-1160.6.1.el7.x86_64.debug CentOS7 kernel > > which I believe is quite old. At the moment I can not use newer > > kernel, but would be prepared to take that step if told that it > > would help. > > It's really not practical for us to help debug a v3.10-based kernel; > that's over seven years old and AER handling has changed significantly > since then. Also, CentOS is a distro kernel and includes many changes > on top of the v3.10 upstream kernel. Those changes might be > theoretically open-source, but as a practical matter, the distro is a > better place to start for support of their kernel. > > I followed your advice and started using kernel 5.9.12; hope this is recent enough. > > With that kernel I see a bit of different behavior than with the > 3.10 earlier. Configuration access to the devices on the secondary > bus is not OK; all 0xFFFFFFFF. This is what I see after the error > was injected into the root port 00:01.0 and secondary bus reset was > made as part of recovery: > > 00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 05) (prog-if 00 [Normal decode]) > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- > Latency: 0, Cache Line Size: 64 bytes > 01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if 00 [Normal decode]) > Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- > Interrupt: pin A routed to IRQ 128 > 02:01.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ff) (prog-if ff) > !!! Unknown header type 7f > Kernel driver in use: pcieport > > 03:00.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ff) (prog-if ff) > !!! Unknown header type 7f > Kernel driver in use: pcieport > > 04:11.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ff) (prog-if ff) > !!! Unknown header type 7f > Kernel driver in use: pcieport > > 0e:00.0 Signal processing controller: Xilinx Corporation Device 7011 (rev ff) (prog-if ff) > !!! Unknown header type 7f > Kernel driver in use: mrf-pci > > The return code from aer_root_reset() is PCI_ERS_RESULT_RECOVERED. > How come that the 01:00.0 upstream port PCI_COMMAND register is > cleared? I guess that explains why the rest of the devices in the > chain down to the endpoint at 0e:00.0 are rendered useless. > > Then, as part of having fun, I made the pci_bus_error_reset() *NOT* > do the actual reset of the secondary by just returning 0 before the > call to the pci_bus_reset(). In this case no bus reset is done, > obviously, and MMIO access to my endpoint(s) is working just fine > after the error was injected into the root port; as if nothing > happened. > > Is it expected that in cases of seeing this type of error the > secondary bus would be in this kind of state? It looks to me that > the problem is not with my endpoints but with the PCIe switch on the > CPU board (01:00.0 <--> 02:0?.0). Hmm. Yeah, I don't see anything in the path that would restore the 01:00.0 config space, which seems like a problem. Probably its secondary/subordinate bus registers and window registers get reset to 0 ("lspci -vv" would show that) so config/mem/io space below it is accessible. I didn't have much confidence in our error handling to begin with, and it's eroding ;) Bjorn