On Tue, Oct 17, 2017 at 10:27 AM, David Laight <David.Laight@xxxxxxxxxx> wrote:
> From: Bjorn Helgaas
>> Sent: 16 October 2017 23:07
> ...
>> I don't know how to tell what Windows is doing with respect to AER.
>
> Just hopeful someone might :-)
>
>> > I've 'bodged' the Linux kernel to think that the BIOS gave it control
>> > of AER (set OSC_PCI_EXPRESS_AER_CONTROL into *mask and
>> > root->osc_control_set in acpi_pci_osc_control_set()).
>> > I think this is the earliest place the info is saved.
>> > This is enough to get the 'pcieport ... AER enabled with IRQ nn' messages
>> > (It is sharing the interrupt with PME).
>> >
>> > I've made sure my card is beneath one of the cpu bridges (the companion
>> > chip host bridges don't support AER).
>> > I've also bodged the driver to ioremap() an area larger than one of the
>> > BARs so I can generate PCIe read and write TLP that are outside the
>> > BAR ranges.
>> > Reads set CESta: NonFatalError.
>> > Writes set UESta: UnsupReq and save the TLP header.
>> > No interrupts to aerdrv are generated.
>> > I can clear the status bits using setpci.
>> > Should I expect these errors to raise interrupts?
>>
>> I think that depends on the Root Error Command register.
>
> AFAICT that is set to 7 (all interrupts enabled), nothing gets
> set in the 'pending' word that follows.
>
> If I unmask NonFatalErr from the card's CEMsk read errors also set
> UnsupReq and save the TLP header.
>
> Unfortunately we don't have a PCIe analyser (too expensive),
> so I can't see any TLP generated by the low level hardware.
> (I can see all the read/write/completions that match the BARs.)
>
> I've also looked at the lspci -vvnnxxxx output from one of our
> Dell server systems (I've not got one to play with).
> They have the Root Error Command register set to zero.
> The RootCtl register (in the main root port capabilities) has both
> ErrNon-Fatal and ErrFatal set.
> I think this means that the errors I'm generating would set CERR
> and probably generate an NMI!
> It is likely to explain why taking down the PCIe physical layer
> generates an NMI even after 'echo 1 > xxx/remove'.
> Maybe the kernel should be unsetting these bits when a card is
> removed and restoring them after a rescan?

Wish I knew these answers off the top of my head, but I don't, so all
I could do is pore over the spec, which you can probably do as well as
I can :)
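
As a rough aid for comparing the two register settings discussed above,
here is a minimal sketch that decodes the low three bits of the AER Root
Error Command register and of the Root Control register. The bit layouts
are taken from a reading of the PCIe spec and Linux's pci_regs.h; the
table and function names are made up for illustration, and whether a
"System Error" actually ends up as an NMI depends on the platform, which
this sketch does not model.

```python
# Assumed bit layouts (low three bits only); names are illustrative.
# AER Root Error Command: which error classes raise the AER interrupt.
ROOT_ERR_CMD_BITS = {
    0x1: "Correctable Error Reporting Enable",
    0x2: "Non-Fatal Error Reporting Enable",
    0x4: "Fatal Error Reporting Enable",
}

# PCIe capability Root Control: which error classes raise a System Error.
ROOT_CTL_BITS = {
    0x1: "System Error on Correctable Error Enable",
    0x2: "System Error on Non-Fatal Error Enable",
    0x4: "System Error on Fatal Error Enable",
}


def decode(value, bits):
    """Return the names of all bits from `bits` that are set in `value`."""
    return [name for mask, name in bits.items() if value & mask]


if __name__ == "__main__":
    # Test system from the thread: Root Error Command reads as 7.
    print("Root Error Command = 7:", decode(0x7, ROOT_ERR_CMD_BITS))
    # Dell server from the thread: Root Error Command is zero, while
    # RootCtl has ErrNon-Fatal and ErrFatal set (0x6 under this layout).
    print("Root Error Command = 0:", decode(0x0, ROOT_ERR_CMD_BITS))
    print("RootCtl = 0x6:", decode(0x6, ROOT_CTL_BITS))
```

Under these assumptions, the value 7 enables AER interrupt reporting for
all three error classes, while the Dell configuration enables none of
them and instead routes non-fatal and fatal errors to System Error.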