RE: PCIe error reporting

From: Bjorn Helgaas
> Sent: 16 October 2017 23:07
...
> I don't know how to tell what Windows is doing with respect to AER.

Just hopeful someone might :-)

> > I've 'bodged' the Linux kernel to think that the BIOS gave it control
> > of AER (set OSC_PCI_EXPRESS_AER_CONTROL into *mask and
> > root->osc_control_set in acpi_pci_osc_control_set()).
> > I think this is the earliest place the info is saved.
> > This is enough to get the 'pcieport ... AER enabled with IRQ nn' messages
> > (It is sharing the interrupt with PME).
> >
> > I've made sure my card is beneath one of the cpu bridges (the companion
> > chip host bridges don't support AER).
> > I've also bodged the driver to ioremap() an area larger than one of the
> > BARs so I can generate PCIe read and write TLPs that are outside the
> > BAR ranges.
> > Reads set CESta: NonFatalErr.
> > Writes set UESta: UnsupReq and save the TLP header.
> > No interrupts to aerdrv are generated.
> > I can clear the status bits using setpci.
> > Should I expect these errors to raise interrupts?
> 
> I think that depends on the Root Error Command register.

AFAICT that is set to 7 (all three error-reporting enables set), but
nothing gets set in the 'pending' word that follows it (Root Error Status).
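
For reference, the check described above can be sketched in C. This is
only an illustration: the macro names and values follow
include/uapi/linux/pci_regs.h, the function name aer_root_irq_expected()
is made up here, and the two register values are assumed to have been
read separately (for example with setpci or lspci).

```c
#include <stdint.h>

/* Offsets within the AER extended capability (as in linux/pci_regs.h) */
#define PCI_ERR_ROOT_COMMAND          0x2c
#define PCI_ERR_ROOT_STATUS           0x30

/* Root Error Command: per-class error reporting enables */
#define PCI_ERR_ROOT_CMD_COR_EN       0x00000001
#define PCI_ERR_ROOT_CMD_NONFATAL_EN  0x00000002
#define PCI_ERR_ROOT_CMD_FATAL_EN     0x00000004

/* Root Error Status: which classes of error message have been logged */
#define PCI_ERR_ROOT_COR_RCV          0x00000001
#define PCI_ERR_ROOT_NONFATAL_RCV     0x00000020
#define PCI_ERR_ROOT_FATAL_RCV        0x00000040

/*
 * Hypothetical helper: returns 1 if Root Error Status shows an error
 * class whose reporting enable is set in Root Error Command, else 0.
 */
static int aer_root_irq_expected(uint32_t root_cmd, uint32_t root_status)
{
    if ((root_status & PCI_ERR_ROOT_COR_RCV) &&
        (root_cmd & PCI_ERR_ROOT_CMD_COR_EN))
        return 1;
    if ((root_status & PCI_ERR_ROOT_NONFATAL_RCV) &&
        (root_cmd & PCI_ERR_ROOT_CMD_NONFATAL_EN))
        return 1;
    if ((root_status & PCI_ERR_ROOT_FATAL_RCV) &&
        (root_cmd & PCI_ERR_ROOT_CMD_FATAL_EN))
        return 1;
    return 0;
}
```

With the command register at 7 and the status register at 0 the helper
returns 0, which matches what I see: the enables are set, but nothing has
been logged at the root port for them to act on.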

If I unmask NonFatalErr in the card's CEMsk, read errors also set
UnsupReq and save the TLP header.
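
A rough C sketch of how the two cases show up in the card's AER status
registers. The bit values are the ones in include/uapi/linux/pci_regs.h
(lspci prints bit 13 of CESta/CEMsk as NonFatalErr and bit 20 of UESta
as UnsupReq); the enum and the function name classify_read_error() are
invented for this example.

```c
#include <stdint.h>

#define PCI_ERR_UNC_UNSUP     0x00100000 /* UESta: Unsupported Request */
#define PCI_ERR_COR_ADV_NFAT  0x00002000 /* CESta/CEMsk: Advisory Non-Fatal */

/* Hypothetical classification of what the endpoint has logged */
enum read_error_state {
    READ_ERR_NONE,           /* neither status bit set */
    READ_ERR_ADVISORY_ONLY,  /* only CESta NonFatalErr set */
    READ_ERR_UNSUP_LOGGED    /* UESta UnsupReq set */
};

static enum read_error_state classify_read_error(uint32_t uesta,
                                                 uint32_t cesta)
{
    if (uesta & PCI_ERR_UNC_UNSUP)
        return READ_ERR_UNSUP_LOGGED;
    if (cesta & PCI_ERR_COR_ADV_NFAT)
        return READ_ERR_ADVISORY_ONLY;
    return READ_ERR_NONE;
}
```

With NonFatalErr masked in CEMsk my out-of-range reads land in the
ADVISORY_ONLY case; after unmasking it they land in UNSUP_LOGGED.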

Unfortunately we don't have a PCIe analyser (too expensive),
so I can't see any TLP generated by the low level hardware.
(I can see all the read/write/completions that match the BARs.)

I've also looked at the lspci -vvnnxxxx output from one of our
Dell server systems (I've not got one to play with).
They have the Root Error Command register set to zero.
The RootCtl register (in the main root port capabilities) has both
ErrNon-Fatal and ErrFatal set.
I think this means that the errors I'm generating would assert SERR
and probably generate an NMI!
That would probably explain why taking down the PCIe physical layer
generates an NMI even after 'echo 1 > xxx/remove'.
Maybe the kernel should be unsetting these bits when a card is
removed and restoring them after a rescan?
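
To make the suggestion concrete, here is a minimal user-space sketch of
the bit manipulation only. The bit values are those of the Root Control
register in include/uapi/linux/pci_regs.h; the struct and the two function
names are hypothetical, and nothing here touches hardware. In the kernel
the equivalent accesses would presumably go through
pcie_capability_read_word() and pcie_capability_clear_word() /
pcie_capability_set_word() on PCI_EXP_RTCTL, but where to hook them into
the remove/rescan paths is exactly the open question.

```c
#include <stdint.h>

/* Root Control register: System Error enables (linux/pci_regs.h) */
#define PCI_EXP_RTCTL_SECEE   0x0001 /* System Error on Correctable Error */
#define PCI_EXP_RTCTL_SENFEE  0x0002 /* System Error on Non-Fatal Error */
#define PCI_EXP_RTCTL_SEFEE   0x0004 /* System Error on Fatal Error */

#define RTCTL_SERR_MASK \
    (PCI_EXP_RTCTL_SECEE | PCI_EXP_RTCTL_SENFEE | PCI_EXP_RTCTL_SEFEE)

/* Hypothetical per-port storage for the bits cleared at removal */
struct rtctl_saved {
    uint16_t serr_bits;
};

/* On removal: remember the System Error enables, return RootCtl without them */
static uint16_t rtctl_quiesce(uint16_t rtctl, struct rtctl_saved *saved)
{
    saved->serr_bits = (uint16_t)(rtctl & RTCTL_SERR_MASK);
    return (uint16_t)(rtctl & ~RTCTL_SERR_MASK);
}

/* After rescan: put the remembered enables back, leaving other bits alone */
static uint16_t rtctl_restore(uint16_t rtctl, const struct rtctl_saved *saved)
{
    return (uint16_t)((rtctl & ~RTCTL_SERR_MASK) | saved->serr_bits);
}
```

For the Dell configuration above (ErrNon-Fatal and ErrFatal set, i.e.
0x0006 in the low bits), rtctl_quiesce() would clear both enables and
rtctl_restore() would set them again, without disturbing PMEIntEna or
CRSVisible.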

	David
