Re: PCIe error reporting

On Mon, Oct 09, 2017 at 03:12:45PM +0000, David Laight wrote:
> I'm trying to determine how a PCIe card we are building handles (and
> hopefully recovers from) PCIe link errors.
> However I'm not at all sure what I should expect the x86 Linux host to do.
> 
> The card has an Altera FPGA and I can monitor things like changes to
> its LTSSM state engine, but not quite the full operation of the PCIe logic.
> 
> I've enabled AER and lspci seems to decode most of the bits but
> it looks as though something needs to detect error bits being set,
> log the error, and then clear them.
> 
> I did a rather brutal test - shorted the TX lines after the caps.
> The card's PCIe logic issued a reset to the internal logic before
> bringing the PCIe link back up.
> I could then read config space - but the BARs were all zero
> (I think lspci reported the old values, but the -x data showed zeros).
> Nothing seemed to indicate that Linux thought anything was wrong.
> Not surprisingly reads returned ~0u.
> 
> I should really try a much shorter error.
> 
> On that system (XEON E5-2600) dmesg contains (retyped):
>   acpi: PNP0A08:00: _OSC: platform does not support [AER]

Per the PCI Firmware spec r3.0, sec 4.5.1, on ACPI platforms, the OS
is supposed to use _OSC to ask the firmware for permission before
using AER.  In this case, the firmware declined to grant us
permission, so we're not supposed to do anything at all with AER.

I think the idea is that the firmware itself is supposed to be
handling AER in this case, and it doesn't want the OS to get in the
way.
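
To make the handshake concrete, here is a rough sketch (not kernel
code) of how a declined request shows up.  The bit values are the _OSC
control-field bits as I understand them from the PCI Firmware spec; the
feature names and the requested/granted masks are made up for
illustration.

```python
# _OSC control-field bits for a PCI host bridge.  The short names are
# illustrative labels, not taken from any log format.
OSC_CONTROL_BITS = {
    0x01: "PCIeHotplug",
    0x02: "SHPCHotplug",
    0x04: "PME",
    0x08: "AER",
    0x10: "PCIeCapability",
}


def decode_osc(mask):
    """Return the names of the features whose bits are set in mask."""
    return [name for bit, name in OSC_CONTROL_BITS.items() if mask & bit]


def declined(requested, granted):
    """Features the OS asked for but the firmware did not grant."""
    return decode_osc(requested & ~granted)


# Hypothetical masks: the OS asks for everything, firmware withholds AER.
requested = 0x1F
granted = 0x17
print("platform does not support", declined(requested, granted))
```

Anything that ends up in the declined set stays under firmware
control, which is the situation on your E5-2600 box.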

> Another system is even more 'useless', it reports "AER handled
> by the firmware".

This means the ACPI HEST table has the "firmware first" bit set, so
the BIOS is supposed to field the AER interrupt, read the AER logging
CSRs, package them up, and deliver them to the OS.  The OS is then
supposed to feed the error info into the OS AER recovery path.

Linux AER recovery *should* work on this system.  But I don't think
AER is well-tested in general, and any time we have firmware-OS
interfaces, there's always potential for misunderstandings.

> If we take the PCIe link down (even after echo 1 >sys/devices/.../remove)
> something generates an NMI!
> 
> Is this all 'expected' behaviour?
> Anything else I should/could be looking at?
> Is there anything that will poll the AER bits for me?
> 
> 	David
> 
> 
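
I don't know of anything that polls the AER bits for you.  If you only
want to watch them while testing, something along these lines might do
as a starting point.  It is a minimal sketch, not a tested tool: it
assumes you hand it the raw config space (for example the sysfs
"config" file of the device, read as root so that more than the first
64 bytes are returned), and the function names are my own.  It only
reads the status registers and never clears them, so it should not
interfere with whoever owns AER, but treat that as an assumption too.

```python
import struct

PCI_EXT_CAP_START = 0x100    # extended capabilities start here
PCI_EXT_CAP_ID_AER = 0x0001  # Advanced Error Reporting capability ID
AER_UNCOR_STATUS = 0x04      # register offsets inside the AER capability
AER_COR_STATUS = 0x10

# Bit position -> name, in ascending bit order.
UNCOR_BITS = {
    4: "Data Link Protocol", 5: "Surprise Down", 12: "Poisoned TLP",
    13: "Flow Control Protocol", 14: "Completion Timeout",
    15: "Completer Abort", 16: "Unexpected Completion",
    17: "Receiver Overflow", 18: "Malformed TLP", 19: "ECRC Error",
    20: "Unsupported Request", 21: "ACS Violation",
}
COR_BITS = {
    0: "Receiver Error", 6: "Bad TLP", 7: "Bad DLLP",
    8: "Replay Num Rollover", 12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal",
}


def read32(cfg, off):
    """Read a little-endian 32-bit register from a config space image."""
    return struct.unpack_from("<I", cfg, off)[0]


def find_ext_cap(cfg, cap_id):
    """Walk the extended capability list; return the offset or None."""
    off = PCI_EXT_CAP_START
    seen = set()
    while off >= PCI_EXT_CAP_START and off + 4 <= len(cfg) and off not in seen:
        seen.add(off)
        hdr = read32(cfg, off)
        if hdr in (0, 0xFFFFFFFF):   # no capability, or device not responding
            return None
        if hdr & 0xFFFF == cap_id:
            return off
        off = (hdr >> 20) & 0xFFC    # next capability pointer, bits 31:20
    return None


def decode(status, table):
    """Names of the bits set in status, per the given bit table."""
    return [name for bit, name in table.items() if status & (1 << bit)]


def aer_status(cfg):
    """Return (uncorrectable, correctable) error names, or None if no AER."""
    aer = find_ext_cap(cfg, PCI_EXT_CAP_ID_AER)
    if aer is None or aer + AER_COR_STATUS + 4 > len(cfg):
        return None
    return (decode(read32(cfg, aer + AER_UNCOR_STATUS), UNCOR_BITS),
            decode(read32(cfg, aer + AER_COR_STATUS), COR_BITS))


def dump(path):
    """Read a config space file and print the decoded AER status."""
    with open(path, "rb") as f:
        print(aer_status(f.read()))
```

Call dump() on the device's config file in a loop with a sleep and you
have a crude poller.  A result of None means either there is no AER
capability or the device returned all ones, which would match the ~0
reads you saw after the link went down.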


