Re: PCIe error reporting

On Tue, Oct 17, 2017 at 10:27 AM, David Laight <David.Laight@xxxxxxxxxx> wrote:
> From: Bjorn Helgaas
>> Sent: 16 October 2017 23:07
> ...
>> I don't know how to tell what Windows is doing with respect to AER.
>
> Just hopeful someone might :-)
>
>> > I've 'bodged' the Linux kernel to think that the BIOS gave it control
>> > of AER (set OSC_PCI_EXPRESS_AER_CONTROL into *mask and
>> > root->osc_control_set in acpi_pci_osc_control_set()).
>> > I think this is the earliest place the info is saved.
>> > This is enough to get the 'pcieport ... AER enabled with IRQ nn' messages
>> > (it is sharing the interrupt with PME).
>> >
>> > I've made sure my card is beneath one of the cpu bridges (the companion
>> > chip host bridges don't support AER).
>> > I've also bodged the driver to ioremap() an area larger than one of the
>> > BARs so I can generate PCIe read and write TLPs that are outside the
>> > BAR ranges.
>> > Reads set CESta: NonFatalErr.
>> > Writes set UESta: UnsupReq and save the TLP header.
>> > No interrupts to aerdrv are generated.
>> > I can clear the status bits using setpci.
>> > Should I expect these errors to raise interrupts?
>>
>> I think that depends on the Root Error Command register.
>
> AFAICT that is set to 7 (all interrupts enabled), nothing gets
> set in the 'pending' word that follows.
>
> If I unmask NonFatalErr in the card's CEMsk, read errors also set
> UnsupReq and save the TLP header.
>
> Unfortunately we don't have a PCIe analyser (too expensive),
> so I can't see any TLP generated by the low level hardware.
> (I can see all the read/write/completions that match the BARs.)
>
> I've also looked at the lspci -vvnnxxxx output from one of our
> Dell server systems (I've not got one to play with).
> They have the Root Error Command register set to zero.
> The RootCtl register (in the main root port capabilities) has both
> ErrNon-Fatal and ErrFatal set.
> I think this means that the errors I'm generating would set CERR
> and probably generate an NMI!
> It is likely to explain why taking down the PCIe physical layer
> generates an NMI even after 'echo 1 > xxx/remove'.
> Maybe the kernel should be unsetting these bits when a card is
> removed and restoring them after a rescan?

Wish I knew these answers off the top of my head, but I don't, so all
I could do is pore over the spec, which you can probably do as well as
I can :)
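
For reference while reading the spec, here is a rough sketch of how the two
registers discussed above decode. The bit positions are the ones in the PCIe
spec (and in the kernel's pci_regs.h); the flag names follow lspci's labels.
The helper name decode_bits is made up for illustration, and the RootCtl
value 0x6 is an assumption built only from the two bits reported as set on
the Dell system, so the real register may have other bits set too.

```python
# Root Error Command register (AER capability): which error classes
# raise an interrupt at the root port.
ROOT_ERR_CMD_BITS = {
    0x1: "CorrectableErrorReportingEnable",
    0x2: "NonFatalErrorReportingEnable",
    0x4: "FatalErrorReportingEnable",
}

# Root Control register (PCIe capability): which error classes are
# turned into a system error, plus two unrelated enables.
ROOT_CTL_BITS = {
    0x01: "ErrCorrectable",
    0x02: "ErrNon-Fatal",
    0x04: "ErrFatal",
    0x08: "PMEIntEna",
    0x10: "CRSVisible",
}


def decode_bits(value, bits):
    """Return the names of all flags set in value (hypothetical helper)."""
    return [name for mask, name in bits.items() if value & mask]


# Root Error Command == 7, as reported on the test system.
print(decode_bits(0x7, ROOT_ERR_CMD_BITS))

# Illustrative RootCtl value with only ErrNon-Fatal and ErrFatal set.
print(decode_bits(0x6, ROOT_CTL_BITS))

# Root Error Command == 0, as seen in the Dell lspci dump.
print(decode_bits(0x0, ROOT_ERR_CMD_BITS))
```

With Root Error Command at 0 and the RootCtl system-error bits set, errors
would be routed to the system error path rather than to an AER interrupt,
which fits the NMI theory above, but I have not verified that on hardware.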


