Can AER correctable errors make the device inaccessible?

Rajat Jain <rajatja@xxxxxxxxxx> · Wed, 1 Feb 2017 16:22:49 -0800

Hello,

We have a few devices out in the field, and we get occasional kernel
crash reports in which the PCI device driver complains that it cannot
access the device (reads of the PCIe memory-mapped registers return
all 0xFFs). Sometimes we also see a bunch of PCIe correctable AER
errors in dmesg for those devices:

[   58.527092] pcieport 0000:00:1c.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=00e0(Transmitter ID)
[   58.527104] pcieport 0000:00:1c.0:   device [8086:0f48] error status/mask=00001100/00002000
[   58.527114] pcieport 0000:00:1c.0:    [ 8] RELAY_NUM Rollover
[   58.527123] pcieport 0000:00:1c.0:    [12] Replay Timer Timeout

and thus we suspect that the PCIe link health may not be very good,
causing this to happen. However, a lot of the time the dmesg logs are
not available or have rolled over, leaving us no good way of telling
whether the PCIe link has been seeing issues. I can't find any
counters or statistics in sysfs or elsewhere that can tell me what
kind of PCIe/AER errors were seen. My questions:

1) I assume we are not the first ones to feel the need for such
counters. Essentially I'm looking for counters that indicate PCIe
errors (possibly categorized by error type) on a per-device / per-link
basis. Does something like this exist already? If not, does it make
sense to add such counters (I'd be happy to add them)?

2) We only see the above correctable errors, and no uncorrectable
errors. Could they be noticeable by the driver? I looked in the PCIe
spec, and it seems these only mean that the link may retry the
packets, so the PCIe transaction may take a little longer to execute
but should not fail. Hence the SW should not notice anything other
than the PCIe read taking slightly longer than usual. I see that a
REPLAY_NUM rollover (logged above as "RELAY_NUM Rollover") may cause
the link to retrain, but in that case would the pending PCIe
transaction fail? In short, can the above errors make a device
inaccessible to the driver?
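
To illustrate the kind of per-device, per-error-type counters I have in
mind for question 1, here is a rough userspace sketch in Python that
tallies the per-bit detail lines from a dmesg capture. The regular
expression and the name tally_aer_errors are my own assumptions, based
only on the log format quoted above. This is not an existing kernel or
sysfs interface, and it obviously does not help once dmesg has rolled
over.

```python
import re
from collections import Counter

# Matches the per-bit detail lines in the format quoted above, e.g.
# "[   58.527114] pcieport 0000:00:1c.0:    [ 8] RELAY_NUM Rollover"
# (assumed format; other kernel versions may print these differently).
DETAIL_RE = re.compile(
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):\s+"
    r"\[\s*(?P<bit>\d+)\]\s+(?P<name>.+?)\s*$"
)

def tally_aer_errors(lines):
    """Count AER detail lines, keyed by (device address, error name)."""
    counts = Counter()
    for line in lines:
        m = DETAIL_RE.search(line)
        if m:
            counts[(m.group("bdf"), m.group("name"))] += 1
    return counts

# The log excerpt from this mail, one dmesg line per entry.
SAMPLE = [
    "[   58.527092] pcieport 0000:00:1c.0: PCIe Bus Error: "
    "severity=Corrected, type=Data Link Layer, id=00e0(Transmitter ID)",
    "[   58.527104] pcieport 0000:00:1c.0:   device [8086:0f48] error "
    "status/mask=00001100/00002000",
    "[   58.527114] pcieport 0000:00:1c.0:    [ 8] RELAY_NUM Rollover",
    "[   58.527123] pcieport 0000:00:1c.0:    [12] Replay Timer Timeout",
]

if __name__ == "__main__":
    for (bdf, name), n in sorted(tally_aer_errors(SAMPLE).items()):
        print(f"{bdf}  {name}: {n}")
```

What I am really after is the equivalent of these counts maintained
persistently, so that they survive a log rollover.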

My understanding so far is that these correctable errors should not
play a role (other than introducing some delay), and that this may be
more of a device firmware issue (the device may have landed in a bad
state and thus returns all FFs). I would be happy to get any more
pointers.
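
For reference, this is how I am reading the status/mask=00001100/00002000
value from the log: a small Python sketch that decodes it against the
bit layout of the AER Correctable Error Status register in the PCIe
spec. The helper name decode_cor_status is just something I made up for
illustration.

```python
# Bit positions of the AER Correctable Error Status / Mask registers,
# as laid out in the PCIe specification.
COR_ERR_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}

def decode_cor_status(value):
    """Return the names of all bits set in a 32-bit status or mask value."""
    names = []
    for bit in range(32):
        if value & (1 << bit):
            # Bits not in the table are reported by number only.
            names.append(COR_ERR_BITS.get(bit, f"bit {bit} (unknown)"))
    return names

if __name__ == "__main__":
    print("status:", decode_cor_status(0x00001100))
    print("mask:  ", decode_cor_status(0x00002000))
```

Decoded this way, the status shows only the REPLAY_NUM Rollover and
Replay Timer Timeout bits (matching the [ 8] and [12] lines), and the
mask only has Advisory Non-Fatal Error set.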

Thanks,

Rajat