Hi Rajat,

On Wed, Feb 01, 2017 at 04:22:49PM -0800, Rajat Jain wrote:
> Hello,
>
> We have a few devices out in the field and we get occasional kernel
> crash reports where the PCI device driver often complains about not
> being able to access the device (PCIe mem mapped registers read
> returns all 0xFFs). Sometimes we do see a bunch of PCIe correctable
> AER errors () in the dmesg for those devices:
>
> [ 58.527092] pcieport 0000:00:1c.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=00e0(Transmitter ID)
> [ 58.527104] pcieport 0000:00:1c.0: device [8086:0f48] error status/mask=00001100/00002000
> [ 58.527114] pcieport 0000:00:1c.0: [ 8] RELAY_NUM Rollover
> [ 58.527123] pcieport 0000:00:1c.0: [12] Replay Timer Timeout
>
> and thus we suspect that the PCIe link health may not be very good
> causing this to happen. However, a lot of times, dmesg logs are not
> available or has rolled over, and hence leaving us no good way of
> identifying if the PCIe link has been seeing issues. I can't find any
> counters or statistics in sysfs or elsewhere that can tell me what
> kind of PCIE/AER errors were seen. My questions:
>
> 1) I'm assuming that we might not be the first ones to have to feel
> the need of such counters. Essentially I'm looking for some counters
> that can give indication of PCIe errors (and possibly categorized by
> error type) on a per device / link basis. Does something exist
> already? Does it make sense to add some counters like that (I'd be
> happy to add)?
>
> 2) We only see the above correctable errors, and no uncorrectable
> errors. Do they look that they may be noticable by the driver? I
> looked up in the PCIe spec, and it seems that these only mean that the
> PCIe may retry the packets and hence the PCIe transaction may take a
> little longer to execute, but the transaction should not fail. Hence
> the SW should not notice anything else other than the PCIe read taking
> slightly longer than usual. I see that RELAY_NUM rollover may cause a
> link to retrain, but in that case would the pending PCIe transaction
> fail? In short can the above errors result in making a device
> inaccessible to the driver?
>
> My understanding so far is that these correctable errors should not
> have a role (other than introducing some delay), and it may be more of
> a device firmware issue (that may have landed in a bad state thus
> returning all FFs), and will be happy to get any more pointers.

There's an open issue with AER correctable errors:
https://bugzilla.kernel.org/show_bug.cgi?id=111601

No idea whether it's related to what you're seeing, but worth checking
it out.
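
In case it helps when triaging field logs, here is a rough sketch (not
kernel code) of how the "status/mask=" pair in the dmesg excerpt above
could be decoded offline. The bit positions follow the AER Correctable
Error Status register layout in the PCIe spec; the helper names
(decode_correctable, parse_status_mask) are made up for illustration,
and the regex assumes the "status/mask=XXXXXXXX/XXXXXXXX" token appears
on one line.

```python
import re

# Bit positions of the AER Correctable Error Status register.
# Names use the spec spelling (REPLAY_NUM), not the log's "RELAY_NUM".
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}

# Hypothetical helper: matches the "status/mask=" token from dmesg.
STATUS_MASK_RE = re.compile(r"status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})")


def decode_correctable(value):
    """Return the names of the known bits set in a 32-bit register value."""
    return [name for bit, name in sorted(CORRECTABLE_BITS.items())
            if value & (1 << bit)]


def parse_status_mask(line):
    """Extract (status, mask) as integers from a log line, or None."""
    match = STATUS_MASK_RE.search(line)
    if match is None:
        return None
    return int(match.group(1), 16), int(match.group(2), 16)


if __name__ == "__main__":
    line = ("pcieport 0000:00:1c.0: device [8086:0f48] error "
            "status/mask=00001100/00002000")
    status, mask = parse_status_mask(line)
    print("reported:", decode_correctable(status))
    print("masked:  ", decode_correctable(mask))
```

For the values in your log this gives bits 8 and 12 as reported (which
matches the two lines the kernel printed) and bit 13 as masked, so
advisory non-fatal errors would not be reported as correctable errors
on that port.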