Hi Jon,
Many thanks for reaching out.
On 15/01/2025 08:55, Jon Pan-Doh wrote:
On Wed, 8 Jan 2025 13:55:30 +0000
Karolina Stolarek <karolina.stolarek@xxxxxxxxxx> wrote:
TL;DR
====
We are getting multiple reports about excessive logging of Correctable
Errors with no clear common root cause. As these errors are already
corrected by hardware, it makes sense to limit them. Introduce
a ratelimit state definition to pci_dev to control the number of
messages reported by a Root Port within a specified time interval.
The series adds other improvements in the area, as outlined in the
Proposal section.
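For illustration only, the interval/burst idea described above can be sketched in plain userspace C. This is not code from either patch series and not the kernel's ratelimit implementation: the struct aer_ratelimit type, its fields, and both helper functions are hypothetical names, and the caller passes the time in explicitly so the behaviour is easy to follow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical per-device ratelimit state (illustrative only; this is
 * not the kernel's struct ratelimit_state and not taken from the series).
 */
struct aer_ratelimit {
	uint64_t interval_ms;   /* length of one window */
	unsigned int burst;     /* messages allowed per window */
	uint64_t window_start;  /* start of the current window, in ms */
	unsigned int printed;   /* messages emitted in the current window */
	unsigned int missed;    /* messages suppressed in the current window */
};

static void aer_ratelimit_init(struct aer_ratelimit *rl,
			       uint64_t interval_ms, unsigned int burst)
{
	assert(rl != NULL);
	rl->interval_ms = interval_ms;
	rl->burst = burst;
	rl->window_start = 0;
	rl->printed = 0;
	rl->missed = 0;
}

/* Return true if a message may be logged at time now_ms. */
static bool aer_ratelimit_allow(struct aer_ratelimit *rl, uint64_t now_ms)
{
	assert(rl != NULL);

	/* Open a new window once the interval has elapsed. */
	if (now_ms - rl->window_start >= rl->interval_ms) {
		rl->window_start = now_ms;
		rl->printed = 0;
		rl->missed = 0;
	}

	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}

	/* Over the burst limit: count the message, but do not log it. */
	rl->missed++;
	return false;
}
```

With an interval of 5000 ms and a burst of 2, the first two errors in a window would be logged, further ones would only bump the missed counter, and logging would resume once the next window opens.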
Hi Karolina,
This is a common impediment for many folks that want to enable AER. The
excessive logging stalls execution, making machines unusable. I've been
working on a similar solution[1] to yours (i.e. ratelimiting) with a few
differences:
- ratelimit uncorrectable errors
- ratelimit IRQs
- configure ratelimits from userspace (sysfs knobs)
Hoping we can collaborate on a solution (i.e. take best parts of both patch
series).
That indeed looks like a more robust solution, and I'm more than happy
to join forces and work on this together.
Feel free to incorporate the 1/4 patch into your series. I plan to do a
proper review tomorrow.
Out of curiosity, do your patches apply cleanly to the pci/err and/or
pci-next branches? From what I can see, the "PCI: Consolidate TLP Log
reading and printing" series[1] has just been merged, so there could be
conflicts.
All the best,
Karolina
--------------------------------------------------------------
[1] - https://lore.kernel.org/linux-pci/20250114170840.1633-1-ilpo.jarvinen@xxxxxxxxxxxxxxx/
Thanks,
Jon
[1] https://lore.kernel.org/linux-pci/20250115074301.3514927-1-pandoh@xxxxxxxxxx/