Hi Jon,
On 15/01/2025 08:42, Jon Pan-Doh wrote:
Proposal
========
When using native AER, spammy devices can flood kernel logs with AER errors
and slow/stall execution. Add per-device per-error-severity ratelimits
for more robust error logging. Allow userspace to configure ratelimits
via sysfs knobs.
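For readers unfamiliar with the idea, per-device per-error-severity
ratelimiting can be sketched in userspace C as below. This is only an
illustration and not code from the series: every name (dev_aer_limits,
aer_log_allowed, the two severity classes) is hypothetical, and the
fixed-window interval/burst scheme is an assumption loosely modelled on
the interval and burst fields of the kernel's struct ratelimit_state.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical severity classes; the series may split them differently. */
enum aer_sev { SEV_CORRECTABLE, SEV_UNCORRECTABLE, SEV_COUNT };

/* Fixed-window limiter: at most `burst` messages per `interval_ms`. */
struct ratelimit {
	uint64_t interval_ms;     /* window length; 0 disables limiting */
	unsigned int burst;       /* messages allowed per window */
	uint64_t window_start_ms; /* start of the current window */
	unsigned int printed;     /* messages let through in this window */
	unsigned int missed;      /* total messages suppressed so far */
};

/* Hypothetical per-device state: one independent limiter per severity. */
struct dev_aer_limits {
	struct ratelimit rl[SEV_COUNT];
};

static void dev_aer_limits_init(struct dev_aer_limits *d,
				uint64_t interval_ms, unsigned int burst)
{
	for (int i = 0; i < SEV_COUNT; i++)
		d->rl[i] = (struct ratelimit){
			.interval_ms = interval_ms,
			.burst = burst,
		};
}

/* Return true if an error of this severity may be logged at now_ms. */
static bool aer_log_allowed(struct dev_aer_limits *d, enum aer_sev sev,
			    uint64_t now_ms)
{
	struct ratelimit *rl = &d->rl[sev];

	if (rl->interval_ms == 0)
		return true;

	/* Open a fresh window once the current one has expired. */
	if (now_ms - rl->window_start_ms >= rl->interval_ms) {
		rl->window_start_ms = now_ms;
		rl->printed = 0;
	}

	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}

	rl->missed++;
	return false;
}
```

Because each device carries its own limiter per severity, a device
spamming correctable errors exhausts only its own correctable budget;
its uncorrectable errors and other devices' logs are unaffected. The
interval and burst values are the kind of parameters the proposed sysfs
knobs would expose.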
Do you have any update on the series?
I'm aware that a lot is happening in the AER code right now, so I was
wondering whether it would help to split up the series and get the log
ratelimiting in sooner. There are some concerns about disabling error
generation that should be discussed, but I don't want them to block the
log ratelimit changes. I think it would be good to fix this first to
save people (myself included) from overflowing syslogs.
All the best,
Karolina
Motivation
==========
Several OCP members have issues with inconsistent PCIe error handling,
exacerbated at datacenter scale (myriad of devices).
The OCP HW/Fault Management subproject set out to solve this by
standardizing, across the industry:
- PCIe error handling best practices
- Fault Management/RAS (incl. PCIe errors)
Exposing PCIe errors/debug info in-band for a userspace daemon (e.g.
rasdaemon) to collect/pass on to repairability services is part of the
roadmap.
Background
==========
AER error spam has been observed many times, both publicly (e.g. [1], [2],
[3]) and privately. While it usually occurs with correctable errors, it can
happen with uncorrectable errors (e.g. during new HW bringup).
There have been previous attempts to add ratelimits to AER logs ([4],
[5]). The most recent attempt[5] has many similarities with the proposed
approach.
Patch organization
==================
1-3 AER logging cleanup
4-7 Ratelimits and sysfs knobs
8 Sysfs cleanup (RFC that breaks existing ABI/can be dropped)
Outstanding work
================
Cleanup:
- Consolidate aer_print_error() and pci_print_error() path
- Elevate log level logic out of print functions[6]
[1] https://bugzilla.kernel.org/show_bug.cgi?id=215027
[2] https://bugzilla.kernel.org/show_bug.cgi?id=201517
[3] https://bugzilla.kernel.org/show_bug.cgi?id=196183
[4] https://lore.kernel.org/linux-pci/20230606035442.2886343-2-grundler@xxxxxxxxxxxx/
[5] https://lore.kernel.org/linux-pci/cover.1736341506.git.karolina.stolarek@xxxxxxxxxx/
[6] https://lore.kernel.org/linux-pci/edd77011aafad4c0654358a26b4e538d0c5a321d.1736341506.git.karolina.stolarek@xxxxxxxxxx/
Jon Pan-Doh (8):
PCI/AER: Remove aer_print_port_info
PCI/AER: Move AER stat collection out of __aer_print_error
PCI/AER: Rename struct aer_stats to aer_info
PCI/AER: Introduce ratelimit for error logs
PCI/AER: Introduce ratelimit for AER IRQs
PCI/AER: Add AER sysfs attributes for ratelimits
PCI/AER: Update AER sysfs ABI filename
PCI/AER: Move AER sysfs attributes into separate directory
...es-aer_stats => sysfs-bus-pci-devices-aer} | 50 +++-
Documentation/PCI/pcieaer-howto.rst | 10 +-
drivers/pci/pci-sysfs.c | 2 +-
drivers/pci/pci.h | 2 +-
drivers/pci/pcie/aer.c | 227 +++++++++++++-----
include/linux/pci.h | 2 +-
6 files changed, 216 insertions(+), 77 deletions(-)
rename Documentation/ABI/testing/{sysfs-bus-pci-devices-aer_stats => sysfs-bus-pci-devices-aer} (69%)