Hi Jon,

Can you share the base commit used here? I would like to try the patchset.

Regards,
Terry

On 1/15/2025 1:42 AM, Jon Pan-Doh wrote:
> Proposal
> ========
>
> When using native AER, spammy devices can flood kernel logs with AER errors
> and slow/stall execution. Add per-device per-error-severity ratelimits
> for more robust error logging. Allow userspace to configure ratelimits
> via sysfs knobs.
>
> Motivation
> ==========
>
> Several OCP members have issues with inconsistent PCIe error handling,
> exacerbated at datacenter scale (myriad of devices).
> OCP HW/Fault Management subproject set out to solve this by
> standardizing industry:
>
> - PCIe error handling best practices
> - Fault Management/RAS (incl. PCIe errors)
>
> Exposing PCIe errors/debug info in-band for a userspace daemon (e.g.
> rasdaemon) to collect/pass on to repairability services is part of the
> roadmap.
>
> Background
> ==========
>
> AER error spam has been observed many times, both publicly (e.g. [1], [2],
> [3]) and privately. While it usually occurs with correctable errors, it can
> happen with uncorrectable errors (e.g. during new HW bringup).
>
> There have been previous attempts to add ratelimits to AER logs ([4],
> [5]). The most recent attempt[5] has many similarities with the proposed
> approach.
>
> Patch organization
> ==================
> 1-3 AER logging cleanup
> 4-7 Ratelimits and sysfs knobs
> 8 Sysfs cleanup (RFC that breaks existing ABI/can be dropped)
>
> Outstanding work
> ================
> Cleanup:
> - Consolidate aer_print_error() and pci_print_error() path
> - Elevate log level logic out of print functions[6]
>
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=215027
> [2] https://bugzilla.kernel.org/show_bug.cgi?id=201517
> [3] https://bugzilla.kernel.org/show_bug.cgi?id=196183
> [4] https://lore.kernel.org/linux-pci/20230606035442.2886343-2-grundler@xxxxxxxxxxxx/
> [5] https://lore.kernel.org/linux-pci/cover.1736341506.git.karolina.stolarek@xxxxxxxxxx/
> [6] https://lore.kernel.org/linux-pci/edd77011aafad4c0654358a26b4e538d0c5a321d.1736341506.git.karolina.stolarek@xxxxxxxxxx/
>
> Jon Pan-Doh (8):
>   PCI/AER: Remove aer_print_port_info
>   PCI/AER: Move AER stat collection out of __aer_print_error
>   PCI/AER: Rename struct aer_stats to aer_info
>   PCI/AER: Introduce ratelimit for error logs
>   PCI/AER: Introduce ratelimit for AER IRQs
>   PCI/AER: Add AER sysfs attributes for ratelimits
>   PCI/AER: Update AER sysfs ABI filename
>   PCI/AER: Move AER sysfs attributes into separate directory
>
>  ...es-aer_stats => sysfs-bus-pci-devices-aer} |  50 +++-
>  Documentation/PCI/pcieaer-howto.rst           |  10 +-
>  drivers/pci/pci-sysfs.c                       |   2 +-
>  drivers/pci/pci.h                             |   2 +-
>  drivers/pci/pcie/aer.c                        | 227 +++++++++++++-----
>  include/linux/pci.h                           |   2 +-
>  6 files changed, 216 insertions(+), 77 deletions(-)
>  rename Documentation/ABI/testing/{sysfs-bus-pci-devices-aer_stats => sysfs-bus-pci-devices-aer} (69%)
>