Re: [PATCH 5/8] PCI/AER: Introduce ratelimit for AER IRQs

Jon Pan-Doh <pandoh@xxxxxxxxxx> · Fri, 17 Jan 2025 17:58:50 -0800

On Thu, Jan 16, 2025 at 4:02 AM Karolina Stolarek
<karolina.stolarek@xxxxxxxxxx> wrote:
> To confirm that I understand the flow -- when we're processing
> aer_err_info, that potentially carries a couple of errors, and we hit a
> ratelimit, we mask the error bits in Error Status Register and print
> a warning. After doing so, we won't see these types of errors reported
> again. What if some time passes (let's say, 2 mins), and we hit a
> condition that would normally generate an error but it's now masked? Are
> we fine with missing it? I think we should be informed about
> Uncorrectable errors as much as possible, as they indicate Link
> integrity issues.

Your understanding is correct. There's definitely more nuance/tradeoff
with uncorrectable errors (likelihood of uncorrectable spam vs.
missing critical errors). At the minimum, I think the uncorrectable
IRQ default should be higher (semi-arbitrarily chose defaults for
IRQs).

I think a dynamic (un)masking in the kernel is a bit too much and
punted the decision to userspace (e.g. rasdaemon et al.) to manage
(part of OCP Fault Management groups roadmap).

Other options include:
- only focus on correctable errors
    - seen uncorrectable spam e.g. new HW bringup but it is rarer
- some type of system-wide toggle (sysfs, kernel config/cmdline) for
uncorrectable spam handling (may be clunky)

Thanks,
Jon