Re: [PATCH 5/8] PCI/AER: Introduce ratelimit for AER IRQs

Karolina Stolarek <karolina.stolarek@xxxxxxxxxx> · Mon, 20 Jan 2025 11:38:05 +0100

On 18/01/2025 02:58, Jon Pan-Doh wrote:
> On Thu, Jan 16, 2025 at 4:02 AM Karolina Stolarek
> <karolina.stolarek@xxxxxxxxxx> wrote:
>> To confirm that I understand the flow -- when we're processing
>> aer_err_info, that potentially carries a couple of errors, and we hit a
>> ratelimit, we mask the error bits in Error Status Register and print
>> a warning. After doing so, we won't see these types of errors reported
>> again. What if some time passes (let's say, 2 mins), and we hit a
>> condition that would normally generate an error but it's now masked? Are
>> we fine with missing it? I think we should be informed about
>> Uncorrectable errors as much as possible, as they indicate Link
>> integrity issues.
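
For illustration only, the flow in question can be modelled in a few lines
of userspace C. This is a sketch under assumptions, not code from the patch:
the struct and function names, the fixed-window limiter, and the single
"mask" field standing in for the device's error masking are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical fixed-window limiter; not the kernel's ratelimit API. */
struct aer_irq_ratelimit {
	uint64_t interval;	/* window length, arbitrary time units */
	unsigned int burst;	/* IRQs allowed per window */
	uint64_t window_start;	/* start of the current window */
	unsigned int seen;	/* IRQs counted in the current window */
};

/* Hypothetical device model: "mask" stands in for the masked error bits. */
struct aer_dev_model {
	uint32_t mask;
	struct aer_irq_ratelimit rl;
};

/* Returns true while the current window still has budget left. */
static bool aer_rl_allow(struct aer_irq_ratelimit *rl, uint64_t now)
{
	if (now - rl->window_start >= rl->interval) {
		rl->window_start = now;
		rl->seen = 0;
	}
	if (rl->seen >= rl->burst)
		return false;
	rl->seen++;
	return true;
}

/*
 * Returns true if the error was reported. Returns false if it was
 * already masked, or if the limit was just hit and the bits got masked.
 */
static bool aer_handle_irq(struct aer_dev_model *dev, uint32_t status,
			   uint64_t now)
{
	uint32_t active = status & ~dev->mask;

	if (!active)
		return false;	/* masked earlier: silently missed */

	if (aer_rl_allow(&dev->rl, now)) {
		printf("AER: reporting error bits 0x%x\n", (unsigned int)active);
		return true;
	}

	/* Limit hit: mask these bits and warn once. Nothing unmasks them. */
	dev->mask |= active;
	printf("AER: ratelimit hit, masking error bits 0x%x\n",
	       (unsigned int)active);
	return false;
}
```

In this model, once a bit lands in "mask" it stays there no matter how much
time passes, which is exactly the "2 mins later" case raised above: the
limiter window resets, but the masked error type is never seen again.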

> Your understanding is correct. There's definitely more nuance/tradeoff
> with uncorrectable errors (likelihood of uncorrectable spam vs.
> missing critical errors). At the minimum, I think the uncorrectable
> IRQ default should be higher (semi-arbitrarily chose defaults for
> IRQs).

My comment was mostly me worrying about Uncorrectable errors. Pulling
the plug on Correctable errors after we see too many is reasonable, I think.

> I think dynamic (un)masking in the kernel is a bit too much and
> punted the decision to userspace (e.g. rasdaemon et al.) to manage
> (part of the OCP Fault Management group's roadmap).

I agree. Still, if we decide to go with IRQ masking for (Un)correctable
errors, this should be communicated to the user in the documentation.

All the best,
Karolina

> Other options include:
> - only focus on correctable errors
>   - seen uncorrectable spam e.g. new HW bringup but it is rarer
> - some type of system-wide toggle (sysfs, kernel config/cmdline) for
>   uncorrectable spam handling (may be clunky)
>
> Thanks,
> Jon