On 18/01/2025 02:58, Jon Pan-Doh wrote:
> On Thu, Jan 16, 2025 at 4:02 AM Karolina Stolarek
> <karolina.stolarek@xxxxxxxxxx> wrote:
>> To confirm that I understand the flow -- when we're processing
>> aer_err_info, that potentially carries a couple of errors, and we hit a
>> ratelimit, we mask the error bits in Error Status Register and print
>> a warning. After doing so, we won't see these types of errors reported
>> again. What if some time passes (let's say, 2 mins), and we hit a
>> condition that would normally generate an error but it's now masked? Are
>> we fine with missing it? I think we should be informed about
>> Uncorrectable errors as much as possible, as they indicate Link
>> integrity issues.
>
> Your understanding is correct. There's definitely more nuance/tradeoff
> with uncorrectable errors (likelihood of uncorrectable spam vs.
> missing critical errors). At the minimum, I think the uncorrectable
> IRQ default should be higher (semi-arbitrarily chose defaults for
> IRQs).

My comment was mostly me worrying about Uncorrectable errors. Pulling
the plug on Correctable errors after we see too many is reasonable, I think.

> I think a dynamic (un)masking in the kernel is a bit too much and
> punted the decision to userspace (e.g. rasdaemon et al.) to manage
> (part of OCP Fault Management group's roadmap).

I agree. Still, if we decide to go with IRQ masking for (Un)correctable
errors, this should be communicated to the user in the documentation.
All the best,
Karolina

> Other options include:
> - only focus on correctable errors
>   - seen uncorrectable spam e.g. new HW bringup but it is rarer
> - some type of system-wide toggle (sysfs, kernel config/cmdline) for
>   uncorrectable spam handling (may be clunky)
>
> Thanks,
> Jon
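
For reference, the flow discussed above (hit the ratelimit, mask the error bit, print a warning, never unmask) can be sketched as a small userspace model. This is only an illustration under stated assumptions, not the code from the patch series: struct aer_model, aer_model_report() and all field names are hypothetical, the "mask" is a plain integer rather than a device register, and a single ratelimit window is shared by all error bits for simplicity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical model of the discussed behaviour; not kernel code. */
struct aer_model {
	uint32_t mask;             /* error bits that are no longer reported */
	unsigned int burst;        /* reports allowed per window */
	unsigned int interval;     /* window length, in seconds */
	unsigned int window_start; /* start of the current window, in seconds */
	unsigned int count;        /* reports seen in the current window */
	unsigned int missed;       /* errors dropped (ratelimited or masked) */
};

/* Returns true if the error is reported, false if it is dropped. */
static bool aer_model_report(struct aer_model *m, uint32_t status_bit,
			     unsigned int now)
{
	assert(m != NULL);

	/* Once a bit is masked it stays masked: nothing unmasks it here. */
	if (m->mask & status_bit) {
		m->missed++;
		return false;
	}

	/* Start a new ratelimit window once the interval has elapsed. */
	if (now - m->window_start >= m->interval) {
		m->window_start = now;
		m->count = 0;
	}

	/* Ratelimit hit: mask the bit and warn, as described in the thread. */
	if (m->count >= m->burst) {
		m->mask |= status_bit;
		m->missed++;
		printf("ratelimit hit at t=%us, masking 0x%08x\n",
		       now, (unsigned int)status_bit);
		return false;
	}

	m->count++;
	return true;
}
```

With burst = 2 and interval = 5, a third report of the same bit inside the window masks it, and a report of that bit 120 seconds later is still dropped. That is the "2 mins later" case raised above; any unmasking would have to come from somewhere else (e.g. userspace).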