On Thu, Jan 16, 2025 at 4:02 AM Karolina Stolarek <karolina.stolarek@xxxxxxxxxx> wrote:
> To confirm that I understand the flow -- when we're processing
> aer_err_info, that potentially carries a couple of errors, and we hit a
> ratelimit, we mask the error bits in Error Status Register and print
> a warning. After doing so, we won't see these types of errors reported
> again. What if some time passes (let's say, 2 mins), and we hit a
> condition that would normally generate an error but it's now masked? Are
> we fine with missing it? I think we should be informed about
> Uncorrectable errors as much as possible, as they indicate Link
> integrity issues.

Your understanding is correct. There's definitely more nuance and a
sharper tradeoff with uncorrectable errors (the likelihood of
uncorrectable spam vs. the risk of missing critical errors). At a
minimum, I think the default for uncorrectable IRQs should be higher
(I chose the IRQ defaults semi-arbitrarily).

I think dynamic (un)masking in the kernel is a bit too much, so I punted
that decision to userspace (e.g. rasdaemon et al.) to manage; this is
part of the OCP Fault Management group's roadmap.

Other options include:
- only focus on correctable errors
  - I have seen uncorrectable spam (e.g. during new HW bringup), but it
    is rarer
- some type of system-wide toggle (sysfs, kernel config/cmdline) for
  uncorrectable spam handling (may be clunky)

Thanks,
Jon