On Sat, 25 Jan 2025 08:39:35 +0100
Lukas Wunner <lukas@xxxxxxxxx> wrote:

> On Tue, Jan 14, 2025 at 11:42:57PM -0800, Jon Pan-Doh wrote:
> > After ratelimiting logs, spammy devices can still slow execution by
> > continued AER IRQ servicing.
> >
> > Add higher per-device ratelimits for AER errors to mask out those IRQs.
> > Set the default rate to 3x default AER ratelimit (30 per 5s).
>
> Masking errors at the register level feels overzealous,
> in particular because it also disables logging via tracepoints.
>
> Is there a concrete device that necessitates this change?
> If there is, consider adding a quirk for this particular device
> which masks specific errors, but doesn't affect other devices.
> If there isn't, consider dropping this change until a buggy device
> appears that actually needs it.

Fully agree with this comment. At the very least this should default to
not ratelimiting on the tracepoints unless a specific opt in has occurred
(probably a platform or device driver quirk).

In particular I'd worry that you are masking whatever error finally
triggers the masking. That might be the only one of that particular type
that was seen, and I think the only report we see is the 'I masked it'
message. So rasdaemon, for example, never sees the error at all. So another
tweak would be to report one last time so we definitely see any given error
type at least once.

For CXL errors we trigger off one AER error type (internal error), but then
that is multiplexed onto finer grained errors. Even if we fix the above
we would want the masking in the CXL RAS controls, not AER.

Jonathan

>
> Thanks,
>
> Lukas
>