Re: [PATCH 5/8] PCI/AER: Introduce ratelimit for AER IRQs

On Sat, 25 Jan 2025 08:39:35 +0100
Lukas Wunner <lukas@xxxxxxxxx> wrote:

> On Tue, Jan 14, 2025 at 11:42:57PM -0800, Jon Pan-Doh wrote:
> > After ratelimiting logs, spammy devices can still slow execution by
> > continued AER IRQ servicing.
> > 
> > Add higher per-device ratelimits for AER errors to mask out those IRQs.
> > Set the default rate to 3x default AER ratelimit (30 per 5s).  
> 
> Masking errors at the register level feels overzealous,
> in particular because it also disables logging via tracepoints.
> 
> Is there a concrete device that necessitates this change?
> If there is, consider adding a quirk for this particular device
> which masks specific errors, but doesn't affect other devices.
> If there isn't, consider dropping this change until a buggy device
> appears that actually needs it.

Fully agree with this comment.  At the very least this should default
to not ratelimiting the tracepoints unless a specific opt-in has
occurred (probably a platform or device driver quirk).

In particular I'd worry that you are masking whichever error
finally triggers the masking.  That might be the only one of that
particular type that was seen, and I think the only report we
see is the 'I masked it' message.  So rasdaemon, for example,
never sees the error at all.  Another tweak would be to report
one last time, so we definitely see any given error type at least
once.
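
To make the "report one last time" idea concrete, here is a rough
sketch in plain C.  It is not the AER code from the patch: every name
in it (aer_dev_limit, aer_limit_check, the aer_action values) is made
up for illustration, and it assumes a simple fixed-window counter.
The point is only that the error which pushes the device over the
limit is still delivered to consumers before the mask goes in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout; this is not the kernel's AER code. */
enum aer_action {
	AER_REPORT,		/* deliver to log/tracepoint as usual */
	AER_REPORT_AND_MASK,	/* deliver one last time, then mask */
	AER_MASKED,		/* already masked, nothing is delivered */
};

struct aer_dev_limit {
	uint64_t window_start_ms;	/* start of the current window */
	uint64_t interval_ms;		/* window length */
	unsigned int burst;		/* errors allowed per window */
	unsigned int count;		/* errors seen in this window */
	bool masked;			/* set once the limit is exceeded */
};

static void aer_limit_init(struct aer_dev_limit *l, unsigned int burst,
			   uint64_t interval_ms)
{
	assert(burst > 0 && interval_ms > 0);
	l->window_start_ms = 0;
	l->interval_ms = interval_ms;
	l->burst = burst;
	l->count = 0;
	l->masked = false;
}

/*
 * Called once per error.  The error that exceeds the limit returns
 * AER_REPORT_AND_MASK, so it is still seen once before masking.
 */
static enum aer_action aer_limit_check(struct aer_dev_limit *l,
				       uint64_t now_ms)
{
	if (l->masked)
		return AER_MASKED;

	/* Start a fresh window once the old one has expired. */
	if (now_ms - l->window_start_ms >= l->interval_ms) {
		l->window_start_ms = now_ms;
		l->count = 0;
	}

	if (++l->count > l->burst) {
		l->masked = true;
		return AER_REPORT_AND_MASK;
	}
	return AER_REPORT;
}
```

With burst = 30 and interval = 5000 ms (the figures from the quoted
patch description), the first 30 errors in a window return AER_REPORT,
the 31st returns AER_REPORT_AND_MASK, and everything after that
returns AER_MASKED.  A real version would presumably also need a way
to unmask again, which this sketch leaves out.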

For CXL errors we trigger off one AER error type (internal error),
but that is then multiplexed onto finer-grained errors.  Even if
we fix the above, we would want the masking done in the CXL RAS
controls, not in AER.

Jonathan


> 
> Thanks,
> 
> Lukas
> 




