Re: [PATCH 5/8] PCI/AER: Introduce ratelimit for AER IRQs

On Fri, Jan 31, 2025 at 6:44 AM Jonathan Cameron
<Jonathan.Cameron@xxxxxxxxxx> wrote:
>
> On Sat, 25 Jan 2025 08:39:35 +0100
> Lukas Wunner <lukas@xxxxxxxxx> wrote:
> > Masking errors at the register level feels overzealous,
> > in particular because it also disables logging via tracepoints.
> >
> > Is there a concrete device that necessitates this change?
> > If there is, consider adding a quirk for this particular device
> > which masks specific errors, but doesn't affect other devices.
> > If there isn't, consider dropping this change until a buggy device
> > appears that actually needs it.
>
> Fully agree with this comment.  At very least this should default
> to not ratelimiting on the tracepoints unless a specific opt in has
> occurred (probably a platform or device driver quirk).

Hi Lukas and Jonathan,

Thanks for the comments. IRQ ratelimiting/masking is the more drastic
change and needs more nuance/thought, so I split the series in v2 [1].

I am not targeting specific devices per se. The intent is to allow
userspace daemons/agents to dynamically collect telemetry and handle spam.
In the context of the datacenter (i.e. several OCP members), there are
many deployments of new HW/configurations where we may see/have seen
error spam when trying to enable native AER. Kernel quirks work in the
medium term (until the underlying device is fixed), but require a
kernel rollout. There is a desire to address this faster (i.e. without
rollout/reinstall) and I think IRQ ratelimiting fits the requirements.

I do like the idea of having IRQ ratelimiting off by default, though,
as it is a big change.

> In particular I'd worry that you are masking whatever errors
> finally trigger masking.  That might be the only one of that
> particular type that was seen and I think the only report we
> see is the 'I masked it' message.  So rasdaemon for example
> never sees the error at all.   So another tweak would be to report
> one last time so we definitely see any given error type at least
> once.

Ack.
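
To make the two points above concrete (ratelimiting off unless opted
in, and the error that trips the limit reported once before it is
masked), here is a minimal userspace sketch. It is not code from the
series: the struct, enum, and function names are all hypothetical, and
the caller is assumed to do the actual logging and register masking.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-device, per-error-type state; not from the patch. */
struct aer_rl_sketch {
	bool enabled;             /* opt-in: zero-initialized means off */
	uint32_t burst;           /* errors allowed per window */
	uint64_t window_ns;       /* window length in nanoseconds */
	uint64_t window_start_ns; /* start of the current window */
	uint32_t count;           /* errors seen in the current window */
	bool masked;              /* true once the limit was exceeded */
};

enum aer_rl_action {
	AER_RL_REPORT,           /* log/trace as usual */
	AER_RL_REPORT_THEN_MASK, /* report this one, then mask the type */
	AER_RL_SUPPRESSED,       /* already masked in this window */
};

/* Decide what to do with one error observed at time now_ns. */
static enum aer_rl_action
aer_rl_sketch_check(struct aer_rl_sketch *rl, uint64_t now_ns)
{
	/* Default: no ratelimiting, every error is reported. */
	if (!rl->enabled)
		return AER_RL_REPORT;

	/* A new window resets the count and lifts the mask. */
	if (now_ns - rl->window_start_ns >= rl->window_ns) {
		rl->window_start_ns = now_ns;
		rl->count = 0;
		rl->masked = false;
	}

	if (rl->masked)
		return AER_RL_SUPPRESSED;

	/* The error that exceeds the burst is still reported once. */
	if (++rl->count > rl->burst) {
		rl->masked = true;
		return AER_RL_REPORT_THEN_MASK;
	}

	return AER_RL_REPORT;
}
```

With burst = 2, the third error in a window returns
AER_RL_REPORT_THEN_MASK, so a consumer such as rasdaemon would still
see that error type at least once. One simplification to note: with
real register-level masking no further IRQs arrive, so the unmask
would have to be driven by a timer rather than by the next error.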

[1] https://lore.kernel.org/linux-pci/20250214023543.992372-1-pandoh@xxxxxxxxxx/

Thanks,
Jon




