On Thu, Feb 06, 2025 at 02:56:20PM +0100, Karolina Stolarek wrote:
> On 25/01/2025 08:39, Lukas Wunner wrote:
> > Masking errors at the register level feels overzealous,
> > in particular because it also disables logging via tracepoints.
> >
> > Is there a concrete device that necessitates this change?
>
> I faced issues with excessive Correctable Errors reporting with Samsung
> PM1733 NVMe (a couple of thousand errors per hour), which were still
> polluting the logs even after introducing a ratelimit

I'd suggest adding a "u32 aer_cor_mask" to "struct pci_dev" in the
"#ifdef CONFIG_PCIEAER" section.

Then add a "DECLARE_PCI_FIXUP_HEADER()" macro in drivers/pci/quirks.c
for the Samsung PM1733 which calls a new function that sets exactly
the error bits you're seeing in aer_cor_mask.  This should be #ifdef'ed
to CONFIG_PCIEAER as well.

Finally, amend aer.c to set the bits in aer_cor_mask in the
PCI_ERR_COR_MASK register on probe.

> > If there is, consider adding a quirk for this particular device
> > which masks specific errors, but doesn't affect other devices.
>
> There were many other reports of Correctable Error floods, as signaled in
> the cover letter, so it's hard to pinpoint the specific driver that should
> mask these errors.

If a specific device frequently signals the same errors, I think
that's a bug in that device, and if the vendor doesn't provide a
firmware update, quiescing the errors through a quirk is a plausible
solution.

Of course if this is widespread, it becomes a maintenance nightmare
and then the quirk approach is not a viable option.  I cannot say
whether that's the case.  So far there's a report for one specific
product (the Samsung drive) and a hint that the problem may be
widespread.  It's difficult to make a recommendation without precise
data.

Thanks,

Lukas
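
To make the three steps suggested above (new field, header fixup,
applying the mask on probe) more concrete, here is a rough userspace
model of the data flow.  It is not kernel code and is untested against
real hardware.  Everything prefixed model_ or EXAMPLE_ is hypothetical:
the device ID is a placeholder, and the two error bits are only
stand-ins for whatever bits the drive actually reports.  The
PCI_ERR_COR_* values and the Samsung vendor ID match the kernel
headers; the table lookup merely imitates what
DECLARE_PCI_FIXUP_HEADER() would arrange in drivers/pci/quirks.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit values as defined in include/uapi/linux/pci_regs.h */
#define PCI_ERR_COR_RCVR      0x00000001u /* Receiver Error */
#define PCI_ERR_COR_BAD_TLP   0x00000040u /* Bad TLP */

#define PCI_VENDOR_ID_SAMSUNG 0x144d
#define EXAMPLE_PM1733_DEVICE 0xffff      /* placeholder, not the real ID */

/* Stand-in for the relevant parts of struct pci_dev */
struct model_pci_dev {
	uint16_t vendor;
	uint16_t device;
	uint32_t aer_cor_mask; /* step 1: the proposed new field */
	uint32_t cor_mask_reg; /* models the PCI_ERR_COR_MASK register */
};

/* Stand-in for an entry created by DECLARE_PCI_FIXUP_HEADER() */
struct model_fixup {
	uint16_t vendor;
	uint16_t device;
	void (*hook)(struct model_pci_dev *dev);
};

/* Step 2: the quirk records which correctable errors to mask.
 * The bits chosen here are an assumption for illustration. */
static void model_quirk_mask_cor_errors(struct model_pci_dev *dev)
{
	dev->aer_cor_mask |= PCI_ERR_COR_RCVR | PCI_ERR_COR_BAD_TLP;
}

static const struct model_fixup model_fixups[] = {
	{ PCI_VENDOR_ID_SAMSUNG, EXAMPLE_PM1733_DEVICE,
	  model_quirk_mask_cor_errors },
};

/* Run every fixup whose vendor/device pair matches the device */
static void model_run_header_fixups(struct model_pci_dev *dev)
{
	size_t i;

	for (i = 0; i < sizeof(model_fixups) / sizeof(model_fixups[0]); i++) {
		if (model_fixups[i].vendor == dev->vendor &&
		    model_fixups[i].device == dev->device)
			model_fixups[i].hook(dev);
	}
}

/* Step 3: on AER init, OR the quirk bits into the mask register.
 * Bits that are already masked stay masked. */
static void model_aer_init(struct model_pci_dev *dev)
{
	assert(dev != NULL);
	dev->cor_mask_reg |= dev->aer_cor_mask;
}
```

Devices without a matching fixup keep aer_cor_mask at zero, so their
mask register is left untouched.  In the kernel, step 3 would
presumably be a read-modify-write of PCI_ERR_COR_MASK at the device's
AER capability offset, but where exactly that belongs in aer.c is for
the patch author to decide.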