[+cc Rajat] On Tue, Jan 17, 2023 at 01:20:28PM +0000, Alexey Bogoslavsky wrote: > >From: 'hch@xxxxxx' <hch@xxxxxx> > >On Mon, Jan 16, 2023 at 06:32:54PM +0000, Alexey Bogoslavsky wrote: > >> From: Alexey Bogoslavsky <mailto:Alexey.Bogoslavsky@xxxxxxx> > >> > >> A bug was found in SN730 WD SSD that causes occasional false AER reporting > >> of correctable errors. While functionally harmless, this causes error > >> messages to appear in the system log (dmesg) which, in turn, causes > >> problems in automated platform validation tests. I don't think automated test problems warrant an OS change to suppress warnings. Those are internal tests where you can automatically ignore warnings you don't care about. > >> Since the issue can not > >> be fixed by FW, customers asked for correctable error reporting to be > >> quirked out in the kernel for this particular device. Customer complaints are a little more of an issue. But let's clarify what's happening here. You mention "false AER reporting," and I want to understand that better. I guess you mean that SN730 logs a correctable error and generates an ERR_COR message when no error actually occurred? I hesitate to turn off correctable error reporting completely because that information is useful for monitoring overall system health. But there are two things I think we could do: - Reformat the correctable error messages so they are obviously different from uncorrectable errors and indicate that these have been automatically corrected, and - Possibly rate-limit them on a per-device basis, e.g., something like https://lore.kernel.org/r/20230103165548.570377-1-rajat.khandelwal@xxxxxxxxxxxxxxx If we do end up having to turn off reporting completely, I would log a note that we're doing that so we know we're giving up something and there may be legitimate errors that we will never see. > >> The patch was manually verified. It was checked that correctable errors > >> are still detected but ignored for the target device (SN730), and are both > >> detected and reported for devices not affected by this quirk. > >> > >> Signed-off-by: Alexey Bogoslavsky mailto:alexey.bogoslavsky@xxxxxxx Check the history and use the same format here, i.e., Alexey Bogoslavsky <alexey.bogoslavsky@xxxxxxx> (Also for the From: line inserted at the top.) > >> --- > >> drivers/pci/pcie/aer.c | 3 +++ > >> drivers/pci/quirks.c | 6 ++++++ > >> include/linux/pci.h | 1 + > >> include/linux/pci_ids.h | 4 ++++ > >> 4 files changed, 14 insertions(+) > >> > >> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > >> index d7ee79d7b192..5cc24d28b76d 100644 > >> --- a/drivers/pci/pcie/aer.c > >> +++ b/drivers/pci/pcie/aer.c > >> @@ -721,6 +721,9 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info) > >> goto out; > >> } > >> > >> + if ((info->severity == AER_CORRECTABLE) && dev->ignore_correctable_errors) > > >No need for the inner braces. > Will fix > > >> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_WESTERN_DIGITAL, PCI_DEVICE_ID_WESTERN_DIGITAL_SN730, wd_ignore_correctable_errors); > > >Overly long line. Also wd_ seems like an odd prefix, I'd do pci_ > instead. > Will fix both, thanks > >But overall I'm not really sure it's worth adding code just to surpress > >a harmless warning. > This is, of course, problematic. We're only resorting to this option after we've tried pretty much everything else