>From: Bjorn Helgaas <helgaas@xxxxxxxxxx> >Sent: Tuesday, January 17, 2023 4:23 PM >To: Alexey Bogoslavsky <Alexey.Bogoslavsky@xxxxxxx> >Cc: 'hch@xxxxxx' <hch@xxxxxx>; linux-pci@xxxxxxxxxxxxxxx; bhelgaas@xxxxxxxxxx; Rajat Khandelwal <rajat.khandelwal@xxxxxxxxxxxxxxx> >Subject: Re: [PATCH 1/1] PCI/AER: Ignore correctable error reports for SN730 WD SSD >[+cc Rajat] >On Tue, Jan 17, 2023 at 01:20:28PM +0000, Alexey Bogoslavsky wrote: >> >From: 'hch@xxxxxx' <hch@xxxxxx> >> >On Mon, Jan 16, 2023 at 06:32:54PM +0000, Alexey Bogoslavsky wrote: >> >> From: Alexey Bogoslavsky <mailto:Alexey.Bogoslavsky@xxxxxxx> >> >> >> >> A bug was found in SN730 WD SSD that causes occasional false AER reporting >> >> of correctable errors. While functionally harmless, this causes error >> >> messages to appear in the system log (dmesg) which, in turn, causes >> >> problems in automated platform validation tests. >I don't think automated test problems warrant an OS change to suppress >warnings. Those are internal tests where you can automatically ignore >warnings you don't care about. >> >> Since the issue can not >> >> be fixed by FW, customers asked for correctable error reporting to be >> >> quirked out in the kernel for this particular device. >Customer complaints are a little more of an issue. But let's clarify >what's happening here. You mention "false AER reporting," and I want >to understand that better. I guess you mean that SN730 logs a >correctable error and generates an ERR_COR message when no error >actually occurred? Exactly >I hesitate to turn off correctable error reporting completely because >that information is useful for monitoring overall system health. But >there are two things I think we could do: > - Reformat the correctable error messages so they are obviously > different from uncorrectable errors and indicate that these have > been automatically corrected, and We've had long and tiresome discussions with the customers using this specific device in their design trying to convince them to just ignore the messages, but they wouldn't budge > - Possibly rate-limit them on a per-device basis, e.g., something > like https://lore.kernel.org/r/20230103165548.570377-1-rajat.khandelwal@xxxxxxxxxxxxxxx The thing is on this device correctable PCI errors can never be trusted. So reporting them in any format or at any rate makes little sense as one can never assume they're real. Also, the loss of potential real error notifications is something the customers seem to be OK with, given the pros and cons. >If we do end up having to turn off reporting completely, I would log a >note that we're doing that so we know we're giving up something and >there may be legitimate errors that we will never see. You mean, print a warning message the first time such error is reported by the device and ignore all additional ERR_COR messages? This seems like a fine idea, I would definitely be willing to implement it >> >> The patch was manually verified. It was checked that correctable errors >> >> are still detected but ignored for the target device (SN730), and are both >> >> detected and reported for devices not affected by this quirk. >> >> >> >> Signed-off-by: Alexey Bogoslavsky mailto:alexey.bogoslavsky@xxxxxxx >Check the history and use the same format here, i.e., > Alexey Bogoslavsky <alexey.bogoslavsky@xxxxxxx> >(Also for the From: line inserted at the top.) Sorry about that, will fix > >> --- > >> drivers/pci/pcie/aer.c | 3 +++ > >> drivers/pci/quirks.c | 6 ++++++ > >> include/linux/pci.h | 1 + > >> include/linux/pci_ids.h | 4 ++++ > >> 4 files changed, 14 insertions(+) > >> > >> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > >> index d7ee79d7b192..5cc24d28b76d 100644 > >> --- a/drivers/pci/pcie/aer.c > >> +++ b/drivers/pci/pcie/aer.c > >> @@ -721,6 +721,9 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info) > >> goto out; > >> } > >> > >> + if ((info->severity == AER_CORRECTABLE) && dev->ignore_correctable_errors) > > >No need for the inner braces. > Will fix > > >> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_WESTERN_DIGITAL, PCI_DEVICE_ID_WESTERN_DIGITAL_SN730, wd_ignore_correctable_errors); > > >Overly long line. Also wd_ seems like an odd prefix, I'd do pci_ > instead. > Will fix both, thanks > >But overall I'm not really sure it's worth adding code just to surpress > >a harmless warning. > This is, of course, problematic. We're only resorting to this option after we've tried pretty much everything else