Re: [PATCH 1/1] PCI/AER: Ignore correctable error reports for SN730 WD SSD

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



[+cc Rajat]

On Tue, Jan 17, 2023 at 01:20:28PM +0000, Alexey Bogoslavsky wrote:
> >From: 'hch@xxxxxx' <hch@xxxxxx> 
> >On Mon, Jan 16, 2023 at 06:32:54PM +0000, Alexey Bogoslavsky wrote:
> >> From: Alexey Bogoslavsky <mailto:Alexey.Bogoslavsky@xxxxxxx>
> >>
> >> A bug was found in SN730 WD SSD that causes occasional false AER reporting
> >> of correctable errors. While functionally harmless, this causes error
> >> messages to appear in the system log (dmesg) which, in turn, causes
> >> problems in automated platform validation tests.

I don't think automated test problems warrant an OS change to suppress
warnings.  Those are internal tests where you can automatically ignore
warnings you don't care about.

> >> Since the issue can not
> >> be fixed by FW, customers asked for correctable error reporting to be
> >> quirked out in the kernel for this particular device.

Customer complaints are a little more of an issue.  But let's clarify
what's happening here.  You mention "false AER reporting," and I want
to understand that better.  I guess you mean that SN730 logs a
correctable error and generates an ERR_COR message when no error
actually occurred?

I hesitate to turn off correctable error reporting completely because
that information is useful for monitoring overall system health.  But
there are two things I think we could do:

  - Reformat the correctable error messages so they are obviously
    different from uncorrectable errors and indicate that these have
    been automatically corrected, and

  - Possibly rate-limit them on a per-device basis, e.g., something
    like https://lore.kernel.org/r/20230103165548.570377-1-rajat.khandelwal@xxxxxxxxxxxxxxx

If we do end up having to turn off reporting completely, I would log a
note that we're doing that so we know we're giving up something and
there may be legitimate errors that we will never see.

> >> The patch was manually verified. It was checked that correctable errors
> >> are still detected but ignored for the target device (SN730), and are both
> >> detected and reported for devices not affected by this quirk.
> >>
> >> Signed-off-by: Alexey Bogoslavsky mailto:alexey.bogoslavsky@xxxxxxx

Check the history and use the same format here, i.e.,

  Alexey Bogoslavsky <alexey.bogoslavsky@xxxxxxx>

(Also for the From: line inserted at the top.)

> >> ---
> >>  drivers/pci/pcie/aer.c  | 3 +++
> >>  drivers/pci/quirks.c    | 6 ++++++
> >>  include/linux/pci.h     | 1 +
> >>  include/linux/pci_ids.h | 4 ++++
> >>  4 files changed, 14 insertions(+)
> >>
> >> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> >> index d7ee79d7b192..5cc24d28b76d 100644
> >> --- a/drivers/pci/pcie/aer.c
> >> +++ b/drivers/pci/pcie/aer.c
> >> @@ -721,6 +721,9 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
> >>               goto out;
> >>       }
> >>
> >> +     if ((info->severity == AER_CORRECTABLE) && dev->ignore_correctable_errors)
> 
> >No need for the inner braces.
> Will fix
> 
> >> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_WESTERN_DIGITAL, PCI_DEVICE_ID_WESTERN_DIGITAL_SN730, wd_ignore_correctable_errors);
> 
> >Overly long line.  Also wd_ seems like an odd prefix, I'd do pci_
> instead.
> Will fix both, thanks
> >But overall I'm not really sure it's worth adding code just to surpress
> >a harmless warning.
> This is, of course, problematic. We're only resorting to this option after we've tried pretty much everything else



[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux