Re: [PATCH 1/1] PCI/AER: Ignore correctable error reports for SN730 WD SSD

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



[+cc Grant, Rajat]

On Tue, Jan 17, 2023 at 06:15:28PM +0000, Alexey Bogoslavsky wrote:
> >From: Keith Busch <kbusch@xxxxxxxxxx> 
> >Sent: Tuesday, January 17, 2023 5:55 PM
> >To: Alexey Bogoslavsky <Alexey.Bogoslavsky@xxxxxxx>
> >Cc: linux-pci@xxxxxxxxxxxxxxx; bhelgaas@xxxxxxxxxx; 'hch@xxxxxx' <hch@xxxxxx>
> >Subject: Re: [PATCH 1/1] PCI/AER: Ignore correctable error reports for SN730 WD SSD
> 
> >On Mon, Jan 16, 2023 at 06:32:54PM +0000, Alexey Bogoslavsky wrote:
> >> From: Alexey Bogoslavsky <mailto:Alexey.Bogoslavsky@xxxxxxx>
> >>
> >> A bug was found in SN730 WD SSD that causes occasional false AER reporting
> >> of correctable errors. While functionally harmless, this causes error
> >> messages to appear in the system log (dmesg) which, in turn, causes
> >> problems in automated platform validation tests. Since the issue can not
> >> be fixed by FW, customers asked for correctable error reporting to be
> >> quirked out in the kernel for this particular device.
> >
> >> The patch was manually verified. It was checked that correctable errors
> >> are still detected but ignored for the target device (SN730), and are both
> >> detected and reported for devices not affected by this quirk.
> 
> >If you're just going to have the kernel ignore these, are you not able
> >to suppress the ERR_COR message at the source? Have the following
> >options been tried?
> 
> > a. Disabling Correctable Error Reporting Enable in Device Control
> >    Register; i.e. mask out PCI_EXP_DEVCTL_CERE.
> > b. Setting AER Correctable Error Mask Register to all 1's
> 
> >I think it's usually possible for firmware to hardwire these. If the
>
> I believe these options were discussed but deemed non-viable. I'll
> double check anyway
>
> >If firmware can't do that, quirking the kernel to always disable
> >reporting sounds like a better option. If either of the above fail
> >to suppress the error messages, then I guess having the kernel
> >ignore it is the only option.
>
> This could probably work. I'll discuss this with our FW team to make
> sure the issue can be resolved this way. Thank you

Any resolution on this FW possibility?

We have patches in progress to rate-limit correctable error messages
and make them KERN_INFO instead of KERN_WARN [1], but I don't think
that's going to be a good enough solution for you because nobody wants
to see even an informational message every 5 seconds if the message is
useless.

If firmware on the device can turn off these errors, that would be the
best solution.  If not, I think your quirk is a reasonable approach
and just needs a litle polishing per the previous comments.

Bjorn

[1] https://lore.kernel.org/r/20230317175109.3859943-1-grundler@xxxxxxxxxxxx



[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux