Hello Bjorn,

Sorry for not addressing your questions earlier. As you may have heard,
WD experienced a hacking attack which left us with no access to the
company e-mail for weeks.

As for the patch, no FW change was an option as the product causing the
issue was basically at the end of life. So, I prepared a workaround that
took into account all the comments from the community. Yet, at this
point it seems like the company has lost interest in promoting this
patch altogether. So we could just drop it. Please let me know if
there's anything I need to do to request that officially.

Thank you,
Alexey

-----Original Message-----
From: Bjorn Helgaas <helgaas@xxxxxxxxxx>
Sent: Wednesday, April 12, 2023 1:15 AM
To: Alexey Bogoslavsky <Alexey.Bogoslavsky@xxxxxxx>
Cc: Keith Busch <kbusch@xxxxxxxxxx>; linux-pci@xxxxxxxxxxxxxxx; Bjorn Helgas <bhelgaas@xxxxxxxxxx>; Christoph Hellwig <hch@xxxxxx>; Grant Grundler <grundler@xxxxxxxxxxxx>; Rajat Khandelwal <rajat.khandelwal@xxxxxxxxxxxxxxx>
Subject: Re: [PATCH 1/1] PCI/AER: Ignore correctable error reports for SN730 WD SSD

[+cc Grant, Rajat]

On Tue, Jan 17, 2023 at 06:15:28PM +0000, Alexey Bogoslavsky wrote:
> >From: Keith Busch <kbusch@xxxxxxxxxx>
> >Sent: Tuesday, January 17, 2023 5:55 PM
> >To: Alexey Bogoslavsky <Alexey.Bogoslavsky@xxxxxxx>
> >Cc: linux-pci@xxxxxxxxxxxxxxx; bhelgaas@xxxxxxxxxx; 'hch@xxxxxx' <hch@xxxxxx>
> >Subject: Re: [PATCH 1/1] PCI/AER: Ignore correctable error reports for SN730 WD SSD
>
> >On Mon, Jan 16, 2023 at 06:32:54PM +0000, Alexey Bogoslavsky wrote:
> >> From: Alexey Bogoslavsky <mailto:Alexey.Bogoslavsky@xxxxxxx>
> >>
> >> A bug was found in SN730 WD SSD that causes occasional false AER reporting
> >> of correctable errors. While functionally harmless, this causes error
> >> messages to appear in the system log (dmesg) which, in turn, causes
> >> problems in automated platform validation tests. Since the issue can not
> >> be fixed by FW, customers asked for correctable error reporting to be
> >> quirked out in the kernel for this particular device.
> >
> >> The patch was manually verified. It was checked that correctable errors
> >> are still detected but ignored for the target device (SN730), and are both
> >> detected and reported for devices not affected by this quirk.
>
> >If you're just going to have the kernel ignore these, are you not able
> >to suppress the ERR_COR message at the source? Have the following
> >options been tried?
>
> > a. Disabling Correctable Error Reporting Enable in Device Control
> >    Register; i.e. mask out PCI_EXP_DEVCTL_CERE.
> > b. Setting AER Correctable Error Mask Register to all 1's
>
> >I think it's usually possible for firmware to hardwire these. If the
>
> I believe these options were discussed but deemed non-viable. I'll
> double check anyway
>
> >If firmware can't do that, quirking the kernel to always disable
> >reporting sounds like a better option. If either of the above fail
> >to suppress the error messages, then I guess having the kernel
> >ignore it is the only option.
>
> This could probably work. I'll discuss this with our FW team to make
> sure the issue can be resolved this way. Thank you

Any resolution on this FW possibility?

We have patches in progress to rate-limit correctable error messages
and make them KERN_INFO instead of KERN_WARN [1], but I don't think
that's going to be a good enough solution for you because nobody wants
to see even an informational message every 5 seconds if the message is
useless.

If firmware on the device can turn off these errors, that would be the
best solution. If not, I think your quirk is a reasonable approach
and just needs a little polishing per the previous comments.

Bjorn

[1] https://lore.kernel.org/r/20230317175109.3859943-1-grundler@xxxxxxxxxxxx
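
For readers following the thread: options (a) and (b) quoted above
come down to two register updates. The sketch below only models them in
plain C against a made-up struct; it is not the patch under discussion
and not kernel code. The struct fake_dev and both function names are
hypothetical. Only the PCI_EXP_DEVCTL_CERE bit value is taken from the
kernel's pci_regs.h. A real quirk would presumably go through
config-space accessors such as pcie_capability_clear_word() and
pci_write_config_dword() instead of touching a struct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit value as defined in Linux include/uapi/linux/pci_regs.h */
#define PCI_EXP_DEVCTL_CERE 0x0001 /* Correctable Error Reporting Enable */

/* Hypothetical stand-in for a device's registers; not a kernel type. */
struct fake_dev {
	uint16_t devctl;   /* PCIe Device Control register */
	uint32_t cor_mask; /* AER Correctable Error Mask register */
};

/* Option (a): clear Correctable Error Reporting Enable, keep other bits. */
void disable_cor_reporting(struct fake_dev *dev)
{
	assert(dev != NULL);
	dev->devctl &= (uint16_t)~PCI_EXP_DEVCTL_CERE;
}

/* Option (b): set the AER Correctable Error Mask register to all 1's. */
void mask_all_cor_errors(struct fake_dev *dev)
{
	assert(dev != NULL);
	dev->cor_mask = 0xffffffffu;
}
```

Either step suppresses ERR_COR at the source rather than filtering it
in the AER handler, which is the distinction Keith draws above. Whether
the SN730 honors either setting is not established in this thread.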