On Sun, 2022-02-13 at 00:56 -0800, John Hubbard wrote:
> On Fri, 11 Feb 2022, Rik van Riel wrote:
> >
> > This is particularly embarrassing when the page was offlined due to
> > having too many corrected memory errors. Now we are killing tasks
> > due to them trying to access memory that probably isn't even
> > corrupted.
>
> I'd recommend deleting that paragraph entirely. It's a separate
> question, and it is not necessarily an accurate assessment of that
> question either: the engineers who set the thresholds for "too many
> corrected errors" may not--in fact, probably *will not*--agree with
> your feeling that the memory is still working and reliable!

Fair enough. We try to offline pages before we get to a point
where the error correction might no longer be able to correct
the error correctly, but I am pretty sure I have seen a few
odd kernel crashes following a stream of corrected errors that
strongly suggested corruption had in fact happened.

I'll take that paragraph out if anybody else asks for
further changes for v3 of the patch.

-- 
All Rights Reversed.