On Fri, 11 Feb 2022, Rik van Riel wrote: > Sometimes the page offlining code can leave behind a hwpoisoned clean > page cache page. This can lead to programs being killed over and over > and over again as they fault in the hwpoisoned page, get killed, and > then get re-spawned by whatever wanted to run them. Hi Rik, This looks good. Some minor suggestions below: > > This is particularly embarrassing when the page was offlined due to > having too many corrected memory errors. Now we are killing tasks > due to them trying to access memory that probably isn't even corrupted. I'd recommend deleting that paragraph entirely. It's a separate question, and it is not necessarily an accurate assessment of that question either: the engineers who set the thresholds for "too many corrected errors" may not--in fact, probably *will not*--agree with your feeling that the memory is still working and reliable! > > This problem can be avoided by invalidating the page from the page > fault handler, which already has a branch for dealing with these > kinds of pages. With this patch we simply pretend the page fault > was successful if the page was invalidated, return to userspace, > incur another page fault, read in the file from disk (to a new > memory page), and then everything works again. Very nice write-up here, it's rare that I get to read such a short, yet clear explanation. Thank you for that. :) > > Signed-off-by: Rik van Riel <riel@xxxxxxxxxxx> > > diff --git a/mm/memory.c b/mm/memory.c > index c125c4969913..2300358e268c 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -3871,11 +3871,16 @@ static vm_fault_t __do_fault(struct vm_fault *vmf) > return ret; > > if (unlikely(PageHWPoison(vmf->page))) { > - if (ret & VM_FAULT_LOCKED) > + int poisonret = VM_FAULT_HWPOISON; > + if (ret & VM_FAULT_LOCKED) { How about changing those two lines to these three (note the newline after the declaration): vm_fault_t poison_ret = VM_FAULT_HWPOISON; if (ret & VM_FAULT_LOCKED) { ...which should fix the krobot complaint, and the underscore is just a very tiny readability improvement while we're there. > + /* Retry if a clean page was removed from the cache. */ This is more cryptic than necessary, and in fact you've already provided a write-up, so how about something like this, instead of the above: /* * Avoid refaulting on the same poisoned page * forever: attempt to invalidate the page. If that * succeeds, then pretend the page fault was successful, * return to userspace, incur another page fault, read * in the file from disk (to a new page),and then * everything works again. */ > + if (invalidate_inode_page(vmf->page)) > + poisonret = 0; > unlock_page(vmf->page); > + } > put_page(vmf->page); > vmf->page = NULL; > - return VM_FAULT_HWPOISON; > + return poisonret; > } > > if (unlikely(!(ret & VM_FAULT_LOCKED))) > > > -- > All rights reversed. Anyway, those are all documentation nits, except for the vm_fault_t, which is a declaration nit :) , so I'm comfortable saying: Reviewed-by: John Hubbard <jhubbard@xxxxxxxxxx> thanks, -- John Hubbard NVIDIA