On 2022/2/12 6:05, Rik van Riel wrote:
> Sometimes the page offlining code can leave behind a hwpoisoned clean
> page cache page. This can lead to programs being killed over and over

Yep, __soft_offline_page() only tries invalidate_inode_page() in a lightweight way.

> and over again as they fault in the hwpoisoned page, get killed, and
> then get re-spawned by whatever wanted to run them.
>
> This is particularly embarrassing when the page was offlined due to
> having too many corrected memory errors. Now we are killing tasks
> due to them trying to access memory that probably isn't even corrupted.
>
> This problem can be avoided by invalidating the page from the page
> fault handler, which already has a branch for dealing with these
> kinds of pages. With this patch we simply pretend the page fault
> was successful if the page was invalidated, return to userspace,
> incur another page fault, read in the file from disk (to a new
> memory page), and then everything works again.
>
> Signed-off-by: Rik van Riel <riel@xxxxxxxxxxx>

Good catch! This looks good to me. Thanks.

Reviewed-by: Miaohe Lin <linmiaohe@xxxxxxxxxx>

>
> diff --git a/mm/memory.c b/mm/memory.c
> index c125c4969913..2300358e268c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3871,11 +3871,16 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
>  		return ret;
>  
>  	if (unlikely(PageHWPoison(vmf->page))) {
> -		if (ret & VM_FAULT_LOCKED)
> +		int poisonret = VM_FAULT_HWPOISON;
> +		if (ret & VM_FAULT_LOCKED) {
> +			/* Retry if a clean page was removed from the cache. */
> +			if (invalidate_inode_page(vmf->page))
> +				poisonret = 0;
>  			unlock_page(vmf->page);
> +		}
>  		put_page(vmf->page);
>  		vmf->page = NULL;
> -		return VM_FAULT_HWPOISON;
> +		return poisonret;
>  	}
>  
>  	if (unlikely(!(ret & VM_FAULT_LOCKED)))