On Mon, Feb 14, 2022 at 08:37:26PM -0500, Rik van Riel wrote:
> On Mon, 2022-02-14 at 15:24 -0800, Andrew Morton wrote:
> >
> > > Subject: [PATCH v2] mm: clean up hwpoison page cache page in fault path
> >
> > At first scan I thought this was a code cleanup.
> >
> > I think I'll do s/clean up/invalidate/.
> >
> OK, that sounds good.
>
> > On Sat, 12 Feb 2022 21:37:40 -0500 Rik van Riel <riel@xxxxxxxxxxx> wrote:
> >
> > > Sometimes the page offlining code can leave behind a hwpoisoned clean
> > > page cache page.
> >
> > Is this correct behaviour?
>
> It is not desirable, and the soft page offlining code
> tries to invalidate the page, but I don't think overhauling
> the way we lock and refcount page cache pages just to make
> offlining them more reliable would be worthwhile, when we
> already have a branch in the page fault handler to deal with
> these pages, anyway.

I have no idea how this kind of page ends up left in the page cache
after page offlining, but I agree with the suggested change.

>
> > > This can lead to programs being killed over and over
> > > and over again as they fault in the hwpoisoned page, get killed, and
> > > then get re-spawned by whatever wanted to run them.
> > >
> > > This is particularly embarrassing when the page was offlined due to
> > > having too many corrected memory errors. Now we are killing tasks
> > > due to them trying to access memory that probably isn't even
> > > corrupted.
> > >
> > > This problem can be avoided by invalidating the page from the page
> > > fault handler, which already has a branch for dealing with these
> > > kinds of pages. With this patch we simply pretend the page fault
> > > was successful if the page was invalidated, return to userspace,
> > > incur another page fault, read in the file from disk (to a new
> > > memory page), and then everything works again.
> >
> > Is this worth a cc:stable?
>
> Maybe. I don't know how far back this issue goes...

This issue should be orthogonal to the recent changes to hwpoison, and
the base code targeted by this patch has been unchanged since 2016
(4.10-rc1), so the patch should simply be applicable to most of the
maintained stable trees (maybe except 4.9.z).

Acked-by: Naoya Horiguchi <naoya.horiguchi@xxxxxxx>

Thanks,
Naoya Horiguchi