On Tue, Aug 17, 2021 at 02:12:07AM +0900, Naoya Horiguchi wrote:
> This dump indicates that HWPoisonHandlable() returned false due to
> the lack of PG_lru flag. In older code before 5.13, get_any_page() does
> retry with shake_page(), but does not since 5.13, which seems to me
> the root cause of the issue. So my suggestion is to call shake_page()
> when HWPoisonHandlable() is false.
>
> Could you try checking that the following diff fixes the issue?
> I could still have better fix (like inserting shake_page() to other
> retry paths in get_any_page()), but the below is the minimum one.

Tried it ... and it works! Injected and recovered from a thousand
errors without seeing any problems.

-Tony

P.S. Somewhere in the mail system your patch arrived with <TAB>s
changed to spaces. Here's what I applied to v5.14-rc6 (hopefully
with TABS preserved) ... just in case anyone else is following
along with this thread and wants to try some tests.

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index eefd823deb67..aa6592540f17 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1146,7 +1146,7 @@ static int __get_hwpoison_page(struct page *page)
 	 * unexpected races caused by taking a page refcount.
 	 */
 	if (!HWPoisonHandlable(head))
-		return 0;
+		return -EBUSY;
 
 	if (PageTransHuge(head)) {
 		/*
@@ -1199,9 +1199,14 @@ static int get_any_page(struct page *p, unsigned long flags)
 			}
 			goto out;
 		} else if (ret == -EBUSY) {
-			/* We raced with freeing huge page to buddy, retry. */
-			if (pass++ < 3)
+			/*
+			 * We raced with (possibly temporary) unhandlable
+			 * page, retry.
+			 */
+			if (pass++ < 3) {
+				shake_page(p, 1);
 				goto try_again;
+			}
 			goto out;
 		}
 	}