On Fri, Aug 13, 2021 at 03:07:20PM +0000, Luck, Tony wrote:
> I'm running the default case from my einj_mem_uc test. That just:
>
> 1) allocates a page using:
>
> 	mmap(NULL, pagesize, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANON, -1, 0);
>
> 2) fills the page with random data (to make sure it has been allocated, and that the kernel can't
>    do KSM tricks to share this physical page with some other user).
>
> 3) injects the error at a 1KB offset within the page.
>
> 4) does a memory read of the poison address.
>
> > 		action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED);
> > +		dump_page(p, "hwpoison unknown page");
> > 		res = -EBUSY;
> > 		goto unlock_mutex;
> > 	}
>
> I added that patch against upstream (v5.14-rc5). Here's the dump. The "pfn" matches the physical address where I injected,
> and it has the hwpoison flag bit that was set early in memory_failure() --- so this is the right page.
>
> [ 79.368212] Memory failure: 0x623889: recovery action for unknown page: Ignored
> [ 79.375525] page:0000000065ad9479 refcount:3 mapcount:1 mapping:00000000a4ac843b index:0x0 pfn:0x623889
> [ 79.384909] memcg:ff40a569f2966000
> [ 79.388313] aops:shmem_aops ino:4c00 dentry name:"dev/zero"
> [ 79.393896] flags: 0x17ffffc088000c(uptodate|dirty|swapbacked|hwpoison|node=0|zone=2|lastcpupid=0x1fffff)
> [ 79.403455] raw: 0017ffffc088000c 0000000000000000 dead000000000122 ff40a569f45a7160
> [ 79.411191] raw: 0000000000000000 0000000000000000 0000000300000000 ff40a569f2966000
> [ 79.418931] page dumped because: hwpoison unknown page

Thank you for your help.

This dump indicates that HWPoisonHandlable() returned false because the
PG_lru flag is not set. Before 5.13, get_any_page() retried with
shake_page(), but it no longer does so since 5.13, which seems to me to be
the root cause of the issue. So my suggestion is to call shake_page() when
HWPoisonHandlable() is false.

Could you check whether the following diff fixes the issue?
There could still be a better fix (such as inserting shake_page() into the
other retry paths in get_any_page()), but the one below is the minimal one.

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 76cc53b2999a..3e770e4f259e 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1146,7 +1146,7 @@ static int __get_hwpoison_page(struct page *page)
 	 * unexpected races caused by taking a page refcount.
 	 */
 	if (!HWPoisonHandlable(head))
-		return 0;
+		return -EBUSY;
 
 	if (PageTransHuge(head)) {
 		/*
@@ -1199,9 +1199,14 @@ static int get_any_page(struct page *p, unsigned long flags)
 			}
 			goto out;
 		} else if (ret == -EBUSY) {
-			/* We raced with freeing huge page to buddy, retry. */
-			if (pass++ < 3)
+			/*
+			 * We raced with (possibly temporary) unhandlable
+			 * page, retry.
+			 */
+			if (pass++ < 3) {
+				shake_page(p, 1);
 				goto try_again;
+			}
 			goto out;
 		}
 	}

Thanks,
Naoya Horiguchi
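
P.S. To illustrate the control flow the diff aims for, here is a rough
user-space model (not kernel code). All toy_* names and the shakes_needed
field are hypothetical, and the idea that a few shakes are enough for the
page to become handlable is a simplifying assumption, not a statement about
what shake_page() guarantees.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; tracks only what the model needs. */
struct toy_page {
	bool on_lru;       /* models the PG_lru flag */
	int shakes_needed; /* assumed number of shakes before the page is on the LRU */
	int refcount;      /* references taken by the model */
};

/* Models HWPoisonHandlable(): in this toy, only LRU pages are handlable. */
static bool toy_handlable(const struct toy_page *p)
{
	return p->on_lru;
}

/* Models shake_page(): each call brings the page one step closer to the LRU. */
static void toy_shake_page(struct toy_page *p)
{
	if (p->shakes_needed > 0 && --p->shakes_needed == 0)
		p->on_lru = true;
}

/* Models the patched __get_hwpoison_page(): -EBUSY for an unhandlable page. */
static int toy_get_hwpoison_page(struct toy_page *p)
{
	if (!toy_handlable(p))
		return -EBUSY;
	p->refcount++;
	return 1;
}

/* Models the -EBUSY branch of get_any_page(): shake, then retry up to 3 times. */
static int toy_get_any_page(struct toy_page *p)
{
	int pass = 0;
	int ret;

	assert(p != NULL);
try_again:
	ret = toy_get_hwpoison_page(p);
	if (ret == -EBUSY && pass++ < 3) {
		toy_shake_page(p);
		goto try_again;
	}
	return ret;
}
```

In this model a page with shakes_needed = 1 is taken on the second attempt
(return value 1), while a page with shakes_needed = 4 still ends with -EBUSY
after three shakes, which corresponds to the "goto out" path in the diff.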