On 2024/5/28 23:41, Luck, Tony wrote:
>> +	if (unlikely(folio_mc_copy(dst, src))) {
>> +		folio_ref_unfreeze(src, expected_count);
>> +		return -EFAULT;
>
> It doesn't look like any code takes action to avoid re-using the poisoned page.
> So you survived, hurrah! But left the problem page for some other code to trip over.
Hi Tony, thanks for your review,
We tried to avoid calling memory_failure_queue() after
copy_mc_{user_}highpage(). I think memory_failure() should be called by
the arch code (e.g. mce on x86)[1] to handle the poisoned page, but in
current mainline the x86 mce code doesn't do that, so yes, we need a
memory_failure_queue() call for x86. That is not the case for the upcoming
arm64 support, where the poisoned page is handled by apei_claim_sea() and
an extra memory_failure_queue() is unnecessary (though harmless, because of
the TestSetPageHWPoison() check in memory_failure()).
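To illustrate why a duplicate report should be harmless, here is a minimal
userspace sketch (not kernel code). It assumes only the behaviour mentioned
above: memory_failure() test-and-sets the page's HWPoison flag and skips the
recovery work if the flag was already set. NR_PAGES, hwpoison[],
test_set_hwpoison() and model_memory_failure() are hypothetical names made up
for this model.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_PAGES 16UL			/* hypothetical tiny "physical memory" */

static atomic_bool hwpoison[NR_PAGES];	/* models the per-page HWPoison flag */
static int nr_handled;			/* how often recovery actually ran */

/* Models TestSetPageHWPoison(): set the flag and return its old value. */
static bool test_set_hwpoison(unsigned long pfn)
{
	assert(pfn < NR_PAGES);
	return atomic_exchange(&hwpoison[pfn], true);
}

/*
 * Models the early check in memory_failure(). Returns true only when this
 * call did the recovery work (the return convention is an assumption of
 * this model, not the kernel's).
 */
static bool model_memory_failure(unsigned long pfn)
{
	if (test_set_hwpoison(pfn))
		return false;	/* already poisoned: second report is a no-op */

	nr_handled++;		/* stand-in for unmapping/isolating the page */
	return true;
}
```

In this model, reporting the same pfn first from the arch handler and then
again from a generic helper runs the recovery step only once.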
It seems that khugepaged[3][4] should do the same thing, so we could
call memory_failure_queue() inside copy_mc_{user_}highpage() and remove
it from each caller. Is that OK?
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 00341b56d291..6b0d6f3c8580 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -352,6 +352,9 @@ static inline int copy_mc_user_highpage(struct page *to, struct page *from,
 	kunmap_local(vto);
 	kunmap_local(vfrom);
+	if (ret)
+		memory_failure_queue(page_to_pfn(from), 0);
+
 	return ret;
 }
@@ -368,6 +371,9 @@ static inline int copy_mc_highpage(struct page *to, struct page *from)
 	kunmap_local(vto);
 	kunmap_local(vfrom);
+	if (ret)
+		memory_failure_queue(page_to_pfn(from), 0);
+
 	return ret;
 }
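For illustration, a small userspace model of what the diff above aims for:
the copy helper reports the bad source page itself, so a caller only has to
unwind and return an error. This is a sketch under assumptions, not kernel
code. struct model_page, its poisoned field, MODEL_PAGE_SIZE and all model_*
functions are hypothetical, and the array stands in for
memory_failure_queue().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 64	/* hypothetical, kept small for the model */
#define QUEUE_LEN 8

struct model_page {
	unsigned long pfn;
	int poisoned;		/* simulates an uncorrectable error on read */
	unsigned char data[MODEL_PAGE_SIZE];
};

static unsigned long queued_pfn[QUEUE_LEN];	/* stand-in for the failure queue */
static size_t nr_queued;

static void model_memory_failure_queue(unsigned long pfn)
{
	assert(nr_queued < QUEUE_LEN);
	queued_pfn[nr_queued++] = pfn;
}

/* Returns nonzero on failure and queues the source pfn in that case. */
static int model_copy_mc_highpage(struct model_page *to,
				  const struct model_page *from)
{
	int ret = 0;

	if (from->poisoned)
		ret = MODEL_PAGE_SIZE;	/* nothing could be copied */
	else
		memcpy(to->data, from->data, MODEL_PAGE_SIZE);

	if (ret)
		model_memory_failure_queue(from->pfn);

	return ret;
}

/* A caller in the style of the quoted hunk: no reporting, just unwind. */
static int model_migrate(struct model_page *dst, const struct model_page *src)
{
	if (model_copy_mc_highpage(dst, src))
		return -EFAULT;
	return 0;
}
```

With the report inside the helper, a new caller cannot forget to queue the
poisoned page, which is the concern raised in the review.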
Thanks.
[1] https://lore.kernel.org/linux-mm/20240204082627.3892816-3-tongtiangen@xxxxxxxxxx/
[2] https://lore.kernel.org/linux-mm/20240528085915.1955987-1-tongtiangen@xxxxxxxxxx/
[3] 12904d953364 mm/khugepaged: recover from poisoned file-backed memory
[4] 6efc7afb5cc9 mm/hwpoison: introduce copy_mc_highpage
> -Tony