+ mmhwpoison-return-ehwpoison-to-denote-that-the-page-has-already-been-poisoned.patch added to -mm tree

akpm@xxxxxxxxxxxxxxxxxxxx · Sat, 22 May 2021 15:09:47 -0700

The patch titled
     Subject: mm,hwpoison: return -EHWPOISON to denote that the page has already been poisoned
has been added to the -mm tree.  Its filename is
     mmhwpoison-return-ehwpoison-to-denote-that-the-page-has-already-been-poisoned.patch

This patch should soon appear at
    https://ozlabs.org/~akpm/mmots/broken-out/mmhwpoison-return-ehwpoison-to-denote-that-the-page-has-already-been-poisoned.patch
and later at
    https://ozlabs.org/~akpm/mmotm/broken-out/mmhwpoison-return-ehwpoison-to-denote-that-the-page-has-already-been-poisoned.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Aili Yao <yaoaili@xxxxxxxxxxxx>
Subject: mm,hwpoison: return -EHWPOISON to denote that the page has already been poisoned

When memory_failure() is called with MF_ACTION_REQUIRED on a page that
has already been hwpoisoned, memory_failure() can fail to send SIGBUS to
the affected process, which results in an infinite loop of MCEs.

Currently memory_failure() returns 0 if it is called for an already
hwpoisoned page, so the caller, kill_me_maybe(), can return without
sending SIGBUS to the current process.  An action-required MCE is raised
when the current process accesses the broken memory, so without a SIGBUS
the current process keeps running and soon accesses the error page again,
which leads to an MCE loop.

This issue can arise for example in the following scenarios:

- Two or more threads access the poisoned page concurrently.  If
  local MCE is enabled, each MCE event is handled independently by the
  MCE handler.  So there is a race among the MCE events, and the second
  and later threads fall into the situation in question.

- If there was an earlier memory error event and memory_failure() for
  that event failed to unmap the error page for some reason, a subsequent
  memory access to the error page triggers the MCE loop.
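
The race in the first scenario comes down to test-and-set semantics: only
the first caller sees the flag clear, and every later caller sees "already
set".  The following is a minimal user-space sketch of that behaviour using
C11 atomics.  It is not kernel code; sim_page, SIM_PG_HWPOISON and
sim_test_set_hwpoison() are hypothetical names that only stand in for
struct page, the HWPoison page flag and TestSetPageHWPoison().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bit for this sketch; not the kernel's page flag value. */
#define SIM_PG_HWPOISON 0x1u

/* Hypothetical stand-in for struct page. */
struct sim_page {
	atomic_uint flags;
};

/*
 * Atomically set the poison bit and report whether it was already set.
 * Only one caller can ever observe "false" for a given page.
 */
static bool sim_test_set_hwpoison(struct sim_page *p)
{
	unsigned int old;

	assert(p != NULL);
	old = atomic_fetch_or(&p->flags, SIM_PG_HWPOISON);
	return (old & SIM_PG_HWPOISON) != 0;
}
```

With two threads racing on the same page, exactly one of them gets false
and proceeds with normal handling; the other gets true and ends up in the
"already hardware poisoned" branch.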

To fix the issue, make memory_failure() return an error code when the
error page has already been hwpoisoned.  This allows the memory error
handler to control how it sends signals to userspace.  Also make sure that
any process touching a hwpoisoned page gets a SIGBUS even in the "already
hwpoisoned" path of memory_failure(), as is done in the page fault path.
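
To illustrate why a distinct return value helps the caller, here is a
small user-space sketch.  It is an assumption-laden model, not the actual
kill_me_maybe() or memory_failure() code: demo_page, demo_memory_failure()
and demo_handle_mce() are hypothetical names, and the fallback value of
EHWPOISON is the asm-generic one, used only where <errno.h> does not
provide it.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#ifndef EHWPOISON
#define EHWPOISON 133	/* asm-generic value; fallback for this sketch only */
#endif

/* Hypothetical stand-in for struct page with a single poison flag. */
struct demo_page {
	bool hwpoison;
};

/* What the (hypothetical) caller decides to do after the error handler ran. */
enum demo_action {
	DEMO_HANDLED,		/* first report: nothing further in this sketch */
	DEMO_SEND_SIGBUS,	/* error return: caller must signal the process */
};

/* Model of the patched behaviour: report an already poisoned page. */
static int demo_memory_failure(struct demo_page *p)
{
	assert(p != NULL);
	if (p->hwpoison)
		return -EHWPOISON;	/* returned 0 here before the change */
	p->hwpoison = true;
	return 0;
}

/* Model of a caller that can now tell the two cases apart. */
static enum demo_action demo_handle_mce(struct demo_page *p)
{
	int ret = demo_memory_failure(p);

	if (ret == 0)
		return DEMO_HANDLED;
	return DEMO_SEND_SIGBUS;	/* includes the -EHWPOISON case */
}
```

In this model the first MCE on a page yields DEMO_HANDLED and any later
MCE on the same page yields DEMO_SEND_SIGBUS.  If the already poisoned
case still returned 0, the caller could not tell the two apart.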

Link: https://lkml.kernel.org/r/20210521030156.2612074-3-nao.horiguchi@xxxxxxxxx
Signed-off-by: Aili Yao <yaoaili@xxxxxxxxxxxx>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@xxxxxxx>
Cc: Andy Lutomirski <luto@xxxxxxxxxx>
Cc: Borislav Petkov <bp@xxxxxxxxx>
Cc: Borislav Petkov <bp@xxxxxxx>
Cc: David Hildenbrand <david@xxxxxxxxxx>
Cc: Jue Wang <juew@xxxxxxxxxx>
Cc: Oscar Salvador <osalvador@xxxxxxx>
Cc: Tony Luck <tony.luck@xxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 mm/memory-failure.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/mm/memory-failure.c~mmhwpoison-return-ehwpoison-to-denote-that-the-page-has-already-been-poisoned
+++ a/mm/memory-failure.c
@@ -1244,7 +1244,7 @@ static int memory_failure_hugetlb(unsign
 	if (TestSetPageHWPoison(head)) {
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 		       pfn);
-		return 0;
+		return -EHWPOISON;
 	}
 
 	num_poisoned_pages_inc();
@@ -1452,6 +1452,7 @@ try_again:
 	if (TestSetPageHWPoison(p)) {
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 			pfn);
+		res = -EHWPOISON;
 		goto unlock_mutex;
 	}
 
_

Patches currently in -mm which might be from yaoaili@xxxxxxxxxxxx are

mmhwpoison-return-ehwpoison-to-denote-that-the-page-has-already-been-poisoned.patch