On Wed, Jun 15, 2022 at 10:00:05AM +0800, zhenwei pi wrote:
> Currently unpoison_memory(unsigned long pfn) is designed for soft
> poison(hwpoison-inject) only. Since 17fae1294ad9d, the KPTE gets
> cleared on a x86 platform once hardware memory corrupts.
>
> Unpoisoning a hardware corrupted page puts page back buddy only,
> the kernel has a chance to access the page with *NOT PRESENT* KPTE.
> This leads BUG during accessing on the corrupted KPTE.
>
> Suggested by David&Naoya, disable unpoison mechanism when a real HW error
> happens to avoid BUG like this:
...
>
> Fixes: 847ce401df392 ("HWPOISON: Add unpoisoning support")
> Fixes: 17fae1294ad9d ("x86/{mce,mm}: Unmap the entire page if the whole page is affected and poisoned")
> Cc: Naoya Horiguchi <naoya.horiguchi@xxxxxxx>
> Cc: David Hildenbrand <david@xxxxxxxxxx>
> Signed-off-by: zhenwei pi <pizhenwei@xxxxxxxxxxxxx>

Cc to stable?
I think that the current approach seems more predictable to me than earlier
versions, so I can agree with sending this to stable a little more confidently.

> ---
>  Documentation/vm/hwpoison.rst |  3 ++-
>  drivers/base/memory.c         |  2 +-
>  include/linux/mm.h            |  1 +
>  mm/hwpoison-inject.c          |  2 +-
>  mm/madvise.c                  |  2 +-
>  mm/memory-failure.c           | 12 ++++++++++++
>  6 files changed, 18 insertions(+), 4 deletions(-)
>
...
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index b85661cbdc4a..385b5e99bfc1 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -69,6 +69,8 @@ int sysctl_memory_failure_recovery __read_mostly = 1;
>
>  atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
>
> +static bool hw_memory_failure;

Could you set the initial value explicitly? Relying on the default value is
fine, but doing as the surrounding code does is better for consistency.
And this variable can be updated only once, so adding the __read_mostly macro
is also fine.

Thanks,
Naoya Horiguchi
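
For illustration, the declaration requested above might look like the rough
userspace sketch below. It is not the actual patch: the stub __read_mostly
macro, the helper names note_hw_memory_failure() and try_unpoison(), and the
-EOPNOTSUPP return value are all assumptions made for this example only.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Userspace stand-in (assumption): in the kernel, __read_mostly is a
 * section annotation for rarely written data. Here it expands to nothing
 * so the sketch compiles outside the kernel tree.
 */
#ifndef __read_mostly
#define __read_mostly
#endif

/* Explicit initial value plus __read_mostly, like the neighbouring variables. */
static bool hw_memory_failure __read_mostly = false;

/* Hypothetical helper: record that a real hardware memory error was handled. */
static void note_hw_memory_failure(void)
{
	hw_memory_failure = true;	/* set once, never cleared */
}

/*
 * Hypothetical stand-in for the unpoison entry point: refuse to unpoison
 * once a real hardware error has been seen. The error code is an assumption.
 */
static int try_unpoison(unsigned long pfn)
{
	(void)pfn;			/* unused in this sketch */
	if (hw_memory_failure)
		return -EOPNOTSUPP;
	return 0;
}
```

With this shape, try_unpoison() succeeds until note_hw_memory_failure() has
been called and is refused afterwards, which is the "disable unpoison after a
real HW error" behaviour the commit message describes.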