> This one is against 6.1 (previous one was against v6.9-rc2):
>
> Again, compile tested only

Oscar,

Both the 6.1 and 6.9-rc2 patches make the BUG (and subsequent issues) go away.

Here's what's happening. When the machine check occurs there's a scramble
from various subsystems to report the memory error.

ghes_do_memory_failure() calls memory_failure_queue(), which later calls
memory_failure() from a kernel thread. Side note: this happens TWICE for
each error. I'm not sure yet whether this is a BIOS issue (logging the
error more than once) or some Linux issue in the acpi/apei/ghes.c code.

uc_decode_notifier() [called from a different kernel thread] also calls
do_memory_failure().

Finally, kill_me_maybe() [called from task_work on return to the
application when returning from the machine check handler] also calls
memory_failure().

memory_failure() is somewhat prepared for multiple reports of the same
error. It uses an atomic test-and-set operation to mark the page as
poisoned. The first caller to report the error does all the real work.
Late arrivals take a shorter path, but may still take some action(s)
depending on the "flags" passed in:

	if (TestSetPageHWPoison(p)) {
		pr_err("%#lx: already hardware poisoned\n", pfn);
		res = -EHWPOISON;
		if (flags & MF_ACTION_REQUIRED)
			res = kill_accessing_process(current, pfn, flags);
		if (flags & MF_COUNT_INCREASED)
			put_page(p);
		goto unlock_mutex;
	}

In this case the last caller to arrive has MF_ACTION_REQUIRED set, so it
calls kill_accessing_process() ... which is in the stack trace that led
to the:

	kernel BUG at include/linux/swapops.h:88!

I'm not sure that I fully understand your patch. I guess it is making
sure to handle the case where the page has already been marked as
poisoned?

Anyway ... thanks for the quick fix. I hope the above helps write a good
commit message to get this applied and backported to stable.

Tested-by: Tony Luck <tony.luck@xxxxxxxxx>

-Tony
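For anyone following along, the "first reporter does the real work, late arrivals take the short path" behaviour of TestSetPageHWPoison() can be modeled in userspace roughly as below. This is only a sketch under stated assumptions, not kernel code: the hwpoison[] array stands in for the per-page PG_hwpoison bit, C11 atomic_exchange() stands in for the kernel's test-and-set, and report_error(), SIM_ACTION_REQUIRED and the result enum are hypothetical names made up for illustration. It does not model kill_accessing_process(), page refcounts or the mutex.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical userspace model of the duplicate-report handling.
 * hwpoison[] plays the role of the per-page PG_hwpoison bit.
 */
#define NPAGES 16
#define SIM_ACTION_REQUIRED 0x1 /* hypothetical flag, not the kernel's value */

static atomic_bool hwpoison[NPAGES]; /* static storage: all start false */

enum report_result {
	REPORT_HANDLED,          /* first reporter: full handling path */
	REPORT_ALREADY_POISONED, /* late arrival, nothing more to do */
	REPORT_KILL_CURRENT,     /* late arrival that must still act */
};

static enum report_result report_error(unsigned long pfn, int flags)
{
	assert(pfn < NPAGES);

	/*
	 * atomic_exchange() returns the previous value, so exactly one
	 * caller per page sees "false" and takes the full path.
	 */
	if (atomic_exchange(&hwpoison[pfn], true)) {
		printf("%#lx: already hardware poisoned\n", pfn);
		if (flags & SIM_ACTION_REQUIRED)
			return REPORT_KILL_CURRENT;
		return REPORT_ALREADY_POISONED;
	}

	/* First reporter: the real recovery work would happen here. */
	return REPORT_HANDLED;
}
```

Calling report_error() several times for the same pfn, with only the last call passing SIM_ACTION_REQUIRED, mirrors the ordering described above: the first call takes the full path, and only the late arrival that needs action goes on to the "kill" branch, which is the path that hit the BUG.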