Part 1 deals with the process that triggered the copy-on-write fault
with a store to a shared read-only page. That process is sent a SIGBUS
with the usual machine check decoration to specify the virtual address
of the lost page, together with the scope.

Part 2 sets up to asynchronously take the page with the uncorrected
error offline to prevent additional machine check faults. H/t to
Miaohe Lin <linmiaohe@xxxxxxxxxx> and Shuai Xue
<xueshuai@xxxxxxxxxxxxxxxxx> for pointing me to the existing function
to queue a call to memory_failure().

On x86 there is some duplicate reporting (because the error is
signalled by the memory controller as well as by the core that
triggered the machine check). Console logs look like this:

[ 1647.723403] mce: [Hardware Error]: Machine check events logged
	Machine check from kernel copy routine

[ 1647.723414] MCE: Killing einj_mem_uc:3600 due to hardware memory corruption fault at 7f3309503400
	x86 fault handler sends SIGBUS to child process

[ 1647.735183] Memory failure: 0x905b92d: recovery action for dirty LRU page: Recovered
	Async call to memory_failure() from copy on write path

[ 1647.748397] Memory failure: 0x905b92d: already hardware poisoned
	uc_decode_notifier() processes memory controller report

[ 1647.761313] MCE: Killing einj_mem_uc:3599 due to hardware memory corruption fault at 7f3309503400
	Parent process tries to read poisoned page. Page has been
	unmapped, so #PF handler sends SIGBUS

Tony Luck (2):
  mm, hwpoison: Try to recover from copy-on write faults
  mm, hwpoison: When copy-on-write hits poison, take page offline

 include/linux/highmem.h | 24 ++++++++++++++++++++++++
 include/linux/mm.h      |  5 ++++-
 mm/memory.c             | 32 ++++++++++++++++++++++----------
 3 files changed, 50 insertions(+), 11 deletions(-)

-- 
2.37.3
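
For anyone testing from user space: the "machine check decoration" on
the SIGBUS is the si_code (BUS_MCEERR_AR or BUS_MCEERR_AO), the fault
address in si_addr, and the scope in si_addr_lsb, as described in
sigaction(2). Below is a minimal sketch of a handler that decodes
them. It is not part of these patches. The helper names
(poison_scope_bytes, poison_region_start, format_hex,
install_sigbus_handler) are made up for illustration, and the output
format is an assumption, not what einj_mem_uc prints.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Bytes covered by the error: si_addr_lsb is log2 of the lost region. */
static unsigned long poison_scope_bytes(short addr_lsb)
{
	return 1UL << addr_lsb;
}

/* Start of the lost region: clear the low addr_lsb bits of the address. */
static uintptr_t poison_region_start(uintptr_t addr, short addr_lsb)
{
	return addr & ~(((uintptr_t)1 << addr_lsb) - 1);
}

/*
 * Hypothetical helper: write v as fixed-width hex digits into buf without
 * using stdio, so it can be called from a signal handler.
 * Returns the number of characters written (no NUL terminator).
 */
static size_t format_hex(uintptr_t v, char *buf)
{
	static const char digits[] = "0123456789abcdef";
	size_t n = sizeof(v) * 2;

	for (size_t i = 0; i < n; i++)
		buf[i] = digits[(v >> (4 * (n - 1 - i))) & 0xf];
	return n;
}

/* Report the poisoned region (if this is a machine check SIGBUS) and exit. */
static void sigbus_handler(int sig, siginfo_t *info, void *ucontext)
{
	static const char other[] = "SIGBUS: not a machine check\n";
	static const char prefix[] = "SIGBUS: poisoned page at 0x";
	char msg[64];
	size_t len;

	(void)sig;
	(void)ucontext;

	if (info->si_code == BUS_MCEERR_AR || info->si_code == BUS_MCEERR_AO) {
		memcpy(msg, prefix, sizeof(prefix) - 1);
		len = sizeof(prefix) - 1;
		len += format_hex(poison_region_start((uintptr_t)info->si_addr,
						      info->si_addr_lsb),
				  msg + len);
		msg[len++] = '\n';
	} else {
		memcpy(msg, other, sizeof(other) - 1);
		len = sizeof(other) - 1;
	}

	if (write(STDERR_FILENO, msg, len) < 0)
		_exit(2);
	_exit(1);
}

/* Install the handler with SA_SIGINFO so siginfo_t is delivered. */
static int install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}
```

Call install_sigbus_handler() early in main(), before forking the
child that will store to the shared page. The handler avoids printf()
because it is not async-signal-safe, and it exits rather than
returning, since returning would re-execute the faulting access.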