> One big question here: are memory failure #MC exceptions synchronous
> or can they be delayed? If we get a memory failure, is it possible
> that the #MC hits some random context and not the actual context where
> the error occurred?

There are a few cases:

1) SRAO (Software recoverable action optional) [Patrol scrub or L3 cache eviction]

These aren't synchronous with any core execution. Using machine check to
signal was probably a mistake - compounded by it being broadcast :-(
Could pick any CPU to handle (we actually choose the first to arrive in
do_machine_check()). That guy should arrange to soft offline the affected
page. Every CPU can return to what it was doing before.

2) SRAR (Software recoverable action required)

These are synchronous. Starting with Skylake they may be signaled just to
the thread that hit the poison. Earlier generations broadcast.

2a) Hit in ring3 code ... we want to offline the page and SIGBUS the task(s)
2b) memcpy_mcsafe() ... kernel has a recovery path. "Return" to the recovery
    code instead of to the original RIP.
2c) copy_from_user ... not implemented yet. We are in kernel, but would like
    to treat this like case 2a

3) Fatal

Always broadcast. Some bank has MCi_STATUS.PCC==1. System must be shut down.

-Tony
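
The cases above can be told apart from the MCi_STATUS bits of a bank. Below is a minimal user-space sketch, assuming the bit positions documented in the Intel SDM machine-check chapter. The enum and the function names (classify_bank, mce_is_synchronous) are hypothetical, and this is not the kernel's severity-grading code.

```c
#include <stdbool.h>
#include <stdint.h>

/* IA32_MCi_STATUS bits (positions as documented in the Intel SDM). */
#define MCI_STATUS_VAL (1ULL << 63) /* register holds a valid error */
#define MCI_STATUS_UC  (1ULL << 61) /* error was not corrected */
#define MCI_STATUS_PCC (1ULL << 57) /* processor context corrupt */
#define MCI_STATUS_S   (1ULL << 56) /* signaled via machine check */
#define MCI_STATUS_AR  (1ULL << 55) /* action required */

/* Hypothetical classification, ordered from least to most severe. */
enum mce_class {
    MCE_NONE,      /* bank does not hold a valid error */
    MCE_CORRECTED, /* corrected by hardware */
    MCE_UCNA,      /* uncorrected, no action required */
    MCE_SRAO,      /* case 1: patrol scrub / L3 eviction */
    MCE_SRAR,      /* case 2: a thread consumed poison */
    MCE_FATAL      /* case 3: PCC set, system must go down */
};

/* Classify a single bank from its MCi_STATUS value. */
enum mce_class classify_bank(uint64_t status)
{
    if (!(status & MCI_STATUS_VAL))
        return MCE_NONE;
    if (status & MCI_STATUS_PCC)
        return MCE_FATAL;
    if (!(status & MCI_STATUS_UC))
        return MCE_CORRECTED;
    if (!(status & MCI_STATUS_S))
        return MCE_UCNA;
    return (status & MCI_STATUS_AR) ? MCE_SRAR : MCE_SRAO;
}

/*
 * Only SRAR is described above as synchronous with the faulting
 * context; SRAO can arrive while a CPU runs unrelated code.
 */
bool mce_is_synchronous(enum mce_class c)
{
    return c == MCE_SRAR;
}
```

A handler built on this would take the worst class across all banks: soft offline the page for SRAO, pick one of the 2a/2b/2c paths for SRAR depending on where the poison was consumed, and shut down on fatal.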