On Thu, 25 Mar 2021 17:02:35 -0700, Tony Luck wrote:
...
> But there are places in the kernel where the code assumes that this
> EFAULT return was simply because of a page fault. The code takes some
> action to fix that, and then retries the access. This results in a second
> machine check.

What about returning EHWPOISON instead of EFAULT and updating the callers
to handle EHWPOISON explicitly, i.e., not retrying but giving up on the
page?

My main concern is that the strong assumption that the kernel can't hit
more than a fixed number of poisoned cache lines before returning to user
space may simply not be true. When a DIMM goes bad, it can easily affect an
entire bank or an entire RAM device (chip). Even with memory interleaving,
it's possible that a kernel control path touches lots of poisoned cache
lines in the buffer it is working through.
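
To make the EHWPOISON suggestion concrete, here is a rough userspace sketch of
the caller-side pattern I have in mind. It is not taken from any kernel code
path: access_with_retry(), struct sim_page, sim_access() and sim_fixup() are
hypothetical names invented for illustration, and the fallback value for
EHWPOISON assumes the one in Linux's asm-generic/errno.h. The only point is
that -EFAULT is treated as fixable and retried, while -EHWPOISON ends the loop
immediately so the poisoned line is not touched a second time.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#ifndef EHWPOISON
#define EHWPOISON 133	/* assumption: value from Linux asm-generic/errno.h */
#endif

/* Hypothetical callbacks: both return 0 on success or a negative errno. */
typedef int (*access_fn)(void *ctx);
typedef int (*fixup_fn)(void *ctx);

/*
 * Retry an access only when it failed with -EFAULT (assumed to be a plain
 * page fault that fixup() can resolve). On -EHWPOISON, give up at once:
 * retrying would only consume the poison again.
 */
static int access_with_retry(access_fn access, fixup_fn fixup, void *ctx,
			     int max_retries)
{
	assert(max_retries >= 0);

	for (int attempt = 0; ; attempt++) {
		int ret = access(ctx);

		if (ret == 0)
			return 0;
		if (ret == -EHWPOISON)
			return ret;	/* poisoned page: never retry */
		if (ret != -EFAULT || attempt >= max_retries)
			return ret;	/* unknown error or retries exhausted */

		ret = fixup(ctx);	/* e.g. fault the page in */
		if (ret)
			return ret;
	}
}

/* Simulated page state, purely for exercising the loop above. */
struct sim_page {
	int poisoned;	/* nonzero: every access reports poison */
	int present;	/* zero: access faults until fixed up */
	int accesses;	/* number of access attempts made so far */
};

static int sim_access(void *ctx)
{
	struct sim_page *p = ctx;

	p->accesses++;
	if (p->poisoned)
		return -EHWPOISON;
	if (!p->present)
		return -EFAULT;
	return 0;
}

static int sim_fixup(void *ctx)
{
	struct sim_page *p = ctx;

	p->present = 1;
	return 0;
}
```

With a simulated poisoned page the loop stops after a single access attempt,
whereas a merely non-present page is fixed up and succeeds on the second
attempt. This does nothing for the second concern above: a buffer spanning
many poisoned cache lines can still hit a fresh one on each new access.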