> On Jan 12, 2021, at 12:52 PM, Luck, Tony <tony.luck@xxxxxxxxx> wrote:
>
> On Tue, Jan 12, 2021 at 10:57:07AM -0800, Andy Lutomirski wrote:
>>> On Tue, Jan 12, 2021 at 10:24 AM Luck, Tony <tony.luck@xxxxxxxxx> wrote:
>>>
>>> On Tue, Jan 12, 2021 at 09:21:21AM -0800, Andy Lutomirski wrote:
>>>> Well, we need to do *something* when the first __get_user() trips the
>>>> #MC. It would be nice if we could actually fix up the page tables
>>>> inside the #MC handler, but, if we're in a pagefault_disable() context
>>>> we might have locks held. Heck, we could have the pagetable lock
>>>> held, be inside NMI, etc. Skipping the task_work_add() might actually
>>>> make sense if we get a second one.
>>>>
>>>> We won't actually infinite loop in pagefault_disable() context -- if
>>>> we would, then we would also infinite loop just from a regular page
>>>> fault, too.
>>>
>>> Fixing the page tables inside the #MC handler to unmap the poison
>>> page would indeed be a good solution. But, as you point out, not possible
>>> because of locks.
>>>
>>> Could we take a more drastic approach? We know that this case the kernel
>>> is accessing a user address for the current process. Could the machine
>>> check handler just re-write %cr3 to point to a kernel-only page table[1].
>>> I.e. unmap the entire current user process.
>>
>> That seems scary, especially if we're in the middle of a context
>> switch when this happens. We *could* make it work, but I'm not at all
>> convinced it's wise.
>
> Scary? It's terrifying!
>
> But we know that the fault happend in a get_user() or copy_from_user() call
> (i.e. an RIP with an extable recovery address). Does context switch
> access user memory?

No, but NMI can. The case that would be very very hard to deal with is if we get an NMI just before IRET/SYSRET and get #MC inside that NMI.

What we should probably do is have a percpu list of pending memory failure cleanups and just accept that we’re going to sometimes get a second MCE (or third or fourth) before we can get to it. Can we do the cleanup from an interrupt? IPI-to-self might be a credible approach, if so.
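
To make the percpu-list idea concrete, here is a rough userspace sketch, not kernel code. Every name in it (pending_mf, mf_note_poison(), mf_drain(), NR_CPUS_SKETCH, MAX_PENDING) is made up for illustration, the "self-IPI" is only a flag, and it ignores the hard part, which is making the list update safe against a nested #MC or NMI. It only shows the bookkeeping: a repeat #MC on a page that is already queued is accepted and adds nothing, and the actual cleanup is deferred to a later drain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS_SKETCH 4   /* hypothetical, for this sketch only */
#define MAX_PENDING    8   /* hypothetical per-CPU capacity */

/* Hypothetical per-CPU record of poisoned pages awaiting cleanup. */
struct pending_mf {
	unsigned long pfn[MAX_PENDING];
	size_t count;
	bool ipi_raised;   /* stands in for "self-IPI already requested" */
};

static struct pending_mf pending[NR_CPUS_SKETCH];

/* Simulated #MC path: no locks, no allocation, just note the page. */
static bool mf_note_poison(int cpu, unsigned long pfn)
{
	struct pending_mf *p = &pending[cpu];

	for (size_t i = 0; i < p->count; i++)
		if (p->pfn[i] == pfn)
			return true;   /* second (or third) #MC on a queued page */

	if (p->count == MAX_PENDING)
		return false;          /* list full: caller needs a fallback */

	p->pfn[p->count++] = pfn;
	p->ipi_raised = true;          /* real code would request the self-IPI here */
	return true;
}

/* Simulated interrupt path: hand every queued page to the cleanup hook. */
static size_t mf_drain(int cpu, void (*cleanup)(unsigned long pfn))
{
	struct pending_mf *p = &pending[cpu];
	size_t n = p->count;

	for (size_t i = 0; i < n; i++)
		cleanup(p->pfn[i]);
	p->count = 0;
	p->ipi_raised = false;
	return n;
}

/* Trivial cleanup hook used to exercise the sketch. */
static size_t cleaned;
static void count_cleanup(unsigned long pfn)
{
	(void)pfn;
	cleaned++;
}
```

Whether the drain step can really run from interrupt context is exactly the open question above; the sketch assumes it can.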