Re: [PATCH v2] x86/kvm: Disable KVM_ASYNC_PF_SEND_ALWAYS

Andy Lutomirski <luto@xxxxxxxxxxxxxx> · Tue, 7 Apr 2020 21:48:02 -0700

> On Apr 7, 2020, at 3:48 PM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
> 
> Andy Lutomirski <luto@xxxxxxxxxxxxxx> writes:
>>>> On Apr 7, 2020, at 1:20 PM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>>> Andy Lutomirski <luto@xxxxxxxxxxxxxx> writes:
>>>> “Page is malfunctioning” is tricky because you *must* deliver the
>>>> event. x86’s #MC is not exactly a masterpiece, but it does kind of
>>>> work.
>>> 
>>> Nooooo. This does not need #MC at all. Don't even think about it.
>> 
>> Yessssssssssss.  Please do think about it. :)
> 
> I stared too much into that code recently that even thinking about it
> hurts. :)
> 
>>> The point is that the access to such a page is either happening in user
>>> space or in kernel space with a proper exception table fixup.
>>> 
>>> That means a real #PF is perfectly fine. That can be injected any time
>>> and does not have the interrupt semantics of async PF.
>> 
>> The hypervisor has no way to distinguish between
>> MOV-and-has-valid-stack-and-extable-entry and
>> MOV-definitely-can’t-fault-here.  Or, for that matter,
>> MOV-in-do_page_fault()-will-recurve-if-it-faults.
> 
> The mechanism which Vivek wants to support has a well defined usage
> scenario, i.e. either user space or kernel-valid-stack+extable-entry.
> 
> So why do you want to route that through #MC? 

To be clear, I hate #MC as much as the next person.  But I think it needs to be an IST vector, and the #MC *vector* isn’t so terrible.  (The fact that we can’t atomically return from #MC and re-enable CR4.MCE in a single instruction is problematic but not the end of the world.). But see below — I don’t think my suggestion should work quite the way you interpreted it.

>>> 
>>> 
>>>  guest -> #PF runs and either sends signal to user space or runs
>>>           the exception table fixup for a kernel fault.
>> 
>> Or guest blows up because the fault could not be recovered using #PF.
> 
> Not for the use case at hand. And for that you really want to use
> regular #PF.
> 
> The scenario I showed above is perfectly legit:
> 
>   guest:
>        copy_to_user()          <- Has extable
>           -> FAULT
> 
> host:
>   Oh, page is not there, give me some time to figure it out.
> 
>   inject async fault
> 
>   guest:
>        handles async fault interrupt, enables interrupts, blocks
> 
> host:
>   Situation resolved, shared file was truncated. Tell guest

All good so far.

> 
>   Inject #MC

No, not what I meant. Host has two sane choices here IMO:

1. Tell the guest that the page is gone as part of the wakeup. No #PF or #MC.

2. Tell guest that it’s resolved and inject #MC when the guest retries.  The #MC is a real fault, RIP points to the right place, etc.

> 
>   
>> 
>> 
>> 1. Access to bad memory results in an async-page-not-present, except
>> that, it’s not deliverable, the guest is killed.
> 
> That's incorrect. The proper reaction is a real #PF. Simply because this
> is part of the contract of sharing some file backed stuff between host
> and guest in a well defined "virtio" scenario and not a random access to
> memory which might be there or not.

The problem is that the host doesn’t know when #PF is safe. It’s sort of the same problem that async pf has now.  The guest kernel could access the problematic page in the middle of an NMI, under pagefault_disable(), etc — getting #PF as a result of CPL0 access to a page with a valid guest PTE is simply not part of the x86 architecture.

> 
> Look at it from the point where async whatever does not exist at all:
> 
>   guest:
>        copy_to_user()          <- Has extable
>           -> FAULT
> 
> host:
>        suspend guest and resolve situation
> 
>        if (page swapped in)
>           resume_guest();
>        else
>           inject_pf();
> 
> And this inject_pf() does not care whether it kills the guest or makes
> it double/triple fault or whatever.
> 
> The 'tell the guest to do something else while host tries to sort it'
> opportunistic thingy turns this into:
> 
>   guest:
>        copy_to_user()          <- Has extable
>           -> FAULT
> 
> host:
>        tell guest to do something else, i.e. guest suspends task
> 
>        if (page swapped in)
>           tell guest to resume suspended task
>        else
>           tell guest to resume suspended task
> 
>   guest resumes and faults again
> 
> host:
>           inject_pf();
> 
> which is pretty much equivalent.

Replace copy_to_user() with some access to a gup-ed mapping with no extable handler and it doesn’t look so good any more.

Of course, the guest will oops if this happens, but the guest needs to be able to oops cleanly. #PF is too fragile for this because it’s not IST, and #PF is the wrong thing anyway — #PF is all about guest-virtual-to-guest-physical mappings.  Heck, what would CR2 be?  The host might not even know the guest virtual address.

> 
>> 2. Access to bad memory results in #MC.  Sure, #MC is a turd, but it’s
>> an *architectural* turd. By all means, have a nice simple PV mechanism
>> to tell the #MC code exactly what went wrong, but keep the overall
>> flow the same as in the native case.
> 
> It's a completely different flow as you evaluate PV turd instead of
> analysing the MCE banks and the other error reporting facilities.

I’m fine with the flow being different. do_machine_check() could have entirely different logic to decide the error in PV.  But I think we should reuse the overall flow: kernel gets #MC with RIP pointing to the offending instruction. If there’s an extable entry that can handle memory failure, handle it. If it’s a user access, handle it.  If it’s an unrecoverable error because it was a non-extable kernel access, oops or panic.

The actual PV part could be extremely simple: the host just needs to tell the guest “this #MC is due to memory failure at this guest physical address”.  No banks, no DIMM slot, no rendezvous crap (LMCE), no other nonsense.  It would be nifty if the host also told the guest what the guest virtual address was if the host knows it.