On Tue, Apr 21, 2020 at 11:28:23AM -0700, Jim Mattson wrote:
> The more I look at that call to kvm_clear_exception_queue(), the more
> convinced I am that it's wrong. The comment above it says:
>
>         /*
>          * Drop what we picked up for L2 via vmx_complete_interrupts. It is
>          * preserved above and would only end up incorrectly in L1.
>          */
>
> The first sentence is just wrong. Vmx_complete_interrupts may not be
> where the NMI/exception/interrupt came from. And the second sentence
> is not entirely true. Only *injected* events are "preserved above" (by
> the call to vmcs12_save_pending_event). However,
> kvm_clear_exception_queue zaps both injected events and pending
> events. Moreover, vmcs12_save_pending_event "preserves" the event by
> stashing it in the IDT-vectoring info field of vmcs12, even when the
> current VM-exit (from L2 to L1) did not (and in some cases cannot)
> occur during event delivery (e.g. VMX-preemption timer expired).

The comments are definitely wrong, or perhaps incomplete, but the
behavior isn't "wrong", assuming the rest of the nested event handling
is correct (hint, it's not).

Even with imperfect event handling, clearing queued/pending exceptions
on nested VM-Exit isn't terrible behavior, as VMX doesn't have a notion
of queued/pending exceptions (ignoring #DBs for now).  If the VM-Exit
occurred while vectoring an event, that event is captured in
IDT_VECTORING_INFO (first sketch at the end of this mail).  Everything
else that might have happened is an alternate timeline.

The piece that's likely missing is updating GUEST_PENDING_DBG_EXCEPTIONS
if a single-step #DB was pending after emulation.  KVM currently
propagates the field as-is from vmcs02, which would miss any emulated
#DB.  But that is effectively orthogonal to kvm_clear_exception_queue(),
e.g. KVM needs to "save" the pending #DB like it "saves" an injected
event (second sketch below), but clearing the #DB from the queue is
correct.

The main issue I have with the nested event code is that the checks are
effectively grouped by exiting versus non-exiting instead of being
grouped by priority.  That makes it extremely difficult to correctly
prioritize events, because each half needs to know the other's behavior
even though the code is "separated".  E.g. my suggestion to return
immediately if an NMI is pending and won't VM-Exit is wrong, as the
behavior should also be conditioned on NMIs being unmasked in L2 (third
sketch below).  Actually implementing that check is a gigantic mess in
the current code base and simply isn't worth the marginal benefit.

Unwinding the mess, i.e. processing each event class exactly once per
run loop, is extremely difficult because there are one-off cases that
have been piled on top, e.g. calling check_nested_events() from
kvm_vcpu_running() and processing INIT and SIPI in
kvm_apic_accept_events(), whose full impact is impossible to ascertain
simply by reading the code.  The KVM_REQ_EVENT optimization also
exacerbates the problem, i.e. the event checking really should be done
unconditionally on every loop.
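
To make the injected-vs-pending distinction concrete, here's a
simplified sketch of what vmcs12_save_pending_event() does (paraphrased
from memory, not a verbatim copy): only *injected* events are reflected
into vmcs12's IDT-vectoring info, while *pending* events fall through
and are later zapped by kvm_clear_exception_queue().

	static void vmcs12_save_pending_event(struct kvm_vcpu *vcpu,
					      struct vmcs12 *vmcs12)
	{
		u32 idt_vectoring;

		if (vcpu->arch.exception.injected) {
			idt_vectoring = vcpu->arch.exception.nr |
					VECTORING_INFO_VALID_MASK;
			if (kvm_exception_is_soft(vcpu->arch.exception.nr))
				idt_vectoring |= INTR_TYPE_SOFT_EXCEPTION;
			else
				idt_vectoring |= INTR_TYPE_HARD_EXCEPTION;
			vmcs12->idt_vectoring_info_field = idt_vectoring;
		} else if (vcpu->arch.nmi_injected) {
			vmcs12->idt_vectoring_info_field =
				INTR_TYPE_NMI_INTR |
				VECTORING_INFO_VALID_MASK | NMI_VECTOR;
		} else if (vcpu->arch.interrupt.injected) {
			vmcs12->idt_vectoring_info_field =
				vcpu->arch.interrupt.nr |
				VECTORING_INFO_VALID_MASK |
				INTR_TYPE_EXT_INTR;
		}
		/* vcpu->arch.exception.pending is never consulted. */
	}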
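
For the missing #DB piece, a rough and completely untested sketch of
what "saving" a pending single-step #DB on nested VM-Exit might look
like.  The placement (before kvm_clear_exception_queue(), e.g.
somewhere in prepare_vmcs12()) and the DR6 payload masking are
assumptions, not a tested patch:

	/*
	 * Fold a *pending* #DB, e.g. a single-step trap queued during
	 * emulation, into vmcs12's pending debug exceptions field
	 * instead of dropping it.  The #DB payload is the "new" DR6
	 * value, whose B0-B3/BS bits line up with the VMCS field.
	 */
	if (vcpu->arch.exception.pending &&
	    vcpu->arch.exception.nr == DB_VECTOR)
		vmcs12->guest_pending_dbg_exceptions |=
			vcpu->arch.exception.payload &
			(DR_TRAP_BITS | DR_STEP);

	kvm_clear_exception_queue(vcpu);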
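
And for the priority-grouping point, an illustrative-only sketch of
what the NMI leg might look like if events were processed strictly by
priority, one class at a time.  nested_exit_on_nmi() exists today;
vmx_nmi_blocked() is an assumed helper that reports whether NMIs are
masked in L2 (prior NMI, MOV SS shadow, etc.), which is exactly the
check that's a mess to implement in the current code:

	/* Only act on a pending NMI if it's actually deliverable. */
	if (vcpu->arch.nmi_pending && !vmx_nmi_blocked(vcpu)) {
		if (block_nested_events)
			return -EBUSY;
		if (nested_exit_on_nmi(vcpu)) {
			nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI,
					  NMI_VECTOR | INTR_TYPE_NMI_INTR |
					  INTR_INFO_VALID_MASK, 0);
			return 0;
		}
		/* NMI goes to L2 via the normal injection path. */
		return 0;
	}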