On Tue, Apr 21, 2020 at 11:28:23AM -0700, Jim Mattson wrote:
> The more I look at that call to kvm_clear_exception_queue(), the more
> convinced I am that it's wrong. The comment above it says:
>
>         /*
>          * Drop what we picked up for L2 via vmx_complete_interrupts. It is
>          * preserved above and would only end up incorrectly in L1.
>          */
>
> The first sentence is just wrong. Vmx_complete_interrupts may not be
> where the NMI/exception/interrupt came from. And the second sentence
> is not entirely true. Only *injected* events are "preserved above" (by
> the call to vmcs12_save_pending_event). However,
> kvm_clear_exception_queue zaps both injected events and pending
> events. Moreover, vmcs12_save_pending_event "preserves" the event by
> stashing it in the IDT-vectoring info field of vmcs12, even when the
> current VM-exit (from L2 to L1) did not (and in some cases cannot)
> occur during event delivery (e.g. VMX-preemption timer expired).

The comments are definitely wrong, or perhaps incomplete, but the
behavior isn't "wrong", assuming the rest of the nested event handling
is correct (hint, it's not).

Even with imperfect event handling, clearing queued/pending exceptions
on nested VM-Exit isn't terrible behavior, as VMX doesn't have a notion
of queued/pending exceptions (ignoring #DBs for now).  If the VM-Exit
occurred while vectoring an event, that event is captured in
IDT_VECTORING_INFO (first sketch at the end of this mail).  Everything
else that might have happened is an alternate timeline.

The piece that's likely missing is updating GUEST_PENDING_DBG_EXCEPTIONS
if a single-step #DB was pending after emulation.  KVM currently
propagates the field as-is from vmcs02, which would miss any emulated
#DB.  But that is effectively orthogonal to kvm_clear_exception_queue(),
e.g. KVM needs to "save" the pending #DB like it "saves" an injected
event (second sketch below), but clearing the #DB from the queue is
correct.

The main issue I have with the nested event code is that the checks are
effectively grouped by exiting versus non-exiting instead of being
grouped by priority.  That makes it extremely difficult to correctly
prioritize events, because each half needs to know the other's behavior
even though the code is "separated".  E.g. my suggestion to return
immediately if an NMI is pending and won't VM-Exit is wrong, as the
behavior should also be conditioned on NMIs being unmasked in L2 (third
sketch below).  Actually implementing that check is a gigantic mess in
the current code base and simply isn't worth the marginal benefit.

Unwinding the mess, i.e. processing each event class exactly once per
run loop, is extremely difficult because there are one-off cases that
have been piled on top, e.g. calling check_nested_events() from
kvm_vcpu_running() and processing INIT and SIPI in
kvm_apic_accept_events(), whose full impact is impossible to ascertain
simply by reading the code.  The KVM_REQ_EVENT optimization also
exacerbates the problem, i.e. the event checking really should be done
unconditionally on every loop.
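
To make the injected-vs-pending distinction concrete, here's a
simplified sketch of what vmcs12_save_pending_event() does (paraphrased
from memory, not a verbatim copy): only *injected* events are reflected
into vmcs12's IDT-vectoring info, while *pending* events fall through
and are later zapped by kvm_clear_exception_queue().

	static void vmcs12_save_pending_event(struct kvm_vcpu *vcpu,
					      struct vmcs12 *vmcs12)
	{
		u32 idt_vectoring;

		if (vcpu->arch.exception.injected) {
			idt_vectoring = vcpu->arch.exception.nr |
					VECTORING_INFO_VALID_MASK;
			if (kvm_exception_is_soft(vcpu->arch.exception.nr))
				idt_vectoring |= INTR_TYPE_SOFT_EXCEPTION;
			else
				idt_vectoring |= INTR_TYPE_HARD_EXCEPTION;
			vmcs12->idt_vectoring_info_field = idt_vectoring;
		} else if (vcpu->arch.nmi_injected) {
			vmcs12->idt_vectoring_info_field =
				INTR_TYPE_NMI_INTR |
				VECTORING_INFO_VALID_MASK | NMI_VECTOR;
		} else if (vcpu->arch.interrupt.injected) {
			vmcs12->idt_vectoring_info_field =
				vcpu->arch.interrupt.nr |
				VECTORING_INFO_VALID_MASK |
				INTR_TYPE_EXT_INTR;
		}
		/* vcpu->arch.exception.pending is never consulted. */
	}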
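
For the missing #DB piece, a rough and completely untested sketch of
what "saving" a pending single-step #DB on nested VM-Exit might look
like.  The placement (before kvm_clear_exception_queue(), e.g.
somewhere in prepare_vmcs12()) and the DR6 payload masking are
assumptions, not a tested patch:

	/*
	 * Fold a *pending* #DB, e.g. a single-step trap queued during
	 * emulation, into vmcs12's pending debug exceptions field
	 * instead of dropping it.  The #DB payload is the "new" DR6
	 * value, whose B0-B3/BS bits line up with the VMCS field.
	 */
	if (vcpu->arch.exception.pending &&
	    vcpu->arch.exception.nr == DB_VECTOR)
		vmcs12->guest_pending_dbg_exceptions |=
			vcpu->arch.exception.payload &
			(DR_TRAP_BITS | DR_STEP);

	kvm_clear_exception_queue(vcpu);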
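
And for the priority-grouping point, an illustrative-only sketch of
what the NMI leg might look like if events were processed strictly by
priority, one class at a time.  nested_exit_on_nmi() exists today;
vmx_nmi_blocked() is an assumed helper that reports whether NMIs are
masked in L2 (prior NMI, MOV SS shadow, etc.), which is exactly the
check that's a mess to implement in the current code:

	/* Only act on a pending NMI if it's actually deliverable. */
	if (vcpu->arch.nmi_pending && !vmx_nmi_blocked(vcpu)) {
		if (block_nested_events)
			return -EBUSY;
		if (nested_exit_on_nmi(vcpu)) {
			nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI,
					  NMI_VECTOR | INTR_TYPE_NMI_INTR |
					  INTR_INFO_VALID_MASK, 0);
			return 0;
		}
		/* NMI goes to L2 via the normal injection path. */
		return 0;
	}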