On Wed, Apr 22, 2020 at 2:06 PM Sean Christopherson <sean.j.christopherson@xxxxxxxxx> wrote: > > On Mon, Apr 13, 2020 at 05:09:45PM -0700, Jim Mattson wrote: > > Fixes: f4124500c2c13 ("KVM: nVMX: Fully emulate preemption timer") > > Signed-off-by: Jim Mattson <jmattson@xxxxxxxxxx> > > Reviewed-by: Oliver Upton <oupton@xxxxxxxxxx> > > Reviewed-by: Peter Shier <pshier@xxxxxxxxxx> > > ... > > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > > index 83050977490c..aae01253bfba 100644 > > --- a/arch/x86/kvm/vmx/vmx.c > > +++ b/arch/x86/kvm/vmx/vmx.c > > @@ -4682,7 +4682,7 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu) > > if (is_icebp(intr_info)) > > WARN_ON(!skip_emulated_instruction(vcpu)); > > > > - kvm_queue_exception(vcpu, DB_VECTOR); > > + kvm_requeue_exception(vcpu, DB_VECTOR); > > This isn't wrong per se, but it's effectively papering over an underlying > bug, e.g. the same missed preemption timer bug can manifest if the timer > expires while in KVM context (because the hr timer is left running) and KVM > queues an exception for _any_ reason. Most of the scenarios where L0 will > queue an exception for L2 are fairly contrived, but they are possible. I'm now thinking that this is actually wrong, since requeued exceptions may make their way into the vmcs12 IDT-vectoring info, and I believe that ultimately the vmcs12 IDT-vectoring info should be populated only from the vmcs02 IDT-vectoring info (by whatever circuitous route KVM likes). > I believe the correct fix is to open a "preemption timer window" like we do > for pending SMI, NMI and IRQ. It's effectively the same handling a pending > SMI on VMX, set req_immediate_exit in the !inject_pending_event() path. The KVM code that deals with all of these events is really hard to follow. I wish we could take a step back and just implement Table 6-2 from the SDM volume 3 (augmented with the scattered information about VMX events and their priorities relative to their nearest neighbors. Lumping priorities 7 - 10 together (faults that we either intercepted or synthesized in emulation), I think these are the various things we need to check, in this order... 0. Is there a fault to be delivered? (In L2, is it intercepted by L1?) 1. Is there a RESET or machine check event? 2. Is there a trap on task switch? 3. Is there an SMI or an INIT? 3.5 In L2, is there an MTF VM-exit? 4. Is there a #DB trap on the previous instruction? (In L2, is it intercepted by L1?) 4.3 In L2, has the VMX-preemption timer expired? 4.6 In L2, do we need to synthesize an NMI-window VM-exit? 5. Is there an NMI? (In L2, is it intercepted by L1?) 5.3 In L2 do we need to synthesize an interrupt-window VM-exit? 5.6 In L2, do we need to virtualize virtual-interrupt delivery? 6. Is there a maskable interrupt? (In L2, is it intercepted by L1?) 7. Now, we can enter VMX non-root mode.