On Wed, Apr 06, 2022, Maciej S. Szmigiero wrote:
> On 6.04.2022 21:48, Sean Christopherson wrote:
> > On Wed, Apr 06, 2022, Maciej S. Szmigiero wrote:
> > > On 6.04.2022 19:10, Sean Christopherson wrote:
> > > > On Wed, Apr 06, 2022, Maciej S. Szmigiero wrote:
> > > And what if it's L0 that is trying to inject a NMI into L2?
> > > In this case is_guest_mode() is true, but the full NMI injection machinery
> > > should be used.
> >
> > Gah, you're right, I got misled by a benign bug in nested_vmx_l1_wants_exit() and
> > was thinking that NMIs always exit.  The "L1 wants" part should be conditioned on
> > NMI exiting being enabled.  It's benign because KVM always wants "real" NMIs, and
> > so the path is never encountered.
> >
> > @@ -5980,7 +6005,7 @@ static bool nested_vmx_l1_wants_exit(struct kvm_vcpu *vcpu,
> >         switch ((u16)exit_reason.basic) {
> >         case EXIT_REASON_EXCEPTION_NMI:
> >                 intr_info = vmx_get_intr_info(vcpu);
> > -               if (is_nmi(intr_info))
> > +               if (is_nmi(intr_info) && nested_cpu_has_nmi_exiting(vmcs12))
> >                         return true;
> >                 else if (is_page_fault(intr_info))
> >                         return true;
>
> I guess you mean "benign" when having KVM as L1, since other hypervisors may
> let their L2s handle NMIs themselves.

No, this one's truly benign.  The nVMX exit processing is:

	if (nested_vmx_l0_wants_exit())
		handle in L0 / KVM;

	if (nested_vmx_l1_wants_exit())
		handle in L1

	handle in L0 / KVM

Since this is for actual hardware NMIs, the first "L0 wants" check always returns
true for NMIs, so the fact that KVM screws up L1's wants is a non-issue.

> > > It is also incorrect to block L1 -> L2 NMI injection because either L1
> > > or L2 is currently under NMI blocking: the first case is obvious,
> > > the second because it's L1 that is supposed to take care of proper NMI
> > > blocking for L2 when injecting an NMI there.
> >
> > Yep, but I don't think there's a bug here.  At least not for nVMX.
> I agree this scenario should currently work (including on nSVM) - mentioned
> it just as a constraint on solution space.
>
> > > > > With the code in my previous patch set I planned to use
> > > > > exit_during_event_injection() to detect such case, but if we implement
> > > > > VMCB12 EVENTINJ parsing we can simply add a flag that the relevant event
> > > > > comes from L1, so its normal injection side-effects should be skipped.
> > > >
> > > > Do we still need a flag based on the above?  Honest question... I've been staring
> > > > at all this for the better part of an hour and may have lost track of things.
> > >
> > > If checking just is_guest_mode() is not enough due to reasons I described
> > > above then we need to somehow determine in the NMI / IRQ injection handler
> > > whether the event to be injected into L2 comes from L0 or L1.
> > > For this (assuming we do VMCB12 EVENTINJ parsing) I think we need an extra flag.
> >
> > Yes :-(  And I believe the extra flag would need to be handled by KVM_{G,S}ET_VCPU_EVENTS.
>
> Another option for saving and restoring a VM would be to add it to
> KVM_{GET,SET}_NESTED_STATE somewhere (maybe as a part of the saved VMCB12
> control area?).

Ooh.  What if we keep nested_run_pending=true until the injection completes?  Then
we don't even need an extra flag because nested_run_pending effectively says that
any and all injected events are for L1=>L2.  In KVM_GET_NESTED_STATE, shove the
to-be-injected event into the normal vmc*12 injection field, and ignore all
to-be-injected events in KVM_GET_VCPU_EVENTS if nested_run_pending=true.

That should work even for migrating to an older KVM, as keeping nested_run_pending
will cause the target to reprocess the event injection as if it were from nested
VM-Enter, which it technically is.
We could probably get away with completely dropping the intermediate event as the
vmc*12 should still have the original event, but that technically could result in
architecturally incorrect behavior, e.g. if vectoring up to the point of
interception sets A/D bits in the guest.