On Tue, Feb 04, 2025, Maxim Levitsky wrote:
> On Tue, 2025-02-04 at 11:18 -0800, Sean Christopherson wrote:
> > On Mon, Feb 03, 2025, Maxim Levitsky wrote:
> > @@ -3219,20 +3228,6 @@ static int interrupt_window_interception(struct kvm_vcpu *vcpu)
> >  	kvm_make_request(KVM_REQ_EVENT, vcpu);
> >  	svm_clear_vintr(to_svm(vcpu));
> >  
> > -	/*
> > -	 * If not running nested, for AVIC, the only reason to end up here is ExtINTs.
> > -	 * In this case AVIC was temporarily disabled for
> > -	 * requesting the IRQ window and we have to re-enable it.
> > -	 *
> > -	 * If running nested, still remove the VM wide AVIC inhibit to
> > -	 * support case in which the interrupt window was requested when the
> > -	 * vCPU was not running nested.
> > -
> > -	 * All vCPUs which run still run nested, will remain to have their
> > -	 * AVIC still inhibited due to per-cpu AVIC inhibition.
> > -	 */
> 
> Please keep these comment that explain why inhibit has to be cleared here,

Ya, I'll make sure there are good comments that capture everything before
posting anything.

> and the code as well.
> 
> The reason is that IRQ window can be requested before nested entry, which
> will lead to VM wide inhibit, and the interrupt window can happen while
> nested because nested hypervisor can opt to not intercept interrupts. If that
> happens, the AVIC will remain inhibited forever.

Ah, because the svm_clear_vintr() triggered by enter_svm_guest_mode()'s
svm_set_gif(svm, true) will occur with in_guest_mode() == true.

But doesn't that bug exist today?  Clearing the inhibit relies on
interrupt_window_interception() being reached, and so if the ExtINT is injected
into L2, KVM will never reach interrupt_window_interception().  KVM will
evaluate pending events in that case:

	svm->vcpu.arch.nmi_pending ||
	    kvm_cpu_has_injectable_intr(&svm->vcpu) ||
	    kvm_apic_has_pending_init_or_sipi(&svm->vcpu))
		kvm_make_request(KVM_REQ_EVENT, &svm->vcpu);

but the nested pending VMRUN will block events and not get to the point where
KVM enables an IRQ window:

	r = can_inject ? kvm_x86_call(interrupt_allowed)(vcpu, true) : -EBUSY;
	if (r < 0)
		goto out;		<======= Taken for nested VMRUN pending
	if (r) {
		int irq = kvm_cpu_get_interrupt(vcpu);

		if (!WARN_ON_ONCE(irq == -1)) {
			kvm_queue_interrupt(vcpu, irq, false);
			kvm_x86_call(inject_irq)(vcpu, false);
			WARN_ON(kvm_x86_call(interrupt_allowed)(vcpu, true) < 0);
		}
	}
	if (kvm_cpu_has_injectable_intr(vcpu))
		kvm_x86_call(enable_irq_window)(vcpu);

and if L2 has IRQs enabled, the subsequent KVM_REQ_EVENT processing after the
immediate exit will inject into L2, without re-opening a window and thus
without triggering interrupt_window_interception().

Ha!  I was going to suggest hooking svm_leave_nested(), and lo and behold KVM
already does that for KVM_REQ_APICV_UPDATE to ensure it has an up-to-date view
of the inhibits.

After much staring, I suspect the reason KVM hooks interrupt_window_interception()
and not svm_clear_vintr() is to avoid thrashing the inhibit, and thus the VM-wide
lock/state, on nested entry/exit.  E.g. on a typical

	gif=0 => VMRUN => gif=1 => #VMEXIT => gif=0

sequence, KVM would clear the inhibit while running L2, and then re-enable the
inhibit when control transfers back to L1.

But that doesn't gel with this comment in interrupt_window_interception():

	 * If running nested, still remove the VM wide AVIC inhibit to
	 * support case in which the interrupt window was requested when the
	 * vCPU was not running nested.
That code/comment exists specifically for the case where L1 is not intercepting
IRQs, and so KVM is effectively using V_IRQ in vmcb02 to detect interrupt
windows for L1 IRQs.  If L1 _is_ intercepting IRQs, then
interrupt_window_interception() will never be reached while L2 is active,
because the only reason KVM would set the V_IRQ intercept in vmcb02 would be on
behalf of L1, i.e. because of vmcb12.

svm_clear_vintr() always operates on (at least) vmcb01, and VMRUN
unconditionally sets GIF=1, which means that enter_svm_guest_mode() will always
do svm_clear_vintr() via svm_set_gif(svm, true).  I.e. KVM will keep the
VM-wide inhibit set until control transfers back to L1 *and* an interrupt
window is triggered.

Even the "L1 doesn't intercept IRQs" case is incongruous, because if IRQs are
enabled in L2 at the time of VMRUN, KVM will immediately inject L1's ExtINT
into L2 without taking an interrupt window #VMEXIT.

I don't see a perfect solution.  Barring a rather absurd number of checks, KVM
will inevitably leave the VM-wide inhibit in place while running L2, or KVM
will risk thrashing the VM-wide inhibit across VMRUN.

/facepalm

It's still not perfect, but the obvious-in-hindsight answer is to clear the IRQ
window inhibit when KVM actually injects an interrupt and there's no longer an
injectable interrupt.  That optimizes all the paths: if L1 isn't intercepting
IRQs, KVM will drop the inhibit as soon as an interrupt is injected into L2.
If L1 is intercepting IRQs, KVM will keep the inhibit until the IRQ is injected
into L1.

Unless I'm missing something, the inhibit itself should prevent an injectable
IRQ from disappearing, i.e. AVIC won't be left inhibited.

So this for the SVM changes?

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 7640a84e554a..2a5cf7029b26 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3219,20 +3219,6 @@ static int interrupt_window_interception(struct kvm_vcpu *vcpu)
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 	svm_clear_vintr(to_svm(vcpu));
 
-	/*
-	 * If not running nested, for AVIC, the only reason to end up here is ExtINTs.
-	 * In this case AVIC was temporarily disabled for
-	 * requesting the IRQ window and we have to re-enable it.
-	 *
-	 * If running nested, still remove the VM wide AVIC inhibit to
-	 * support case in which the interrupt window was requested when the
-	 * vCPU was not running nested.
-
-	 * All vCPUs which run still run nested, will remain to have their
-	 * AVIC still inhibited due to per-cpu AVIC inhibition.
-	 */
-	kvm_clear_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_IRQWIN);
-
 	++vcpu->stat.irq_window_exits;
 	return 1;
 }
@@ -3670,6 +3656,23 @@ static void svm_inject_irq(struct kvm_vcpu *vcpu, bool reinjected)
 	struct vcpu_svm *svm = to_svm(vcpu);
 	u32 type;
 
+	/*
+	 * If AVIC was inhibited in order to detect an IRQ window, and there's
+	 * no other injectable interrupts pending or L2 is active (see below),
+	 * then drop the inhibit as the window has served its purpose.
+	 *
+	 * If L2 is active, this path is reachable if L1 is not intercepting
+	 * IRQs, i.e. if KVM is injecting L1 IRQs into L2.  AVIC is locally
+	 * inhibited while L2 is active; drop the VM-wide inhibit to optimize
+	 * the case in which the interrupt window was requested while L1 was
+	 * active (the vCPU was not running nested).
+	 */
+	if (svm->avic_irq_window &&
+	    (!kvm_cpu_has_injectable_intr(vcpu) || is_guest_mode(vcpu))) {
+		svm->avic_irq_window = false;
+		kvm_dec_apicv_irq_window(svm->vcpu.kvm);
+	}
+
 	if (vcpu->arch.interrupt.soft) {
 		if (svm_update_soft_interrupt_rip(vcpu))
 			return;
@@ -3881,17 +3884,28 @@ static void svm_enable_irq_window(struct kvm_vcpu *vcpu)
 	 */
 	if (vgif || gif_set(svm)) {
 		/*
-		 * IRQ window is not needed when AVIC is enabled,
-		 * unless we have pending ExtINT since it cannot be injected
-		 * via AVIC. In such case, KVM needs to temporarily disable AVIC,
-		 * and fallback to injecting IRQ via V_IRQ.
+		 * KVM only enables IRQ windows when AVIC is enabled if there's
+		 * a pending ExtINT since it cannot be injected via AVIC (ExtINT
+		 * bypasses the local APIC).  V_IRQ is ignored by hardware when
+		 * AVIC is enabled, and so KVM needs to temporarily disable
+		 * AVIC in order to detect when it's ok to inject the ExtINT.
 		 *
-		 * If running nested, AVIC is already locally inhibited
-		 * on this vCPU, therefore there is no need to request
-		 * the VM wide AVIC inhibition.
+		 * If running nested, AVIC is already locally inhibited on this
+		 * vCPU (L2 vCPUs use a different MMU that never maps the AVIC
+		 * backing page), therefore there is no need to increment the
+		 * VM-wide AVIC inhibit.  KVM will re-evaluate events when the
+		 * vCPU exits to L1 and enable an IRQ window if the ExtINT is
+		 * still pending.
+		 *
+		 * Note, the IRQ window inhibit needs to be updated even if
+		 * AVIC is inhibited for a different reason, as KVM needs to
+		 * keep AVIC inhibited if the other reason is cleared and there
+		 * is still an injectable interrupt pending.
 		 */
-		if (!is_guest_mode(vcpu))
-			kvm_set_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_IRQWIN);
+		if (enable_apicv && !svm->avic_irq_window && !is_guest_mode(vcpu)) {
+			svm->avic_irq_window = true;
+			kvm_inc_apicv_irq_window(vcpu->kvm);
+		}
 
 		svm_set_vintr(svm);
 	}
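
For completeness: the diff above relies on a new per-vCPU svm->avic_irq_window
flag (svm.h) plus common x86 kvm_inc_apicv_irq_window() /
kvm_dec_apicv_irq_window() helpers, none of which are shown.  Purely as a rough
sketch of one possible shape, and not part of the proposal above (the
kvm_arch field, the atomic counter, and the hand-waving about serializing the
0<->1 transitions are all illustrative assumptions), the helpers could simply
refcount the existing IRQWIN inhibit:

	/* arch/x86/kvm/svm/svm.h, in struct vcpu_svm (name taken from the diff above): */
	bool avic_irq_window;

	/* arch/x86/include/asm/kvm_host.h, in struct kvm_arch (assumed field): */
	atomic_t apicv_irq_window;

	/* arch/x86/kvm/x86.c (sketch only) */
	void kvm_inc_apicv_irq_window(struct kvm *kvm)
	{
		/*
		 * First vCPU to request an IRQ window sets the VM-wide
		 * inhibit.  A real implementation would need to serialize
		 * the 0<->1 transitions against the decrement below, e.g.
		 * by doing the count+inhibit update under apicv_update_lock.
		 */
		if (atomic_inc_return(&kvm->arch.apicv_irq_window) == 1)
			kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_IRQWIN);
	}
	EXPORT_SYMBOL_GPL(kvm_inc_apicv_irq_window);

	void kvm_dec_apicv_irq_window(struct kvm *kvm)
	{
		/* Last vCPU to finish its IRQ window drops the VM-wide inhibit. */
		if (atomic_dec_return(&kvm->arch.apicv_irq_window) == 0)
			kvm_clear_apicv_inhibit(kvm, APICV_INHIBIT_REASON_IRQWIN);
	}
	EXPORT_SYMBOL_GPL(kvm_dec_apicv_irq_window);

With something along those lines, svm_enable_irq_window() and svm_inject_irq()
only pair their own inc/dec, and one vCPU dropping its window can't clobber a
window that a different vCPU still needs, which is the point of moving from a
raw set/clear of the inhibit to inc/dec.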