On Wed, 12 Jul 2023 13:09:45 +0100,
Zenghui Yu <yuzenghui@xxxxxxxxxx> wrote:
>
> On 2023/7/11 15:26, Oliver Upton wrote:
> > On Tue, Jul 11, 2023 at 08:23:25AM +0100, Marc Zyngier wrote:
> >> On Mon, 10 Jul 2023 18:55:53 +0100,
> >> Oliver Upton <oliver.upton@xxxxxxxxx> wrote:
> >>>
> >>> Xiang reports that VMs occasionally fail to boot on GICv4.1 systems when
> >>> running a preemptible kernel, as it is possible that a vCPU is blocked
> >>> without requesting a doorbell interrupt.
> >>>
> >>> The issue is that any preemption that occurs between vgic_v4_put() and
> >>> schedule() on the block path will mark the vPE as nonresident and *not*
> >>> request a doorbell irq.
> >>
> >> It'd be worth spelling out. You need to go via *three* schedule()
> >> calls: one to be preempted (with DB set), one to be made resident
> >> again, and then the final one in kvm_vcpu_halt(), clearing the DB on
> >> vcpu_put() due to the bug.
> >
> > Yeah, a bit lazy in the wording. What I had meant to imply was
> > preemption happening after the doorbell is set up and before the thread
> > has an opportunity to explicitly schedule out. Perhaps I should just say
> > that.
> >
> >>>
> >>> Fix it by consistently requesting a doorbell irq in the vcpu put path if
> >>> the vCPU is blocking.
>
> Yup. Agreed!
>
> >>> While this technically means we could drop the
> >>> early doorbell irq request in kvm_vcpu_wfi(), deliberately leave it
> >>> intact such that vCPU halt polling can properly detect the wakeup
> >>> condition before actually scheduling out a vCPU.
>
> Yeah, just like what we did in commit 07ab0f8d9a12 ("KVM: Call
> kvm_arch_vcpu_blocking early into the blocking sequence").
>
> My only concern is that if the preemption happens before halt polling,
> we would enter the polling loop with VPE already resident on the RD and
> can't recognize any firing GICv4.x virtual interrupts (targeting this
> VPE) in polling.
[1] The status of the pending bit is recorded in pending_last, so we
don't lose what was snapshot at the point of hitting WFI. But we indeed
don't have any idea for something firing during the polling loop.

> Given that making VPE resident on the vcpu block path (i.e., in
> kvm_vcpu_halt()) makes little sense (right?) and leads to this sort of
> problem, a crude idea is that we can probably keep track of the
> "nested" vgic_v4_{put,load} calls (instead of a single vpe->resident
> flag) and keep VPE *not resident* on the whole block path (like what we
> had before commit 8e01d9a396e6). And we then rely on
> kvm_vcpu_wfi/vgic_v4_load to actually schedule the VPE on...

I'm not sure about the nested tracking part, but it's easy enough to
have a vcpu flag indicating that we're in WFI. So an *alternative* to
the current fix would be something like this:

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index f54ba0a63669..417a0e85456b 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -817,6 +817,8 @@ struct kvm_vcpu_arch {
 #define DBG_SS_ACTIVE_PENDING	__vcpu_single_flag(sflags, BIT(5))
 /* PMUSERENR for the guest EL0 is on physical CPU */
 #define PMUSERENR_ON_CPU	__vcpu_single_flag(sflags, BIT(6))
+/* WFI instruction trapped */
+#define IN_WFI			__vcpu_single_flag(sflags, BIT(7))
 
 /* vcpu entered with HCR_EL2.E2H set */
 #define VCPU_HCR_E2H		__vcpu_single_flag(oflags, BIT(0))
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 236c5f1c9090..cf208d30a9ea 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -725,13 +725,15 @@ void kvm_vcpu_wfi(struct kvm_vcpu *vcpu)
 	 */
 	preempt_disable();
 	kvm_vgic_vmcr_sync(vcpu);
-	vgic_v4_put(vcpu, true);
+	vcpu_set_flag(vcpu, IN_WFI);
+	vgic_v4_put(vcpu);
 	preempt_enable();
 
 	kvm_vcpu_halt(vcpu);
 	vcpu_clear_flag(vcpu, IN_WFIT);
 
 	preempt_disable();
+	vcpu_clear_flag(vcpu, IN_WFI);
 	vgic_v4_load(vcpu);
 	preempt_enable();
 }
@@ -799,7 +801,7 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
 		if (kvm_check_request(KVM_REQ_RELOAD_GICv4, vcpu)) {
 			/* The distributor enable bits were changed */
 			preempt_disable();
-			vgic_v4_put(vcpu, false);
+			vgic_v4_put(vcpu);
 			vgic_v4_load(vcpu);
 			preempt_enable();
 		}
diff --git a/arch/arm64/kvm/vgic/vgic-v3.c b/arch/arm64/kvm/vgic/vgic-v3.c
index 49d35618d576..df61ead7c757 100644
--- a/arch/arm64/kvm/vgic/vgic-v3.c
+++ b/arch/arm64/kvm/vgic/vgic-v3.c
@@ -780,7 +780,7 @@ void vgic_v3_put(struct kvm_vcpu *vcpu)
 	 * done a vgic_v4_put) and when running a nested guest (the
 	 * vPE was never resident in order to generate a doorbell).
 	 */
-	WARN_ON(vgic_v4_put(vcpu, false));
+	WARN_ON(vgic_v4_put(vcpu));
 
 	vgic_v3_vmcr_sync(vcpu);
diff --git a/arch/arm64/kvm/vgic/vgic-v4.c b/arch/arm64/kvm/vgic/vgic-v4.c
index c1c28fe680ba..339a55194b2c 100644
--- a/arch/arm64/kvm/vgic/vgic-v4.c
+++ b/arch/arm64/kvm/vgic/vgic-v4.c
@@ -336,14 +336,14 @@ void vgic_v4_teardown(struct kvm *kvm)
 	its_vm->vpes = NULL;
 }
 
-int vgic_v4_put(struct kvm_vcpu *vcpu, bool need_db)
+int vgic_v4_put(struct kvm_vcpu *vcpu)
 {
 	struct its_vpe *vpe = &vcpu->arch.vgic_cpu.vgic_v3.its_vpe;
 
 	if (!vgic_supports_direct_msis(vcpu->kvm) || !vpe->resident)
 		return 0;
 
-	return its_make_vpe_non_resident(vpe, need_db);
+	return its_make_vpe_non_resident(vpe, !!vcpu_get_flag(vcpu, IN_WFI));
 }
 
 int vgic_v4_load(struct kvm_vcpu *vcpu)
@@ -354,6 +354,9 @@ int vgic_v4_load(struct kvm_vcpu *vcpu)
 	if (!vgic_supports_direct_msis(vcpu->kvm) || vpe->resident)
 		return 0;
 
+	if (vcpu_get_flag(vcpu, IN_WFI))
+		return 0;
+
 	/*
 	 * Before making the VPE resident, make sure the redistributor
 	 * corresponding to our current CPU expects us here. See the
diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
index 9b91a8135dac..765d801d1ddc 100644
--- a/include/kvm/arm_vgic.h
+++ b/include/kvm/arm_vgic.h
@@ -446,7 +446,7 @@ int kvm_vgic_v4_unset_forwarding(struct kvm *kvm, int irq,
 
 int vgic_v4_load(struct kvm_vcpu *vcpu);
 void vgic_v4_commit(struct kvm_vcpu *vcpu);
-int vgic_v4_put(struct kvm_vcpu *vcpu, bool need_db);
+int vgic_v4_put(struct kvm_vcpu *vcpu);
 
 bool vgic_state_is_nested(struct kvm_vcpu *vcpu);

Of course, it is totally untested... ;-) But I like that the doorbell
request is solely driven by the WFI state, and we avoid leaking the
knowledge outside of the vgic code.

Thoughts?

	M.

-- 
Without deviation from the norm, progress is not possible.