Re: [PATCH v3 11/11] KVM: nVMX: Wake L2 from HLT when nested posted-interrupt pending

Paolo Bonzini <pbonzini@xxxxxxxxxx> · Tue, 24 Nov 2020 12:39:13 +0100

On 24/11/20 02:55, Sean Christopherson wrote:
I believe there is a 1-to-many relationship here, which is why I said
each CPU would need to maintain a linked list of possible vCPUs to
iterate and find the intended recipient.

Ya, the concern is that it's theoretically possible for the PINV to arrive in L0
after a different vCPU has been loaded (or even multiple different vCPUs).
E.g. if the sending pCPU is hit with an NMI after checking vcpu->mode, and the
NMI runs for some absurd amount of time.  If that happens, the PINV handler
won't know which vCPU(s) should get an IRQ injected into L1 without additional
tracking.  KVM would need to set something like nested.pi_pending before doing
kvm_vcpu_trigger_posted_interrupt(), i.e. nothing really changes, it just gets
more complex.

Ah, gotcha.  What if IN_GUEST_MODE/OUTSIDE_GUEST_MODE was replaced by a 
generation count?  Then you reread vcpu->mode after sending the IPI, and 
retry if it does not match.

To guarantee atomicity, the OUTSIDE_GUEST_MODE IN_GUEST_MODE 
EXITING_GUEST_MODE READING_SHADOW_PAGE_TABLES values would remain in the 
bottom 2 bits.  That is, the vcpu->mode accesses like

	vcpu->mode = IN_GUEST_MODE;

	vcpu->mode = OUTSIDE_GUEST_MODE;

	smp_store_mb(vcpu->mode, READING_SHADOW_PAGE_TABLES);

	smp_store_release(&vcpu->mode, OUTSIDE_GUEST_MODE);

	return cmpxchg(&vcpu->mode, IN_GUEST_MODE, EXITING_GUEST_MODE);

becoming

	enum {
		OUTSIDE_GUEST_MODE,
		IN_GUEST_MODE,
		EXITING_GUEST_MODE,
		READING_SHADOW_PAGE_TABLES,
		GUEST_MODE_MASK = 3,
	};

	vcpu->mode = (vcpu->mode | GUEST_MODE_MASK) + 1 + IN_GUEST_MODE;

	vcpu->mode &= ~GUEST_MODE_MASK;

	smp_store_mb(vcpu->mode, vcpu->mode|READING_SHADOW_PAGE_TABLES);

	smp_store_release(&vcpu->mode, vcpu->mode & ~GUEST_MODE_MASK);

	int x = READ_ONCE(vcpu->mode);
	do {
		if ((x & GUEST_MODE_MASK) != IN_GUEST_MODE)
			return x & GUEST_MODE_MASK;
	} while (!try_cmpxchg(&vcpu->mode, &x,
			      x ^ IN_GUEST_MODE ^ EXITING_GUEST_MODE))
	return IN_GUEST_MODE;

You could still get spurious posted interrupt IPIs, but the IPI handler 
need not do anything at all and that is much better.

if we're ok with KVM
processing virtual interrupts that technically shouldn't happen, yet.  E.g. if
the L0 PINV handler consumes vIRR bits that were set after the last PI from L1.

I actually find it curious that the spec promises posted interrupt 
processing to be triggered only by the arrival of the posted interrupt 
IPI.  Why couldn't the processor in principle snoop for the address of 
the ON bit instead, similar to an MWAIT?

But even without that, I don't think the spec promises that kind of 
strict ordering with respect to what goes on in the source.  Even though 
posted interrupt processing is atomic with the acknowledgement of the 
posted interrupt IPI, the spec only promises that the PINV triggers an 
_eventual_ scan of PID.PIR when the interrupt controller delivers an 
unmasked external interrupt to the destination CPU.  You can still have 
something like

	set PID.PIR[100]
	set PID.ON
					processor starts executing a
					 very slow instruction...
	send PINV
	set PID.PIR[200]
					acknowledge PINV

and then vector 200 would be delivered before vector 100.  Of course 
with nested PI the effect would be amplified, but it's possible even on 
bare metal.

Paolo