On 24/11/20 22:22, Sean Christopherson wrote:
>> What if IN_GUEST_MODE/OUTSIDE_GUEST_MODE was replaced by a generation
>> count?  Then you reread vcpu->mode after sending the IPI, and retry if
>> it does not match.

> The sender will have seen IN_GUEST_MODE and so won't retry the IPI, but
> hardware didn't process the PINV as a posted-interrupt.
Uff, of course.
That said, it may still be a good idea to keep pi_pending as a very
short-lived flag that only handles this case, and then maybe we don't
even need the generation count (late here, so bear with me if it's
wrong :)).  The flag would not even have to live past vmx_vcpu_run; the
vIRR[PINV] bit would be the primary marker that a nested posted
interrupt is pending.
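Roughly, and completely untested, something like this is what I have in
mind (the delivery side is more or less what's there today, going from
memory; nested_mark_pinv_pending() is a made-up name, and exactly where
to call it on the entry path would need more thought):

static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu,
					       int vector)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);

	if (is_guest_mode(vcpu) && vector == vmx->nested.posted_intr_nv) {
		/*
		 * Short-lived flag: it only has to survive until the
		 * next entry, where it is turned into vIRR[PINV].
		 */
		vmx->nested.pi_pending = true;
		kvm_make_request(KVM_REQ_EVENT, vcpu);
		if (!kvm_vcpu_trigger_posted_interrupt(vcpu, true))
			kvm_vcpu_kick(vcpu);
		return 0;
	}
	return -1;
}

/* Made-up helper, called on the way into vmx_vcpu_run(). */
static void nested_mark_pinv_pending(struct kvm_vcpu *vcpu)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);

	if (!vmx->nested.pi_pending)
		return;

	vmx->nested.pi_pending = false;
	/* From here on, vIRR[PINV] says a nested PI is pending. */
	kvm_lapic_set_irr(vmx->nested.posted_intr_nv, vcpu->arch.apic);
}

That way pi_pending only bridges the window between the (possibly
not-posted) PINV and the next entry, and afterwards everything keys off
the architectural vIRR state.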
> if we're ok with KVM processing virtual interrupts that technically
> shouldn't happen, yet.  E.g. if the L0 PINV handler consumes vIRR bits
> that were set after the last PI from L1.

>> I actually find it curious that the spec promises posted interrupt
>> processing to be triggered only by the arrival of the posted interrupt
>> IPI.  Why couldn't the processor in principle snoop for the address of
>> the ON bit instead, similar to an MWAIT?

> It would lead to false positives and missed IRQs.
Not to missed IRQs---false positives on the monitor would be possible,
but you would still have to send a posted interrupt IPI. The weird
promise is that the PINV interrupt is the _only_ trigger for posted
interrupts.
>> But even without that, I don't think the spec promises that kind of
>> strict ordering with respect to what goes on in the source.  Even
>> though posted interrupt processing is atomic with the acknowledgement
>> of the posted interrupt IPI, the spec only promises that the PINV
>> triggers an _eventual_ scan of PID.PIR when the interrupt controller
>> delivers an unmasked external interrupt to the destination CPU.  You
>> can still have something like
>>
>>     set PID.PIR[100]
>>     set PID.ON
>>     processor starts executing a very slow instruction...
>>     send PINV
>>     set PID.PIR[200]
>>     acknowledge PINV
>>
>> and then vector 200 would be delivered before vector 100.  Of course
>> with nested PI the effect would be amplified, but it's possible even
>> on bare metal.
> Jim was concerned that L1 could poll the PID to determine whether or not
> PID.PIR[200] should be seen in L2.  The whole PIR is copied to the vIRR
> after PID.ON is cleared and the auto-EOI is done, and the read->clear is
> atomic.  So the above sequence, where PINV is acknowledged after
> PID.PIR[200] is set, is legal, but processing PIR bits that are set
> after the PIR is observed to be cleared would be illegal.
That would be another case of the unnecessarily specific promise above.
E.g. if L1 did this
set PID.PIR[100]
set PID.ON
send PINV
while (PID.PIR)
set PID.PIR[200]
set PID.ON
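Spelled out in made-up, kernel-ish C (the struct layout only loosely
follows the SDM, and l1_send_pinv() doesn't exist anywhere), that would
be:

struct pi_desc {
	unsigned long pir[256 / BITS_PER_LONG];	/* PID.PIR */
	unsigned long control;			/* bit 0 = PID.ON */
};

static void l1_post_twice(struct pi_desc *pid)
{
	set_bit(100, pid->pir);		/* set PID.PIR[100] */
	set_bit(0, &pid->control);	/* set PID.ON */
	l1_send_pinv();			/* made-up: send the PINV IPI */

	/* while (PID.PIR): wait for the target to consume the PIR */
	while (!bitmap_empty(pid->pir, 256))
		cpu_relax();

	set_bit(200, pid->pir);		/* set PID.PIR[200] */
	set_bit(0, &pid->control);	/* set PID.ON, but no new PINV */
}

The point being that the second posting relies purely on polling the
PIR, with no new notification.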
> This is the part that is likely impossible to solve without shadowing
> the PID (which, for the record, I have zero desire to do).
Neither do I. :) But technically the SDM doesn't promise reading the
whole 256 bits at the same time. Perhaps that's the only way it can
work in practice due to the cache coherency protocols, but the SDM only
promises atomicity of the read and clear of "a single PIR bit (or group
of bits)". So there's in principle no reason why the target CPU
couldn't clear PID.PIR[100], and then L1 would sneak in and set
PID.PIR[200].
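For example, nothing stops the target from doing the read-and-clear one
chunk at a time, which I believe is also roughly what our own sync of
the PIR into the vIRR does (modulo the chunk size).  A hypothetical but
SDM-compliant consumer:

static void consume_pir(unsigned long *pir /* 256-bit PID.PIR */,
			unsigned long *virr)
{
	int i;

	/* Atomic read-and-clear of one group of PIR bits at a time. */
	for (i = 0; i < 256 / BITS_PER_LONG; i++)
		virr[i] |= xchg(&pir[i], 0);
}

Between the xchg that clears the chunk containing bit 100 and the one
for the chunk containing bit 200, L1 can observe the PIR as all zeroes,
set PID.PIR[200], and still have it picked up by the same scan.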
Paolo
> It seems extremely unlikely any guest will puke on the above; I can't
> imagine there's a use case for setting a PID.PIR + PID.ON without
> triggering PINV, but it's technically bad behavior in KVM.