Re: [PATCH v3 11/11] KVM: nVMX: Wake L2 from HLT when nested posted-interrupt pending

Sean Christopherson <seanjc@xxxxxxxxxx> · Wed, 25 Nov 2020 01:14:16 +0000

On Wed, Nov 25, 2020, Paolo Bonzini wrote:
> On 24/11/20 22:22, Sean Christopherson wrote:
> > > What if IN_GUEST_MODE/OUTSIDE_GUEST_MODE was replaced by a
> > > generation count?  Then you reread vcpu->mode after sending the IPI, and
> > > retry if it does not match.
> > The sender will have seen IN_GUEST_MODE and so won't retry the IPI,
> > but hardware didn't process the PINV as a posted-interrupt.
> Uff, of course.
> 
> That said, it may still be a good idea to keep pi_pending only as a very
> short-lived flag only to handle this case, and maybe not even need the
> generation count (late here so put up with me if it's wrong :)).  The flag
> would not have to live past vmx_vcpu_run even, the vIRR[PINV] bit would be
> the primary marker that a nested posted interrupt is pending.

I 100% agree that would be ideal for nested state migration, as it would avoid
polluting the ABI with nested.pi_pending, but the downside is that it would mean
KVM could deliver a physical interrupt twice; once to L2 and once to L1.  E.g.

	while (READ_ONCE(vmx->nested.pi_pending) && PID.ON) {
		vmx->nested.pi_pending = false;
		vIRR.PINV = 1;
	}

would incorrectly set vIRR.PINV in the case where hardware handled the PI, and
that could result in L1 seeing the interrupt if a nested exit occured before KVM
processed vIRR.PINV for L2.  Note, without PID.ON, the behavior would be really
bad as KVM would set vIRR.PINV *every* time hardware handled the PINV.

The current behavior just means KVM is more greedy with respect to processing
PID.PIR than it technically should be.  Not sure if that distinction is worth
carrying nested.pi_pending.

> > > > if we're ok with KVM
> > > > processing virtual interrupts that technically shouldn't happen, yet.  E.g. if
> > > > the L0 PINV handler consumes vIRR bits that were set after the last PI from L1.
> > > 
> > > I actually find it curious that the spec promises posted interrupt
> > > processing to be triggered only by the arrival of the posted interrupt IPI.
> > > Why couldn't the processor in principle snoop for the address of the ON bit
> > > instead, similar to an MWAIT?
> > 
> > It would lead to false positives and missed IRQs.
> 
> Not to missed IRQs---false positives on the monitor would be possible, but
> you would still have to send a posted interrupt IPI.  The weird promise is
> that the PINV interrupt is the _only_ trigger for posted interrupts.

Ah, I misunderstood the original "only".  I suspect the primary reason is that
it would cost uops to do the snoop thing and would be inefficient in practice.
The entire PID is guaranteed to be in a single cache line, and most (all?) flows
will write PIR before ON.  So snooping the PID will detect the PIR write, bounce
the cache line by reading it, burn some uops checking that ON isn't set[*], and
then do ???

[*] The pseudocode in the SDM doesn't actually state that the CPU checks PID.ON,
    it only says it clears PID.ON, i.e. assumes that PID.ON is set.  That seems
    like an SDM bug.

> > > But even without that, I don't think the spec promises that kind of strict
> > > ordering with respect to what goes on in the source.  Even though posted
> > > interrupt processing is atomic with the acknowledgement of the posted
> > > interrupt IPI, the spec only promises that the PINV triggers an _eventual_
> > > scan of PID.PIR when the interrupt controller delivers an unmasked external
> > > interrupt to the destination CPU.  You can still have something like
> > > 
> > > 	set PID.PIR[100]
> > > 	set PID.ON
> > > 					processor starts executing a
> > > 					 very slow instruction...
> > > 	send PINV
> > > 	set PID.PIR[200]
> > > 					acknowledge PINV
> > > 
> > > and then vector 200 would be delivered before vector 100.  Of course with
> > > nested PI the effect would be amplified, but it's possible even on bare
> > > metal.
> > 
> > Jim was concerned that L1 could poll the PID to determine whether or not
> > PID.PIR[200] should be seen in L2.  The whole PIR is copied to the vIRR after
> > PID.ON is cleared the auto-EOI is done, and the read->clear is atomic.  So the
> > above sequence where PINV is acknowledge after PID.PIR[200] is legal, but
> > processing PIR bits that are set after the PIR is observed to be cleared would
> > be illegal.
> 
> That would be another case of the unnecessarily specific promise above.
> 
> > E.g. if L1 did this
> > 
> > 	set PID.PIR[100]
> > 	set PID.ON
> > 	send PINV
> > 	while (PID.PIR)
> > 	set PID.PIR[200]
> > 	set PID.ON
> > 
> > This is the part that is likely impossible to
> > solve without shadowing the PID (which, for the record, I have zero desire to do).
> 
> Neither do I. :)  But technically the SDM doesn't promise reading the whole
> 256 bits at the same time.

Hrm, the wording is poor, but my interpretation of this blurb is that the CPU
somehow has a death grip on the PID cache line while it's reading and clearing
the PIR.

  5. The logical processor performs a logical-OR of PIR into VIRR and clears PIR.
     No other agent can read or write a PIR bit (or group of bits) between the
     time it is read (to determine what to OR into VIRR) and when it is cleared.

> Perhaps that's the only way it can work in
> practice due to the cache coherency protocols, but the SDM only promises
> atomicity of the read and clear of "a single PIR bit (or group of bits)".
> So there's in principle no reason why the target CPU couldn't clear
> PID.PIR[100], and then L1 would sneak in and set PID.PIR[200].
> 
> Paolo
> 
> > It seems extremely unlikely any guest will puke on the above, I can't imagine
> > there's for setting a PID.PIR + PID.ON without triggering PINV, but it's
> > technically bad behavior in KVM.
> > 
>