Re: [PATCH v3 11/11] KVM: nVMX: Wake L2 from HLT when nested posted-interrupt pending

On Mon, Nov 23, 2020 at 04:13:49PM -0800, Oliver Upton wrote:
> On Mon, Nov 23, 2020 at 4:10 PM Oliver Upton <oupton@xxxxxxxxxx> wrote:
> >
> > On Mon, Nov 23, 2020 at 2:42 PM Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:
> > >
> > > On 23/11/20 20:22, Oliver Upton wrote:
> > > > The pi_pending bit works rather well as it is only a hint to KVM that it
> > > > may owe the guest a posted-interrupt completion. However, if we were to
> > > > set the guest's nested PINV as pending in the L1 IRR it'd be challenging
> > > > to infer whether or not it should actually be injected in L1 or result
> > > > in posted-interrupt processing for L2.
> > >
> > > Stupid question: why does it matter?  The behavior when the PINV is
> > > delivered does not depend on the time it enters the IRR, only on the
> > > time that it enters ISR.  If that happens while the vCPU is in L2, it
> > > would trigger posted interrupt processing; if PINV moves to ISR while in
> > > L1, it would be delivered normally as an interrupt.
> > >
> > > There are various special cases but they should fall in place.  For
> > > example, if PINV is delivered during L1 vmentry (with IF=0), it would be
> > > delivered at the next inject_pending_event when the VMRUN vmexit is
> > > processed and interrupts are unmasked.
> > >
> > > The tricky case is when L0 tries to deliver the PINV to L1 as a posted
> > > interrupt, i.e. in vmx_deliver_nested_posted_interrupt.  Then the
> > >
> > >                  if (!kvm_vcpu_trigger_posted_interrupt(vcpu, true))
> > >                          kvm_vcpu_kick(vcpu);
> > >
> > > needs a tweak to fall back to setting the PINV in L1's IRR:
> > >
> > >                  if (!kvm_vcpu_trigger_posted_interrupt(vcpu, true)) {
> > >                          /* set PINV in L1's IRR */
> > >                          kvm_vcpu_kick(vcpu);
> > >                  }
> >
> > Yeah, I think that's fair. Regardless, the pi_pending bit should've
> > only been set if the IPI was actually sent. Though I suppose
> 
> Didn't finish my thought :-/
> 
> Though I suppose pi_pending was set unconditionally (and skipped the
> IRR) since, until recently, KVM completely bungled handling of the PINV
> when it was in the L1 IRR.
> 
> >
> > > but you also have to do the same *in the PINV handler*
> > > sysvec_kvm_posted_intr_nested_ipi too, to handle the case where the
> > > L2->L0 vmexit races against sending the IPI.
> >
> > Indeed, there is a race but are we assured that the target vCPU thread
> > is scheduled on the target CPU when that IPI arrives?
> >
> > I believe there is a 1-to-many relationship here, which is why I said
> > each CPU would need to maintain a linked list of possible vCPUs to
> > iterate and find the intended recipient.

Ya, the concern is that it's theoretically possible for the PINV to arrive in L0
after a different vCPU has been loaded (or even multiple different vCPUs).      
E.g. if the sending pCPU is hit with an NMI after checking vcpu->mode, and the  
NMI runs for some absurd amount of time.  If that happens, the PINV handler     
won't know which vCPU(s) should get an IRQ injected into L1 without additional  
tracking.  KVM would need to set something like nested.pi_pending before doing  
kvm_vcpu_trigger_posted_interrupt(), i.e. nothing really changes, it just gets  
more complex.
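
For reference, the existing vmx_deliver_nested_posted_interrupt() flow is
roughly the below (paraphrased from memory, so treat it as a sketch rather
than the literal code); the point being that nested.pi_pending is already set
before the IPI is attempted, and any extra tracking would have to follow the
same ordering:

	if (is_guest_mode(vcpu) && vector == vmx->nested.posted_intr_nv) {
		/*
		 * Record the pending nested PI _before_ trying to send the
		 * IPI so that anything that observes the vCPU afterwards
		 * (PINV handler, vcpu_load(), the next VM-Enter) can see it.
		 */
		vmx->nested.pi_pending = true;
		kvm_make_request(KVM_REQ_EVENT, vcpu);

		/* PIR and ON have already been set by L1. */
		if (!kvm_vcpu_trigger_posted_interrupt(vcpu, true))
			kvm_vcpu_kick(vcpu);
		return 0;
	}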

> > The process of removing vCPUs from the list is quite clear in the case
> > where we catch the IPI in L0, but it doesn't seem like we could ever know
> > to remove vCPUs from the list when hardware catches that IPI.

It's probably possible by shadowing the PI descriptor, but doing so would likely
wipe out the perf benefits of nested PI.

That being said, if we don't care about strictly adhering to the spec (sorry in
advance, Jim), I think we could avoid nested.pi_pending if we're ok with KVM
processing virtual interrupts that technically shouldn't be processed yet, e.g.
if the L0 PINV handler consumes vIRR bits that were set after the last PI from
L1.  KVM already does that with nested.pi_pending, so it might not be the worst
idea in the world given that the existing nested PI implementation mostly works.
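
And for completeness, I read Paolo's "set PINV in L1's IRR" fallback as
something along the lines of the below (untested sketch, and using
kvm_lapic_set_irr() here is my guess at the right helper):

	if (!kvm_vcpu_trigger_posted_interrupt(vcpu, true)) {
		/*
		 * Couldn't post the PINV directly, so fall back to making it
		 * pending in L1's vIRR and let the normal event injection
		 * path decide whether it's delivered to L1 or triggers
		 * posted-interrupt processing on nested VM-Enter.
		 */
		kvm_lapic_set_irr(vmx->nested.posted_intr_nv, vcpu->arch.apic);
		kvm_make_request(KVM_REQ_EVENT, vcpu);
		kvm_vcpu_kick(vcpu);
	}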


