On Mon, Nov 23, 2020 at 4:10 PM Oliver Upton <oupton@xxxxxxxxxx> wrote:
>
> On Mon, Nov 23, 2020 at 2:42 PM Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:
> >
> > On 23/11/20 20:22, Oliver Upton wrote:
> > > The pi_pending bit works rather well as it is only a hint to KVM
> > > that it may owe the guest a posted-interrupt completion. However,
> > > if we were to set the guest's nested PINV as pending in the L1 IRR
> > > it'd be challenging to infer whether it should actually be injected
> > > in L1 or result in posted-interrupt processing for L2.
> >
> > Stupid question: why does it matter? The behavior when the PINV is
> > delivered does not depend on the time it enters the IRR, only on the
> > time that it enters the ISR. If that happens while the vCPU is in L2,
> > it would trigger posted-interrupt processing; if the PINV moves to
> > the ISR while in L1, it would be delivered normally as an interrupt.
> >
> > There are various special cases, but they should fall into place.
> > For example, if the PINV is delivered during L1 vmentry (with IF=0),
> > it would be delivered at the next inject_pending_event, when the
> > VMRUN vmexit is processed and interrupts are unmasked.
> >
> > The tricky case is when L0 tries to deliver the PINV to L1 as a
> > posted interrupt, i.e. in vmx_deliver_nested_posted_interrupt.
> > Then the
> >
> >         if (!kvm_vcpu_trigger_posted_interrupt(vcpu, true))
> >                 kvm_vcpu_kick(vcpu);
> >
> > needs a tweak to fall back to setting the PINV in L1's IRR:
> >
> >         if (!kvm_vcpu_trigger_posted_interrupt(vcpu, true)) {
> >                 /* set PINV in L1's IRR */
> >                 kvm_vcpu_kick(vcpu);
> >         }
>
> Yeah, I think that's fair. Regardless, the pi_pending bit should've
> only been set if the IPI was actually sent. Though I suppose

Didn't finish my thought :-/

Though I suppose pi_pending was set unconditionally (and skipped the
IRR) because, until recently, KVM completely bungled the handling of a
PINV pending in the L1 IRR.

> > but you also have to do the same *in the PINV handler*
> > sysvec_kvm_posted_intr_nested_ipi too, to handle the case where the
> > L2->L0 vmexit races against sending the IPI.
>
> Indeed, there is a race, but are we assured that the target vCPU
> thread is scheduled on the target CPU when that IPI arrives?
>
> I believe there is a 1-to-many relationship here, which is why I said
> each CPU would need to maintain a linked list of possible vCPUs to
> iterate and find the intended recipient. Removing vCPUs from the list
> when we catch the IPI in L0 is straightforward, but it doesn't seem
> like we could ever know to remove a vCPU from the list when hardware
> catches that IPI.
>
> If the ISR thing can be figured out then that'd be great, though it
> seems infeasible because we are racing with scheduling on the target.
>
> Could we split the difference and do something like:
>
>         if (kvm_vcpu_trigger_posted_interrupt(vcpu, true)) {
>                 vmx->nested.pi_pending = true;
>         } else {
>                 /* set PINV in L1's IRR */
>                 kvm_vcpu_kick(vcpu);
>         }
>
> which ensures we only set the hint when KVM might actually have
> something to do. Otherwise, the PINV is delivered to L1 like a normal
> interrupt, or triggers posted-interrupt processing on nested VM-entry
> if IF=0.
>
> > What am I missing?
> >
> > Paolo
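
For concreteness, here is a rough sketch of what I'm proposing, folded
into the existing vmx_deliver_nested_posted_interrupt(). It's only a
sketch: kvm_lapic_set_irr() stands in for whatever the right plumbing
for "set PINV in L1's IRR" turns out to be, and I haven't thought hard
about ordering against KVM_REQ_EVENT.

static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu,
					       int vector)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);

	if (is_guest_mode(vcpu) &&
	    vector == vmx->nested.posted_intr_nv) {
		/* The PIR and ON bit have already been set by L1. */
		if (kvm_vcpu_trigger_posted_interrupt(vcpu, true)) {
			/*
			 * The notification IPI was sent while the vCPU
			 * was in non-root mode, so KVM may owe the guest
			 * a posted-interrupt completion; leave the hint.
			 */
			vmx->nested.pi_pending = true;
		} else {
			/*
			 * Couldn't send the IPI while the vCPU was in
			 * non-root mode: fall back to making the PINV
			 * pending in L1's IRR. From there it is either
			 * delivered to L1 as a normal interrupt or, if
			 * it reaches the ISR while in L2, triggers
			 * posted-interrupt processing.
			 */
			kvm_lapic_set_irr(vector, vcpu->arch.apic);
			kvm_make_request(KVM_REQ_EVENT, vcpu);
			kvm_vcpu_kick(vcpu);
		}
		return 0;
	}
	return -1;
}

Whether the fallback would also need mirroring in
sysvec_kvm_posted_intr_nested_ipi, per your point about racing with the
L2->L0 vmexit, is still the open question.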