Re: [PATCH v3 11/11] KVM: nVMX: Wake L2 from HLT when nested posted-interrupt pending

Oliver Upton <oupton@xxxxxxxxxx> · Mon, 23 Nov 2020 16:10:48 -0800

On Mon, Nov 23, 2020 at 2:42 PM Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:
>
> On 23/11/20 20:22, Oliver Upton wrote:
> > The pi_pending bit works rather well as it is only a hint to KVM that it
> > may owe the guest a posted-interrupt completion. However, if we were to
> > set the guest's nested PINV as pending in the L1 IRR it'd be challenging
> > to infer whether or not it should actually be injected in L1 or result
> > in posted-interrupt processing for L2.
>
> Stupid question: why does it matter?  The behavior when the PINV is
> delivered does not depend on the time it enters the IRR, only on the
> time that it enters ISR.  If that happens while the vCPU is in L2, it
> would trigger posted interrupt processing; if PINV moves to ISR while in
> L1, it would be delivered normally as an interrupt.
>
> There are various special cases but they should fall in place.  For
> example, if PINV is delivered during L1 vmentry (with IF=0), it would be
> delivered at the next inject_pending_event when the VMRUN vmexit is
> processed and interrupts are unmasked.
>
> The tricky case is when L0 tries to deliver the PINV to L1 as a posted
> interrupt, i.e. in vmx_deliver_nested_posted_interrupt.  Then the
>
>                  if (!kvm_vcpu_trigger_posted_interrupt(vcpu, true))
>                          kvm_vcpu_kick(vcpu);
>
> needs a tweak to fall back to setting the PINV in L1's IRR:
>
>                  if (!kvm_vcpu_trigger_posted_interrupt(vcpu, true)) {
>                          /* set PINV in L1's IRR */
>                          kvm_vcpu_kick(vcpu);
>                  }

Yeah, I think that's fair. Regardless, the pi_pending bit should've
only been set if the IPI was actually sent. Though I suppose

> but you also have to do the same *in the PINV handler*
> sysvec_kvm_posted_intr_nested_ipi too, to handle the case where the
> L2->L0 vmexit races against sending the IPI.

Indeed, there is a race, but are we assured that the target vCPU thread
is scheduled on the target CPU when that IPI arrives?

I believe there is a 1-to-many relationship here, which is why I said
each CPU would need to maintain a linked list of possible vCPUs to
iterate and find the intended recipient. The process of removing vCPUs
from the list where we caught the IPI in L0 is quite clear, but it
doesn't seem like we could ever know to remove vCPUs from the list
when hardware catches that IPI.

If the ISR thing can be figured out then that'd be great, though it
seems infeasible because we are racing with scheduling on the target.

Could we split the difference and do something like:

        if (kvm_vcpu_trigger_posted_interrupt(vcpu, true)) {
                vmx->nested.pi_pending = true;
        } else {
                /* set PINV in L1's IRR */
                kvm_vcpu_kick(vcpu);
        }

which ensures we only set the hint when KVM might actually have
something to do. Otherwise, it'll deliver to L1 like a normal
interrupt or trigger posted-interrupt processing on nested VM-entry if
IF=0.
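
To make the two outcomes concrete, here is a rough standalone model of
that branch. It is only a sketch and not KVM code: struct model_vcpu,
its fields, and the model_* helpers are all made up for illustration,
and model_trigger_posted_interrupt() merely assumes the IPI is sent
only when the vCPU is currently running in L2 on a physical CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the state discussed above; not KVM's structs. */
struct model_vcpu {
	bool running_l2;     /* assumed: an IPI can only be sent in this case */
	bool pi_pending;     /* hint: KVM may owe L2 a PI completion */
	uint64_t l1_irr[4];  /* 256-bit model of L1's IRR */
	unsigned int kicks;  /* counts stand-ins for kvm_vcpu_kick() */
	uint8_t nested_pinv; /* L1's posted-interrupt notification vector */
};

/* Models kvm_vcpu_trigger_posted_interrupt(): true iff the IPI was sent. */
static bool model_trigger_posted_interrupt(const struct model_vcpu *v)
{
	return v->running_l2;
}

static bool model_l1_irr_test(const struct model_vcpu *v, uint8_t vec)
{
	return (v->l1_irr[vec / 64] >> (vec % 64)) & 1;
}

/* The proposed split: set the hint only if the IPI actually went out. */
static void model_deliver_nested_pi(struct model_vcpu *v)
{
	if (model_trigger_posted_interrupt(v)) {
		v->pi_pending = true;
	} else {
		/* set PINV in L1's IRR */
		v->l1_irr[v->nested_pinv / 64] |=
			UINT64_C(1) << (v->nested_pinv % 64);
		v->kicks++; /* stands in for kvm_vcpu_kick() */
	}
}

/* Usage: the same vCPU, first not running in L2, then running in L2. */
static void model_example(void)
{
	struct model_vcpu v = { .running_l2 = false, .nested_pinv = 0xf2 };

	model_deliver_nested_pi(&v);
	assert(!v.pi_pending && model_l1_irr_test(&v, 0xf2) && v.kicks == 1);

	v.running_l2 = true;
	model_deliver_nested_pi(&v);
	assert(v.pi_pending && v.kicks == 1);
}
```

The point of the model is just that exactly one of the two side effects
happens per delivery attempt: either the hint is set because an IPI is
in flight, or the vector lands in L1's IRR and the vCPU is kicked.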

> What am I missing?
>
> Paolo
>