Re: [PATCH v3 11/11] KVM: nVMX: Wake L2 from HLT when nested posted-interrupt pending

Jim Mattson <jmattson@xxxxxxxxxx> · Thu, 3 Dec 2020 14:07:10 -0800

On Tue, Nov 24, 2020 at 3:39 AM Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:
>
> On 24/11/20 02:55, Sean Christopherson wrote:
> >>> I believe there is a 1-to-many relationship here, which is why I said
> >>> each CPU would need to maintain a linked list of possible vCPUs to
> >>> iterate and find the intended recipient.
> >
> > Ya, the concern is that it's theoretically possible for the PINV to arrive in L0
> > after a different vCPU has been loaded (or even multiple different vCPUs).
> > E.g. if the sending pCPU is hit with an NMI after checking vcpu->mode, and the
> > NMI runs for some absurd amount of time.  If that happens, the PINV handler
> > won't know which vCPU(s) should get an IRQ injected into L1 without additional
> > tracking.  KVM would need to set something like nested.pi_pending before doing
> > kvm_vcpu_trigger_posted_interrupt(), i.e. nothing really changes, it just gets
> > more complex.
>
> Ah, gotcha.  What if IN_GUEST_MODE/OUTSIDE_GUEST_MODE was replaced by a
> generation count?  Then you reread vcpu->mode after sending the IPI, and
> retry if it does not match.

Meanwhile, the IPI that you previously sent may have triggered posted
interrupt processing for the wrong vCPU.

Consider an overcommitted scenario, where each pCPU is running two
vCPUs. L1 initializes all of the L2 PIDs so that PIR[x] and PID.ON are
set. Then, it sends the VMCS12 NV IPI to one specific L2 vCPU. When L0
processes the VM-exit for sending the IPI, the target vCPU is running
IN_GUEST_MODE, so we send the vmcs02 NV to the corresponding pCPU.
But, by the time the IPI is received on the pCPU, a different L2 vCPU
is running. Oops. It seems that we have to pin the target vCPU to the
pCPU until the IPI arrives. But since it's the vmcs02 NV, we have no
way of knowing whether or not it has arrived.

If the requisite IPI delivery delays seem too absurd to contemplate,
consider pushing L0 down into L1.

> To guarantee atomicity, the OUTSIDE_GUEST_MODE IN_GUEST_MODE
> EXITING_GUEST_MODE READING_SHADOW_PAGE_TABLES values would remain in the
> bottom 2 bits.  That is, the vcpu->mode accesses like
>
>         vcpu->mode = IN_GUEST_MODE;
>
>         vcpu->mode = OUTSIDE_GUEST_MODE;
>
>         smp_store_mb(vcpu->mode, READING_SHADOW_PAGE_TABLES);
>
>         smp_store_release(&vcpu->mode, OUTSIDE_GUEST_MODE);
>
>         return cmpxchg(&vcpu->mode, IN_GUEST_MODE, EXITING_GUEST_MODE);
>
> becoming
>
>         enum {
>                 OUTSIDE_GUEST_MODE,
>                 IN_GUEST_MODE,
>                 EXITING_GUEST_MODE,
>                 READING_SHADOW_PAGE_TABLES,
>                 GUEST_MODE_MASK = 3,
>         };
>
>         vcpu->mode = (vcpu->mode | GUEST_MODE_MASK) + 1 + IN_GUEST_MODE;
>
>         vcpu->mode &= ~GUEST_MODE_MASK;
>
>         smp_store_mb(vcpu->mode, vcpu->mode|READING_SHADOW_PAGE_TABLES);
>
>         smp_store_release(&vcpu->mode, vcpu->mode & ~GUEST_MODE_MASK);
>
>         int x = READ_ONCE(vcpu->mode);
>         do {
>                 if ((x & GUEST_MODE_MASK) != IN_GUEST_MODE)
>                         return x & GUEST_MODE_MASK;
>         } while (!try_cmpxchg(&vcpu->mode, &x,
>                               x ^ IN_GUEST_MODE ^ EXITING_GUEST_MODE))
>         return IN_GUEST_MODE;
>
> You could still get spurious posted interrupt IPIs, but the IPI handler
> need not do anything at all and that is much better.
>
> > if we're ok with KVM
> > processing virtual interrupts that technically shouldn't happen, yet.  E.g. if
> > the L0 PINV handler consumes vIRR bits that were set after the last PI from L1.
>
> I actually find it curious that the spec promises posted interrupt
> processing to be triggered only by the arrival of the posted interrupt
> IPI.  Why couldn't the processor in principle snoop for the address of
> the ON bit instead, similar to an MWAIT?
>
> But even without that, I don't think the spec promises that kind of
> strict ordering with respect to what goes on in the source.  Even though
> posted interrupt processing is atomic with the acknowledgement of the
> posted interrupt IPI, the spec only promises that the PINV triggers an
> _eventual_ scan of PID.PIR when the interrupt controller delivers an
> unmasked external interrupt to the destination CPU.  You can still have
> something like
>
>         set PID.PIR[100]
>         set PID.ON
>                                         processor starts executing a
>                                          very slow instruction...
>         send PINV
>         set PID.PIR[200]
>                                         acknowledge PINV
>
> and then vector 200 would be delivered before vector 100.  Of course
> with nested PI the effect would be amplified, but it's possible even on
> bare metal.
>
> Paolo
>