Re: [PATCH 1/2] KVM: x86: Synthesize at most one PMI per VM-exit

On 23/9/2023 3:21 am, Sean Christopherson wrote:
On Fri, Sep 22, 2023, Jim Mattson wrote:
On Fri, Sep 22, 2023 at 11:46 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:

On Fri, Sep 01, 2023, Jim Mattson wrote:
When the irq_work callback, kvm_pmi_trigger_fn(), is invoked during a
VM-exit that also invokes __kvm_perf_overflow() as a result of
instruction emulation, kvm_pmu_deliver_pmi() will be called twice
before the next VM-entry.

That shouldn't be a problem. The local APIC is supposed to
automatically set the mask flag in LVTPC when it handles a PMI, so the
second PMI should be inhibited. However, KVM's local APIC emulation
fails to set the mask flag in LVTPC when it handles a PMI, so two PMIs
are delivered via the local APIC. In the common case, where LVTPC is
configured to deliver an NMI, the first NMI is vectored through the
guest IDT, and the second one is held pending. When the NMI handler
returns, the second NMI is vectored through the IDT. For Linux guests,
this results in the "dazed and confused" spurious NMI message.

Though the obvious fix is to set the mask flag in LVTPC when handling
a PMI, KVM's logic around synthesizing a PMI is unnecessarily
convoluted.
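
For reference, a rough sketch of what setting that mask flag could look like, modeled on kvm_apic_local_deliver() in arch/x86/kvm/lapic.c; treat it as an illustration of the idea rather than the exact patch as posted. The point is simply that once a PMI is accepted through LVTPC, the entry is masked so a second synthesized PMI is held off until the guest's handler unmasks it, as the changelog says real hardware does.

/*
 * Illustrative sketch only (not necessarily the patch as merged):
 * when a PMI is accepted through LVTPC, also set the mask bit so that
 * a second synthesized PMI is inhibited until the guest unmasks the
 * LVT entry, mirroring hardware behavior.
 */
static int kvm_apic_local_deliver(struct kvm_lapic *apic, int lvt_type)
{
	u32 reg = kvm_lapic_get_reg(apic, lvt_type);
	int vector, mode, trig_mode;
	int r;

	if (!kvm_apic_hw_enabled(apic) || (reg & APIC_LVT_MASKED))
		return 0;

	vector = reg & APIC_VECTOR_MASK;
	mode = reg & APIC_MODE_MASK;
	trig_mode = reg & APIC_LVT_LEVEL_TRIGGER;

	r = __apic_accept_irq(apic, mode, vector, 1, trig_mode, NULL);
	if (r && lvt_type == APIC_LVTPC)
		/* Mirror hardware: handling a PMI auto-masks LVTPC. */
		kvm_lapic_set_reg(apic, lvt_type, reg | APIC_LVT_MASKED);
	return r;
}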

To address Like's question about whether or not this is necessary, I think we should
rephrase this to explicitly state this is a bug irrespective of the whole LVTPC
masking thing.

And I think it makes sense to swap the order of the two patches.  The LVTPC masking
fix is a clear-cut architectural violation.  This is a bit more of a grey area,
though still blatantly buggy.

The reason I ordered the patches as I did is that when this patch
comes first, it actually fixes the problem that was introduced in
commit 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring
instructions"). If this patch comes second, it's less clear that it
fixes a bug, since the other patch renders this one essentially moot.

Yeah, but as Like pointed out, the way the changelog is worded just raises the
question of why this change is necessary.

I think we should tag them both for stable.  They're both bug fixes, regardless
of the ordering.


Under the semantics of "at most one PMI per VM-exit", what happens if the
PMI-induced VM-exit itself causes a guest counter to overflow and triggers
another vPMI (for example, when the L1 guest is counting VM-exit events
from an L2 guest)? Will that second interrupt be swallowed by L0 KVM? What
is the correct expectation here? It may differ between Intel and AMD.
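
For concreteness, here is a minimal sketch of the coalescing being discussed, built on KVM's existing request plumbing (KVM_REQ_PMI, kvm_make_request(), kvm_check_request() and kvm_pmu_deliver_pmi() are existing KVM symbols; kvm_pmu_mark_pmi_pending() and kvm_pmu_flush_pending_pmi() are hypothetical helpers for this example, not code from the series). With everything funneled through a single request bit, an overflow taken during the PMI-induced VM-exit folds into the already-pending PMI rather than producing a second interrupt before the next VM-entry.

/*
 * Illustrative sketch only: one way to get "at most one PMI per
 * VM-exit" is to funnel every counter overflow through a single
 * vCPU request bit.
 */
static void kvm_pmu_mark_pmi_pending(struct kvm_vcpu *vcpu)
{
	/*
	 * Setting an already-set request is a no-op, so any number of
	 * overflows between VM-exit and the next VM-entry collapses
	 * into a single pending PMI.
	 */
	kvm_make_request(KVM_REQ_PMI, vcpu);
}

/*
 * Consumed exactly once on the way back into the guest, as
 * vcpu_enter_guest() already does for KVM_REQ_PMI.
 */
static void kvm_pmu_flush_pending_pmi(struct kvm_vcpu *vcpu)
{
	if (kvm_check_request(KVM_REQ_PMI, vcpu))
		kvm_pmu_deliver_pmi(vcpu);
}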


