On Fri, Sep 22, 2023 at 2:02 PM Mingwei Zhang <mizhang@xxxxxxxxxx> wrote:
>
> On Fri, Sep 22, 2023, Mingwei Zhang wrote:
> > On Fri, Sep 22, 2023 at 1:34 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > >
> > > On Fri, Sep 22, 2023, Mingwei Zhang wrote:
> > > > On Fri, Sep 22, 2023 at 12:21 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > > > >
> > > > > On Fri, Sep 22, 2023, Jim Mattson wrote:
> > > > > > On Fri, Sep 22, 2023 at 11:46 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > > > > > >
> > > > > > > On Fri, Sep 01, 2023, Jim Mattson wrote:
> > > > > > > > When the irq_work callback, kvm_pmi_trigger_fn(), is invoked during a
> > > > > > > > VM-exit that also invokes __kvm_perf_overflow() as a result of
> > > > > > > > instruction emulation, kvm_pmu_deliver_pmi() will be called twice
> > > > > > > > before the next VM-entry.
> > > > > > > >
> > > > > > > > That shouldn't be a problem. The local APIC is supposed to
> > > > > > > > automatically set the mask flag in LVTPC when it handles a PMI, so the
> > > > > > > > second PMI should be inhibited. However, KVM's local APIC emulation
> > > > > > > > fails to set the mask flag in LVTPC when it handles a PMI, so two PMIs
> > > > > > > > are delivered via the local APIC. In the common case, where LVTPC is
> > > > > > > > configured to deliver an NMI, the first NMI is vectored through the
> > > > > > > > guest IDT, and the second one is held pending. When the NMI handler
> > > > > > > > returns, the second NMI is vectored through the IDT. For Linux guests,
> > > > > > > > this results in the "dazed and confused" spurious NMI message.
> > > > > > > >
> > > > > > > > Though the obvious fix is to set the mask flag in LVTPC when handling
> > > > > > > > a PMI, KVM's logic around synthesizing a PMI is unnecessarily
> > > > > > > > convoluted.
> > > > > > >
> > > > > > > To address Like's question about whether or not this is necessary, I think we
> > > > > > > should rephrase this to explicitly state that this is a bug irrespective of the
> > > > > > > whole LVTPC masking thing.
> > > > > > >
> > > > > > > And I think it makes sense to swap the order of the two patches. The LVTPC
> > > > > > > masking fix is a clear-cut architectural violation. This one is a bit more of a
> > > > > > > grey area, though still blatantly buggy.
> > > > > >
> > > > > > The reason I ordered the patches as I did is that when this patch
> > > > > > comes first, it actually fixes the problem that was introduced in
> > > > > > commit 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring
> > > > > > instructions"). If this patch comes second, it's less clear that it
> > > > > > fixes a bug, since the other patch renders this one essentially moot.
> > > > >
> > > > > Yeah, but as Like pointed out, the way the changelog is worded just raises the
> > > > > question of why this change is necessary.
> > > > >
> > > > > I think we should tag them both for stable. They're both bug fixes, regardless
> > > > > of the ordering.
> > > >
> > > > Agree. Both patches address the general problem of multiple PMI
> > > > injections on a single VM-entry: one is a software-level defense
> > > > (forcing the usage of KVM_REQ_PMI) and one is a hardware-level defense
> > > > (preventing PMI injection using the LVTPC mask).
> > > >
> > > > Although neither patch in this series fixes the root cause of this
> > > > specific double PMI injection bug, I don't see a reason why we cannot
> > > > add a "Fixes" tag to them, since the root cause may be fixed now and
> > > > reintroduced again later.
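
As an aside, the "hardware-level defense" above is the LVTPC masking fix
from Jim's series: architecturally, the local APIC sets the mask bit in
LVTPC when it dispatches a PMI, so a second PMI is held off until the
guest's handler unmasks it. In KVM's local APIC emulation, that fix
amounts to roughly the following in kvm_apic_local_deliver() (an
untested sketch of the idea, not necessarily Jim's exact patch):

static int kvm_apic_local_deliver(struct kvm_lapic *apic, int lvt_type)
{
	u32 reg = kvm_lapic_get_reg(apic, lvt_type);
	int vector, mode, trig_mode;
	int r;

	if (kvm_apic_hw_enabled(apic) && !(reg & APIC_LVT_MASKED)) {
		vector = reg & APIC_VECTOR_MASK;
		mode = reg & APIC_MODE_MASK;
		trig_mode = reg & APIC_LVT_LEVEL_TRIGGER;

		r = __apic_accept_irq(apic, mode, vector, 1, trig_mode, NULL);

		/*
		 * Mirror real hardware: inhibit further PMIs until the
		 * guest unmasks LVTPC, so a second PMI synthesized before
		 * the next VM-entry cannot be delivered on top of the
		 * first one.
		 */
		if (r && lvt_type == APIC_LVTPC)
			kvm_lapic_set_reg(apic, lvt_type,
					  reg | APIC_LVT_MASKED);
		return r;
	}
	return 0;
}

With something like that in place, a second kvm_pmu_deliver_pmi() before
the next VM-entry finds LVTPC masked and is dropped, as on real hardware.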
> > > >
> > > > I am currently working on it and testing my patch. Please give me some
> > > > time; I think I could try sending out one version today. Once that is
> > > > done, I will combine mine with the existing patch and send it out as a
> > > > series.
> > >
> > > Me confused, what patch? And what does this patch have to do with Jim's series?
> > > Unless I've missed something, Jim's patches are good to go with my nits addressed.
> >
> > Let me step back.
> >
> > We have the following problem when we run perf inside the guest:
> >
> > [ 1437.487320] Uhhuh. NMI received for unknown reason 20 on CPU 3.
> > [ 1437.487330] Dazed and confused, but trying to continue
> >
> > This means there are more NMIs than the guest PMI handler can account
> > for. So there are potentially two approaches to solve the problem:
> > 1) fix the PMI injection issue so that only one PMI can be injected;
> > 2) fix the code that causes the (incorrect) multiple PMI injections.
> >
> > I am working on the 2nd one. The property of the 2nd approach is that,
> > even without the patches in 1) (Jim's patches), we can still avoid the
> > above warning messages.
> >
> > Thanks.
> > -Mingwei
>
> This is my draft version. If you don't have full-width counter support, this
> patch needs to be placed on top of this one:
> https://lore.kernel.org/all/20230504120042.785651-1-rkagan@xxxxxxxxx/
>
> My initial testing on both QEMU and our GCP testing environment shows no
> "Uhhuh..." dmesg in the guest.
>
> Please take a look...
>
> From 47e629269d8b0ff65c242334f068300216cb7f91 Mon Sep 17 00:00:00 2001
> From: Mingwei Zhang <mizhang@xxxxxxxxxx>
> Date: Fri, 22 Sep 2023 17:13:55 +0000
> Subject: [PATCH] KVM: x86/pmu: Fix emulated counter increment due to
>  instruction emulation
>
> Fix the emulated counter increment performed for instruction emulation.
> KVM's pmc->counter is only a snapshot taken while the counter is
> running; it does not represent the counter's actual current value, so
> it is inappropriate to compare it against other counter values. The
> existing code compares pmc->prev_counter with pmc->counter directly,
> but pmc->prev_counter is itself a snapshot assigned from pmc->counter
> while the counter may still be running. This comparison logic in
> reprogram_counter() therefore generates incorrect invocations of
> __kvm_perf_overflow(in_pmi=false) and duplicated PMI injection
> requests.
>
> Fix this issue by adding an emulated_counter field and doing the
> counter arithmetic only after the counter has been paused.
>
> Change-Id: I2d59e68557fd35f7bbcfe09ea42ad81bd36776b7
> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/pmu.c              | 15 ++++++++-------
>  arch/x86/kvm/svm/pmu.c          |  1 +
>  arch/x86/kvm/vmx/pmu_intel.c    |  2 ++
>  4 files changed, 12 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 1a4def36d5bb..47bbfbc0aa35 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -494,6 +494,7 @@ struct kvm_pmc {
>  	bool intr;
>  	u64 counter;
>  	u64 prev_counter;
> +	u64 emulated_counter;
>  	u64 eventsel;
>  	struct perf_event *perf_event;
>  	struct kvm_vcpu *vcpu;
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index edb89b51b383..47acf3a2b077 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -240,12 +240,13 @@ static void pmc_pause_counter(struct kvm_pmc *pmc)
>  {
>  	u64 counter = pmc->counter;
>  
> -	if (!pmc->perf_event || pmc->is_paused)
> -		return;
> -
>  	/* update counter, reset event value to avoid redundant accumulation */
> -	counter += perf_event_pause(pmc->perf_event, true);
> -	pmc->counter = counter & pmc_bitmask(pmc);
> +	if (pmc->perf_event && !pmc->is_paused)
> +		counter += perf_event_pause(pmc->perf_event, true);
> +
> +	pmc->prev_counter = counter & pmc_bitmask(pmc);
> +	pmc->counter = (counter + pmc->emulated_counter) & pmc_bitmask(pmc);
> +	pmc->emulated_counter = 0;
>  	pmc->is_paused = true;
>  }
>  
> @@ -452,6 +453,7 @@ static void reprogram_counter(struct kvm_pmc *pmc)
>  reprogram_complete:
>  	clear_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->reprogram_pmi);
>  	pmc->prev_counter = 0;
> +	pmc->emulated_counter = 0;
>  }
>  
>  void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
> @@ -725,8 +727,7 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu)
>  
>  static void kvm_pmu_incr_counter(struct kvm_pmc *pmc)
>  {
> -	pmc->prev_counter = pmc->counter;
> -	pmc->counter = (pmc->counter + 1) & pmc_bitmask(pmc);
> +	pmc->emulated_counter += 1;
>  	kvm_pmu_request_counter_reprogram(pmc);
>  }
>  
> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
> index a25b91ff9aea..b88fab4ae1d7 100644
> --- a/arch/x86/kvm/svm/pmu.c
> +++ b/arch/x86/kvm/svm/pmu.c
> @@ -243,6 +243,7 @@ static void amd_pmu_reset(struct kvm_vcpu *vcpu)
>  
>  		pmc_stop_counter(pmc);
>  		pmc->counter = pmc->prev_counter = pmc->eventsel = 0;
> +		pmc->emulated_counter = 0;
>  	}
>  
>  	pmu->global_ctrl = pmu->global_status = 0;
> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> index 626df5fdf542..d03c4ec7273d 100644
> --- a/arch/x86/kvm/vmx/pmu_intel.c
> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> @@ -641,6 +641,7 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu)
>  
>  		pmc_stop_counter(pmc);
>  		pmc->counter = pmc->prev_counter = pmc->eventsel = 0;
> +		pmc->emulated_counter = 0;
>  	}
>  
>  	for (i = 0; i < KVM_PMC_MAX_FIXED; i++) {
> @@ -648,6 +649,7 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu)
>  
>  		pmc_stop_counter(pmc);
>  		pmc->counter = pmc->prev_counter = 0;
> +		pmc->emulated_counter = 0;
>  	}
>  
>  	pmu->fixed_ctr_ctrl = pmu->global_ctrl = pmu->global_status = 0;
> --
> 2.42.0.515.g380fc7ccd1-goog

Signed-off-by: Mingwei Zhang <mizhang@xxxxxxxxxx>
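
To make the new accounting concrete: emulated events accumulate in
pmc->emulated_counter and are folded into pmc->counter only after the
hardware count has been snapshotted into pmc->prev_counter, so the wrap
check in reprogram_counter(), i.e. pmc->counter < pmc->prev_counter,
fires only when emulated increments genuinely overflowed the counter.
A small stand-alone model of the arithmetic (illustration only;
hypothetical user-space code mirroring the patched pmc_pause_counter(),
not kernel code):

/* Stand-alone model of the emulated_counter accounting. */
#include <stdio.h>
#include <stdint.h>

#define PMC_BITMASK ((1ULL << 48) - 1)	/* e.g. a 48-bit wide PMC */

struct pmc {
	uint64_t counter;		/* snapshot of the (paused) count */
	uint64_t prev_counter;
	uint64_t emulated_counter;	/* events counted via emulation */
};

/*
 * Mirrors the patched pmc_pause_counter(): hw_delta stands in for the
 * value perf_event_pause() returns; emulated events are folded in only
 * after the hardware count has been snapshotted.
 */
static void pause_counter(struct pmc *pmc, uint64_t hw_delta)
{
	uint64_t counter = pmc->counter + hw_delta;

	pmc->prev_counter = counter & PMC_BITMASK;
	pmc->counter = (counter + pmc->emulated_counter) & PMC_BITMASK;
	pmc->emulated_counter = 0;
}

int main(void)
{
	/* Counter programmed two events away from overflow. */
	struct pmc pmc = { .counter = PMC_BITMASK - 1 };

	/* Three instructions retired via emulation, none in hardware. */
	pmc.emulated_counter = 3;
	pause_counter(&pmc, 0);

	/* The wrap check from reprogram_counter(): a single, genuine
	 * overflow is detected, so exactly one PMI is requested. */
	if (pmc.counter < pmc.prev_counter)
		printf("overflow -> request one PMI\n");

	return 0;
}

Running this prints the overflow line exactly once, matching the single
PMI that the patched kvm_pmu_incr_counter() path should ultimately
request via KVM_REQ_PMI.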