Re: [PATCH] KVM: x86: vPMU: truncate counter value to allowed width

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Jun 29, 2023 at 5:11 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> +Mingwei
>
> On Thu, Jun 29, 2023, Jim Mattson wrote:
> > On Mon, Jun 5, 2023 at 5:51 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > >
> > > On Thu, May 04, 2023, Roman Kagan wrote:
> > > > diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> > > > index 5c7bbf03b599..6a91e1afef5a 100644
> > > > --- a/arch/x86/kvm/pmu.h
> > > > +++ b/arch/x86/kvm/pmu.h
> > > > @@ -60,6 +60,12 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
> > > >       return counter & pmc_bitmask(pmc);
> > > >  }
> > > >
> > > > +static inline void pmc_set_counter(struct kvm_pmc *pmc, u64 val)
> > > > +{
> > > > +     pmc->counter += val - pmc_read_counter(pmc);
> > >
> > > Ugh, not your code, but I don't see how the current code can possibly be correct.
> > >
> > > The above unpacks to
> > >
> > >         counter = pmc->counter;
> > >         if (pmc->perf_event && !pmc->is_paused)
> > >                 counter += perf_event_read_value(pmc->perf_event,
> > >                                                  &enabled, &running);
> > >         pmc->counter += val - (counter & pmc_bitmask(pmc));
> > >
> > > which distills down to
> > >
> > >         counter = 0;
> > >         if (pmc->perf_event && !pmc->is_paused)
> > >                 counter += perf_event_read_value(pmc->perf_event,
> > >                                                  &enabled, &running);
> > >         pmc->counter = val - (counter & pmc_bitmask(pmc));
> > >
> > > or more succinctly
> > >
> > >         if (pmc->perf_event && !pmc->is_paused)
> > >                 val -= perf_event_read_value(pmc->perf_event, &enabled, &running);
> > >
> > >         pmc->counter = val;
> > >
> > > which is obviously wrong.  E.g. if the guest writes '0' to an active counter, the
> > > adjustment will cause pmc->counter to be loaded with a large (in unsigned terms)
> > > value, and thus quickly overflow after a write of '0'.
> >
> > This weird construct goes all the way back to commit f5132b01386b
> > ("KVM: Expose a version 2 architectural PMU to a guests"). Paolo
> > killed it in commit 2924b52117b2 ("KVM: x86/pmu: do not mask the value
> > that is written to fixed PMUs"), perhaps by accident. Eric then
> > resurrected it in commit 4400cf546b4b ("KVM: x86: Fix perfctr WRMSR
> > for running counters").
> >
> > It makes no sense to me. WRMSR should just set the new value of the
> > counter, regardless of the old value or whether or not it is running.
>
> Heh, didn't stop you from giving Eric's patch a thumbs-up[*] :-)

I was smarter then, and probably understood what was going on.

> Thanks to Eric's testcase [Wow, tests do help!  We should try writing more of them!],
> I finally figured out what's going on.  I wrongly assumed perf_event_read_value()
> is destructive, but it's not, it just reads the current value.  So on a WRMSR,
> KVM offsets the value with the current perf event, and then *mostly* adjusts for
> it when reading the counter.
>
> But that is obviously super fragile because it means pmc->counter must never be
> read directly unless the perf event is paused and the accumulated counter has been
> propagated to pmc->counter.  Blech.
>
> I fiddled with a variety of things, but AFAICT the easiest solution is also the
> most obviously correct: set perf's count to the guest's count.  Lightly tested
> patch below.

That seems correct to me. :)

> On a related topic, Mingwei also appears to have found another bug: prev_counter
> needs to be set when the counter is written, i.e. my proposed pmc_write_counter()
> also needs to update prev_counter.
>
> Though that also raises the question of whether or not zeroing prev_counter in
> reprogram_counter() is correct.  Unless I'm missing something, reprogram_counter()
> should also set pmc->prev_counter to pmc->counter when the counter is successfully
> (re)enabled.
>
> And Jim also pointed out that prev_counter needs to be set even when KVM fails
> to enable a perf event (software counting should still work).

This whole prev_counter business is a mess. The software counting
increments by at most one per VM-exit. The only time we need to
concern ourselves about overflow is when there has been a software
increment and the paused counter value is -1. That is the only
situation in which software counting can trigger a PMI. Even if the
VM-exit was for a WRMSR to the PMC, assuming that the value just
written by the guest is the value we see when we pause the counter. Or
am I missing something?

> [*] https://lore.kernel.org/all/CALMp9eRfeFFb6n22Uf4R2Pf8WW7BVLX_Vuf04WFwiMtrk14Y-Q@xxxxxxxxxxxxxx
>
> ---
>  arch/x86/kvm/pmu.h           |  8 ++++++++
>  arch/x86/kvm/svm/pmu.c       |  2 +-
>  arch/x86/kvm/vmx/pmu_intel.c |  4 ++--
>  include/linux/perf_event.h   |  2 ++
>  kernel/events/core.c         | 11 +++++++++++
>  5 files changed, 24 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> index 7d9ba301c090..ba91a78e4dc1 100644
> --- a/arch/x86/kvm/pmu.h
> +++ b/arch/x86/kvm/pmu.h
> @@ -74,6 +74,14 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
>         return counter & pmc_bitmask(pmc);
>  }
>
> +static inline void pmc_write_counter(struct kvm_pmc *pmc, u64 val)
> +{
> +       if (pmc->perf_event && !pmc->is_paused)
> +               perf_event_set_count(pmc->perf_event, val);
> +
> +       pmc->counter = val;
> +}
> +
>  static inline void pmc_release_perf_event(struct kvm_pmc *pmc)
>  {
>         if (pmc->perf_event) {
> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
> index cef5a3d0abd0..373ff6a6687b 100644
> --- a/arch/x86/kvm/svm/pmu.c
> +++ b/arch/x86/kvm/svm/pmu.c
> @@ -160,7 +160,7 @@ static int amd_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>         /* MSR_PERFCTRn */
>         pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_COUNTER);
>         if (pmc) {
> -               pmc->counter += data - pmc_read_counter(pmc);
> +               pmc_write_counter(pmc, data);
>                 pmc_update_sample_period(pmc);
>                 return 0;
>         }
> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> index 80c769c58a87..18a658aa2a8d 100644
> --- a/arch/x86/kvm/vmx/pmu_intel.c
> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> @@ -406,11 +406,11 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>                         if (!msr_info->host_initiated &&
>                             !(msr & MSR_PMC_FULL_WIDTH_BIT))
>                                 data = (s64)(s32)data;
> -                       pmc->counter += data - pmc_read_counter(pmc);
> +                       pmc_write_counter(pmc, data);
>                         pmc_update_sample_period(pmc);
>                         break;
>                 } else if ((pmc = get_fixed_pmc(pmu, msr))) {
> -                       pmc->counter += data - pmc_read_counter(pmc);
> +                       pmc_write_counter(pmc, data);
>                         pmc_update_sample_period(pmc);
>                         break;
>                 } else if ((pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0))) {
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index d5628a7b5eaa..8fcd52a87ba2 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -1677,6 +1677,7 @@ extern void perf_event_disable_inatomic(struct perf_event *event);
>  extern void perf_event_task_tick(void);
>  extern int perf_event_account_interrupt(struct perf_event *event);
>  extern int perf_event_period(struct perf_event *event, u64 value);
> +extern void perf_event_set_count(struct perf_event *event, u64 count);
>  extern u64 perf_event_pause(struct perf_event *event, bool reset);
>  #else /* !CONFIG_PERF_EVENTS: */
>  static inline void *
> @@ -1760,6 +1761,7 @@ static inline int perf_event_period(struct perf_event *event, u64 value)
>  {
>         return -EINVAL;
>  }
> +static inline perf_event_set_count(struct perf_event *event, u64 count) { }
>  static inline u64 perf_event_pause(struct perf_event *event, bool reset)
>  {
>         return 0;
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index db016e418931..d368c283eba5 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -5646,6 +5646,17 @@ static void _perf_event_reset(struct perf_event *event)
>         perf_event_update_userpage(event);
>  }
>
> +void perf_event_set_count(struct perf_event *event, u64 count)
> +{
> +       struct perf_event_context *ctx;
> +
> +       ctx = perf_event_ctx_lock(event);
> +       (void)perf_event_read(event, false);
> +       local64_set(&event->count, count);
> +       perf_event_ctx_unlock(event, ctx);
> +}
> +EXPORT_SYMBOL_GPL(perf_event_set_count);
> +
>  /* Assume it's not an event with inherit set. */
>  u64 perf_event_pause(struct perf_event *event, bool reset)
>  {
>
> base-commit: 5ae85a1bd17b959796f6cc4c1153ceada2cf8f24
> --




[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux