On Thu, Jun 29, 2023 at 5:11 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote: > > +Mingwei > > On Thu, Jun 29, 2023, Jim Mattson wrote: > > On Mon, Jun 5, 2023 at 5:51 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote: > > > > > > On Thu, May 04, 2023, Roman Kagan wrote: > > > > diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h > > > > index 5c7bbf03b599..6a91e1afef5a 100644 > > > > --- a/arch/x86/kvm/pmu.h > > > > +++ b/arch/x86/kvm/pmu.h > > > > @@ -60,6 +60,12 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc) > > > > return counter & pmc_bitmask(pmc); > > > > } > > > > > > > > +static inline void pmc_set_counter(struct kvm_pmc *pmc, u64 val) > > > > +{ > > > > + pmc->counter += val - pmc_read_counter(pmc); > > > > > > Ugh, not your code, but I don't see how the current code can possibly be correct. > > > > > > The above unpacks to > > > > > > counter = pmc->counter; > > > if (pmc->perf_event && !pmc->is_paused) > > > counter += perf_event_read_value(pmc->perf_event, > > > &enabled, &running); > > > pmc->counter += val - (counter & pmc_bitmask(pmc)); > > > > > > which distills down to > > > > > > counter = 0; > > > if (pmc->perf_event && !pmc->is_paused) > > > counter += perf_event_read_value(pmc->perf_event, > > > &enabled, &running); > > > pmc->counter = val - (counter & pmc_bitmask(pmc)); > > > > > > or more succinctly > > > > > > if (pmc->perf_event && !pmc->is_paused) > > > val -= perf_event_read_value(pmc->perf_event, &enabled, &running); > > > > > > pmc->counter = val; > > > > > > which is obviously wrong. E.g. if the guest writes '0' to an active counter, the > > > adjustment will cause pmc->counter to be loaded with a large (in unsigned terms) > > > value, and thus quickly overflow after a write of '0'. > > > > This weird construct goes all the way back to commit f5132b01386b > > ("KVM: Expose a version 2 architectural PMU to a guests"). Paolo > > killed it in commit 2924b52117b2 ("KVM: x86/pmu: do not mask the value > > that is written to fixed PMUs"), perhaps by accident. Eric then > > resurrected it in commit 4400cf546b4b ("KVM: x86: Fix perfctr WRMSR > > for running counters"). > > > > It makes no sense to me. WRMSR should just set the new value of the > > counter, regardless of the old value or whether or not it is running. > > Heh, didn't stop you from giving Eric's patch a thumbs-up[*] :-) I was smarter then, and probably understood what was going on. > Thanks to Eric's testcase [Wow, tests do help! We should try writing more of them!], > I finally figured out what's going on. I wrongly assumed perf_event_read_value() > is destructive, but it's not, it just reads the current value. So on a WRMSR, > KVM offsets the value with the current perf event, and then *mostly* adjusts for > it when reading the counter. > > But that is obviously super fragile because it means pmc->counter must never be > read directly unless the perf event is paused and the accumulated counter has been > propagated to pmc->counter. Blech. > > I fiddled with a variety of things, but AFAICT the easiest solution is also the > most obviously correct: set perf's count to the guest's count. Lightly tested > patch below. That seems correct to me. :) > On a related topic, Mingwei also appears to have found another bug: prev_counter > needs to be set when the counter is written, i.e. my proposed pmc_write_counter() > also needs to update prev_counter. > > Though that also raises the question of whether or not zeroing prev_counter in > reprogram_counter() is correct. Unless I'm missing something, reprogram_counter() > should also set pmc->prev_counter to pmc->counter when the counter is successfully > (re)enabled. > > And Jim also pointed out that prev_counter needs to be set even when KVM fails > to enable a perf event (software counting should still work). This whole prev_counter business is a mess. The software counting increments by at most one per VM-exit. The only time we need to concern ourselves about overflow is when there has been a software increment and the paused counter value is -1. That is the only situation in which software counting can trigger a PMI. Even if the VM-exit was for a WRMSR to the PMC, assuming that the value just written by the guest is the value we see when we pause the counter. Or am I missing something? > [*] https://lore.kernel.org/all/CALMp9eRfeFFb6n22Uf4R2Pf8WW7BVLX_Vuf04WFwiMtrk14Y-Q@xxxxxxxxxxxxxx > > --- > arch/x86/kvm/pmu.h | 8 ++++++++ > arch/x86/kvm/svm/pmu.c | 2 +- > arch/x86/kvm/vmx/pmu_intel.c | 4 ++-- > include/linux/perf_event.h | 2 ++ > kernel/events/core.c | 11 +++++++++++ > 5 files changed, 24 insertions(+), 3 deletions(-) > > diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h > index 7d9ba301c090..ba91a78e4dc1 100644 > --- a/arch/x86/kvm/pmu.h > +++ b/arch/x86/kvm/pmu.h > @@ -74,6 +74,14 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc) > return counter & pmc_bitmask(pmc); > } > > +static inline void pmc_write_counter(struct kvm_pmc *pmc, u64 val) > +{ > + if (pmc->perf_event && !pmc->is_paused) > + perf_event_set_count(pmc->perf_event, val); > + > + pmc->counter = val; > +} > + > static inline void pmc_release_perf_event(struct kvm_pmc *pmc) > { > if (pmc->perf_event) { > diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c > index cef5a3d0abd0..373ff6a6687b 100644 > --- a/arch/x86/kvm/svm/pmu.c > +++ b/arch/x86/kvm/svm/pmu.c > @@ -160,7 +160,7 @@ static int amd_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) > /* MSR_PERFCTRn */ > pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_COUNTER); > if (pmc) { > - pmc->counter += data - pmc_read_counter(pmc); > + pmc_write_counter(pmc, data); > pmc_update_sample_period(pmc); > return 0; > } > diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c > index 80c769c58a87..18a658aa2a8d 100644 > --- a/arch/x86/kvm/vmx/pmu_intel.c > +++ b/arch/x86/kvm/vmx/pmu_intel.c > @@ -406,11 +406,11 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) > if (!msr_info->host_initiated && > !(msr & MSR_PMC_FULL_WIDTH_BIT)) > data = (s64)(s32)data; > - pmc->counter += data - pmc_read_counter(pmc); > + pmc_write_counter(pmc, data); > pmc_update_sample_period(pmc); > break; > } else if ((pmc = get_fixed_pmc(pmu, msr))) { > - pmc->counter += data - pmc_read_counter(pmc); > + pmc_write_counter(pmc, data); > pmc_update_sample_period(pmc); > break; > } else if ((pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0))) { > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h > index d5628a7b5eaa..8fcd52a87ba2 100644 > --- a/include/linux/perf_event.h > +++ b/include/linux/perf_event.h > @@ -1677,6 +1677,7 @@ extern void perf_event_disable_inatomic(struct perf_event *event); > extern void perf_event_task_tick(void); > extern int perf_event_account_interrupt(struct perf_event *event); > extern int perf_event_period(struct perf_event *event, u64 value); > +extern void perf_event_set_count(struct perf_event *event, u64 count); > extern u64 perf_event_pause(struct perf_event *event, bool reset); > #else /* !CONFIG_PERF_EVENTS: */ > static inline void * > @@ -1760,6 +1761,7 @@ static inline int perf_event_period(struct perf_event *event, u64 value) > { > return -EINVAL; > } > +static inline perf_event_set_count(struct perf_event *event, u64 count) { } > static inline u64 perf_event_pause(struct perf_event *event, bool reset) > { > return 0; > diff --git a/kernel/events/core.c b/kernel/events/core.c > index db016e418931..d368c283eba5 100644 > --- a/kernel/events/core.c > +++ b/kernel/events/core.c > @@ -5646,6 +5646,17 @@ static void _perf_event_reset(struct perf_event *event) > perf_event_update_userpage(event); > } > > +void perf_event_set_count(struct perf_event *event, u64 count) > +{ > + struct perf_event_context *ctx; > + > + ctx = perf_event_ctx_lock(event); > + (void)perf_event_read(event, false); > + local64_set(&event->count, count); > + perf_event_ctx_unlock(event, ctx); > +} > +EXPORT_SYMBOL_GPL(perf_event_set_count); > + > /* Assume it's not an event with inherit set. */ > u64 perf_event_pause(struct perf_event *event, bool reset) > { > > base-commit: 5ae85a1bd17b959796f6cc4c1153ceada2cf8f24 > --