+Mingwei

On Thu, Jun 29, 2023, Jim Mattson wrote:
> On Mon, Jun 5, 2023 at 5:51 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >
> > On Thu, May 04, 2023, Roman Kagan wrote:
> > > diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> > > index 5c7bbf03b599..6a91e1afef5a 100644
> > > --- a/arch/x86/kvm/pmu.h
> > > +++ b/arch/x86/kvm/pmu.h
> > > @@ -60,6 +60,12 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
> > >  	return counter & pmc_bitmask(pmc);
> > >  }
> > >
> > > +static inline void pmc_set_counter(struct kvm_pmc *pmc, u64 val)
> > > +{
> > > +	pmc->counter += val - pmc_read_counter(pmc);
> >
> > Ugh, not your code, but I don't see how the current code can possibly be correct.
> >
> > The above unpacks to
> >
> > 	counter = pmc->counter;
> > 	if (pmc->perf_event && !pmc->is_paused)
> > 		counter += perf_event_read_value(pmc->perf_event,
> > 						 &enabled, &running);
> > 	pmc->counter += val - (counter & pmc_bitmask(pmc));
> >
> > which distills down to
> >
> > 	counter = 0;
> > 	if (pmc->perf_event && !pmc->is_paused)
> > 		counter += perf_event_read_value(pmc->perf_event,
> > 						 &enabled, &running);
> > 	pmc->counter = val - (counter & pmc_bitmask(pmc));
> >
> > or more succinctly
> >
> > 	if (pmc->perf_event && !pmc->is_paused)
> > 		val -= perf_event_read_value(pmc->perf_event, &enabled, &running);
> >
> > 	pmc->counter = val;
> >
> > which is obviously wrong.  E.g. if the guest writes '0' to an active counter, the
> > adjustment will cause pmc->counter to be loaded with a large (in unsigned terms)
> > value, and thus quickly overflow after a write of '0'.
>
> This weird construct goes all the way back to commit f5132b01386b
> ("KVM: Expose a version 2 architectural PMU to a guests"). Paolo
> killed it in commit 2924b52117b2 ("KVM: x86/pmu: do not mask the value
> that is written to fixed PMUs"), perhaps by accident. Eric then
> resurrected it in commit 4400cf546b4b ("KVM: x86: Fix perfctr WRMSR
> for running counters").
>
> It makes no sense to me. WRMSR should just set the new value of the
> counter, regardless of the old value or whether or not it is running.

Heh, that didn't stop you from giving Eric's patch a thumbs-up[*] :-)

Thanks to Eric's testcase [Wow, tests do help!  We should try writing more of them!],
I finally figured out what's going on.  I wrongly assumed perf_event_read_value()
is destructive, but it's not, it just reads the current value.  So on a WRMSR,
KVM offsets the written value by the current perf event count, and then *mostly*
adjusts for it when reading the counter.

But that is obviously super fragile, because it means pmc->counter must never be
read directly unless the perf event is paused and the accumulated count has been
propagated to pmc->counter.  Blech.

I fiddled with a variety of things, but AFAICT the easiest solution is also the
most obviously correct one: set perf's count to the guest's count.  Lightly tested
patch below.

On a related topic, Mingwei also appears to have found another bug: prev_counter
needs to be set when the counter is written, i.e. my proposed pmc_write_counter()
also needs to update prev_counter.

Though that also raises the question of whether or not zeroing prev_counter in
reprogram_counter() is correct.  Unless I'm missing something, reprogram_counter()
should also set pmc->prev_counter to pmc->counter when the counter is successfully
(re)enabled.  And Jim also pointed out that prev_counter needs to be set even when
KVM fails to enable a perf event (software counting should still work).
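E.g. something like the below for the WRMSR path, though that's purely an untested
sketch on top of the patch further down, and it assumes that simply mirroring the
written value into pmc->prev_counter is the right behavior (whether reprogram_counter()
also needs to refresh prev_counter is a separate question):

	/* Untested sketch: keep prev_counter coherent with a guest WRMSR. */
	static inline void pmc_write_counter(struct kvm_pmc *pmc, u64 val)
	{
		if (pmc->perf_event && !pmc->is_paused)
			perf_event_set_count(pmc->perf_event, val);

		pmc->counter = val;

		/* Assumption: the written value is also the new software-count baseline. */
		pmc->prev_counter = val;
	}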
[*] https://lore.kernel.org/all/CALMp9eRfeFFb6n22Uf4R2Pf8WW7BVLX_Vuf04WFwiMtrk14Y-Q@xxxxxxxxxxxxxx

---
 arch/x86/kvm/pmu.h           |  8 ++++++++
 arch/x86/kvm/svm/pmu.c       |  2 +-
 arch/x86/kvm/vmx/pmu_intel.c |  4 ++--
 include/linux/perf_event.h   |  2 ++
 kernel/events/core.c         | 11 +++++++++++
 5 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 7d9ba301c090..ba91a78e4dc1 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -74,6 +74,14 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
 	return counter & pmc_bitmask(pmc);
 }
 
+static inline void pmc_write_counter(struct kvm_pmc *pmc, u64 val)
+{
+	if (pmc->perf_event && !pmc->is_paused)
+		perf_event_set_count(pmc->perf_event, val);
+
+	pmc->counter = val;
+}
+
 static inline void pmc_release_perf_event(struct kvm_pmc *pmc)
 {
 	if (pmc->perf_event) {
diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index cef5a3d0abd0..373ff6a6687b 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -160,7 +160,7 @@ static int amd_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	/* MSR_PERFCTRn */
 	pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_COUNTER);
 	if (pmc) {
-		pmc->counter += data - pmc_read_counter(pmc);
+		pmc_write_counter(pmc, data);
 		pmc_update_sample_period(pmc);
 		return 0;
 	}
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 80c769c58a87..18a658aa2a8d 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -406,11 +406,11 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			if (!msr_info->host_initiated &&
 			    !(msr & MSR_PMC_FULL_WIDTH_BIT))
 				data = (s64)(s32)data;
-			pmc->counter += data - pmc_read_counter(pmc);
+			pmc_write_counter(pmc, data);
 			pmc_update_sample_period(pmc);
 			break;
 		} else if ((pmc = get_fixed_pmc(pmu, msr))) {
-			pmc->counter += data - pmc_read_counter(pmc);
+			pmc_write_counter(pmc, data);
 			pmc_update_sample_period(pmc);
 			break;
 		} else if ((pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0))) {
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index d5628a7b5eaa..8fcd52a87ba2 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1677,6 +1677,7 @@ extern void perf_event_disable_inatomic(struct perf_event *event);
 extern void perf_event_task_tick(void);
 extern int perf_event_account_interrupt(struct perf_event *event);
 extern int perf_event_period(struct perf_event *event, u64 value);
+extern void perf_event_set_count(struct perf_event *event, u64 count);
 extern u64 perf_event_pause(struct perf_event *event, bool reset);
 #else /* !CONFIG_PERF_EVENTS: */
 static inline void *
@@ -1760,6 +1761,7 @@ static inline int perf_event_period(struct perf_event *event, u64 value)
 {
 	return -EINVAL;
 }
+static inline void perf_event_set_count(struct perf_event *event, u64 count) { }
 static inline u64 perf_event_pause(struct perf_event *event, bool reset)
 {
 	return 0;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index db016e418931..d368c283eba5 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5646,6 +5646,17 @@ static void _perf_event_reset(struct perf_event *event)
 	perf_event_update_userpage(event);
 }
 
+void perf_event_set_count(struct perf_event *event, u64 count)
+{
+	struct perf_event_context *ctx;
+
+	ctx = perf_event_ctx_lock(event);
+	(void)perf_event_read(event, false);
+	local64_set(&event->count, count);
+	perf_event_ctx_unlock(event, ctx);
+}
+EXPORT_SYMBOL_GPL(perf_event_set_count);
+
 /* Assume it's not an event with inherit set. */
 u64 perf_event_pause(struct perf_event *event, bool reset)
 {

base-commit: 5ae85a1bd17b959796f6cc4c1153ceada2cf8f24
--