On Mon, Jun 5, 2023 at 5:51 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote: > > On Thu, May 04, 2023, Roman Kagan wrote: > > diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h > > index 5c7bbf03b599..6a91e1afef5a 100644 > > --- a/arch/x86/kvm/pmu.h > > +++ b/arch/x86/kvm/pmu.h > > @@ -60,6 +60,12 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc) > > return counter & pmc_bitmask(pmc); > > } > > > > +static inline void pmc_set_counter(struct kvm_pmc *pmc, u64 val) > > +{ > > + pmc->counter += val - pmc_read_counter(pmc); > > Ugh, not your code, but I don't see how the current code can possibly be correct. > > The above unpacks to > > counter = pmc->counter; > if (pmc->perf_event && !pmc->is_paused) > counter += perf_event_read_value(pmc->perf_event, > &enabled, &running); > pmc->counter += val - (counter & pmc_bitmask(pmc)); > > which distills down to > > counter = 0; > if (pmc->perf_event && !pmc->is_paused) > counter += perf_event_read_value(pmc->perf_event, > &enabled, &running); > pmc->counter = val - (counter & pmc_bitmask(pmc)); > > or more succinctly > > if (pmc->perf_event && !pmc->is_paused) > val -= perf_event_read_value(pmc->perf_event, &enabled, &running); > > pmc->counter = val; > > which is obviously wrong. E.g. if the guest writes '0' to an active counter, the > adjustment will cause pmc->counter to be loaded with a large (in unsigned terms) > value, and thus quickly overflow after a write of '0'. This weird construct goes all the way back to commit f5132b01386b ("KVM: Expose a version 2 architectural PMU to a guests"). Paolo killed it in commit 2924b52117b2 ("KVM: x86/pmu: do not mask the value that is written to fixed PMUs"), perhaps by accident. Eric then resurrected it in commit 4400cf546b4b ("KVM: x86: Fix perfctr WRMSR for running counters"). It makes no sense to me. WRMSR should just set the new value of the counter, regardless of the old value or whether or not it is running. > I assume the intent it to purge any accumulated counts that occurred since the > last read, but *before* the write, but then shouldn't this just be: > > /* Purge any events that were acculumated before the write. */ > if (pmc->perf_event && !pmc->is_paused) > (void)perf_event_read_value(pmc->perf_event, &enabled, &running); > > pmc->counter = val & pmc_bitmask(pmc); > > Am I missing something? > > I'd like to get this sorted out before applying this patch, because I really want > to document what on earth this new helper is doing. Seeing a bizarre partial > RMW operation in a helper with "set" as the verb is super weird.