On Tue, Aug 22, 2023 at 2:30 AM Like Xu <like.xu.linux@xxxxxxxxx> wrote:
>
> On 1/7/2023 5:26 am, Sean Christopherson wrote:
> > Ugh, yeah, de0f619564f4 created a bit of a mess. The underlying issue that it
> > was solving is that perf_event_read_value() and friends might sleep (yay mutex),
> > and so can't be called from KVM's fastpath (IRQs disabled).
>
> Updating PMU counters for emulated instructions causes trouble.
>
> > However, detecting overflow requires calling perf_event_read_value() to gather
> > the accumulated count from the hardware event in order to add it to the emulated
> > count from software. E.g. if pmc->counter is X and the perf event counter is Y,
> > KVM needs to factor in Y because X+Y+1 might overflow even if X+1 does not.
> >
> > Trying to snapshot the previous counter value is a bit of a mess. It could probably
> > be made to work, but it's hard to reason about what the snapshot actually contains
> > and when it should be cleared, especially when factoring in the wrapping logic.
> >
> > Rather than snapshot the previous counter, I think it makes sense to:
> >
> > 1) Track the number of emulated counter events
>
> If events are counted separately, the challenge here is to correctly time
> the emulation of counter overflows, which can occur on either side while
> the two counter values are out of sync.
>
> > 2) Accumulate and reset the counts from perf_event and emulated_counter into
> >    pmc->counter when pausing the PMC
> > 3) Pause and reprogram the PMC on writes (instead of the current approach of
> >    blindly updating the sample period)
>
> Updating the sample period is the only interface for KVM to configure hw
> behaviour on a hw counter. I note that perf_event_set_count() will be proposed,
> and I'm pessimistic about this change.
>
> > 4) Pause the counter when stopping the perf_event to ensure pmc->counter is
> >    fresh (instead of manually updating pmc->counter)
> >
> > IMO, that yields more intuitive logic, and makes it easier to reason about
> > correctness since the behavior is easily defined: pmc->counter holds the counts
> > that have been gathered and processed, perf_event and emulated_counter hold
> > outstanding counts on top. E.g. on a WRMSR to the counter, both the emulated
> > counter and the hardware counter are reset, because whatever counts existed
> > previously are irrelevant.
>
> If we take the hardware view, a counter, emulated or not, just increments
> and overflows at the threshold. The missing logic here is that the counter
> is truncated when writing high bit-width values, and how to deal with the
> value pmc->prev_counter held before pmc->counter was truncated.
>
> > Pausing the counter _might_ make WRMSR slower, but we need to get this all
> > functionally correct before worrying too much about performance.
>
> Performance, security and correctness should all be considered at the beginning.

+1 on the performance part. I did several rounds of performance testing: pausing
the counter is fast, but restarting the counter is *super* slow. The extra
overhead might just make the vPMU useless, especially when the guest workload
takes full CPU/memory resources in a VM (like SPEC2017 does).

> > Diff below for what I'm thinking (needs to be split into multiple patches). It's
> > *very* lightly tested.
>
> It saddens me that no one has come up with an actual low-level counter-test for
> this issue.
>
> > I'm about to disappear for a week, I'll pick this back up when I return. In
> > the meantime, any testing and/or input would be much appreciated!
>
> How about accepting Roman's original fix and then exercising the rewriting genius?
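To make the overflow argument above concrete: here is a toy sketch of the
X / Y bookkeeping Sean describes. All names (pmc_model, hw_count, etc.) are
illustrative stand-ins, not actual KVM code; the point is only that the
overflow check must include the outstanding hardware count Y, because
X+Y+1 can wrap even when X+1 does not.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical, simplified model: pmc->counter is the already-processed
 * count (X), hw_count stands in for the outstanding perf_event count (Y),
 * and mask is the architectural counter width, e.g. (1ULL << 48) - 1. */
struct pmc_model {
	uint64_t counter;	/* processed count (X) */
	uint64_t hw_count;	/* outstanding hardware count (Y) */
	uint64_t mask;		/* counter width mask */
};

/* Return true if one more emulated event overflows the counter.
 * Checking X+1 alone is wrong; X+Y+1 may wrap past the width first. */
static bool pmc_emulated_event_overflows(const struct pmc_model *pmc)
{
	uint64_t prev = (pmc->counter + pmc->hw_count) & pmc->mask;
	uint64_t next = (prev + 1) & pmc->mask;

	return next < prev;	/* wrapped past the counter width */
}
```

E.g. with a 48-bit counter at X = mask-2, adding one event does not overflow
when Y = 0, but does once Y = 2 is factored in.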
+1. I think the best option would be to just apply the fix in the short term
and put the refactoring of the emulated counter in a follow-up series.
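For reference, the pause/accumulate/reset semantics of steps 2)-4) above can
be sketched as below. This is a toy model under assumed names (pmc_sketch,
hw_count, the *_sketch helpers are hypothetical, not the proposed KVM diff);
it only illustrates the invariant that pmc->counter holds processed counts
while perf_event and emulated_counter hold outstanding counts on top, and
that a WRMSR discards everything outstanding and truncates the written value
to the counter width.

```c
#include <stdint.h>

/* Illustrative state, mirroring the proposed split of counts. */
struct pmc_sketch {
	uint64_t counter;		/* processed counts */
	uint64_t emulated_counter;	/* outstanding emulated events */
	uint64_t hw_count;		/* stands in for the perf_event count */
	uint64_t mask;			/* counter width mask */
};

/* Steps 2)/4): pausing folds all outstanding counts into pmc->counter,
 * so pmc->counter is guaranteed fresh afterwards. */
static void pmc_pause_sketch(struct pmc_sketch *pmc)
{
	pmc->counter = (pmc->counter + pmc->hw_count +
			pmc->emulated_counter) & pmc->mask;
	pmc->hw_count = 0;
	pmc->emulated_counter = 0;
}

/* Step 3): a write pauses the counter, then resets both the emulated and
 * hardware counts, because whatever counts existed before the write are
 * irrelevant; the written value is truncated to the counter width. */
static void pmc_write_counter_sketch(struct pmc_sketch *pmc, uint64_t val)
{
	pmc_pause_sketch(pmc);
	pmc->counter = val & pmc->mask;
	pmc->hw_count = 0;
	pmc->emulated_counter = 0;
}
```

Note how the truncation point is explicit here, which is exactly the spot
Like Xu flags as needing care with pmc->prev_counter.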