On 1/7/2023 5:26 am, Sean Christopherson wrote:
Ugh, yeah, de0f619564f4 created a bit of a mess. The underlying issue that it was solving is that perf_event_read_value() and friends might sleep (yay mutex), and so can't be called from KVM's fastpath (IRQs disabled).
Updating pmu counters for emulated instructions cause troubles.
However, detecting overflow requires reading perf_event_read_value() to gather the accumulated count from the hardware event in order to add it to the emulated count from software. E.g. if pmc->counter is X and the perf event counter is Y, KVM needs to factor in Y because X+Y+1 might overflow even if X+1 does not. Trying to snapshot the previous counter value is a bit of a mess. It could probably made to work, but it's hard to reason about what the snapshot actually contains and when it should be cleared, especially when factoring in the wrapping logic. Rather than snapshot the previous counter, I think it makes sense to: 1) Track the number of emulated counter events
If events are counted separately, the challenge here is to correctly time the emulation of counter overflows, which can occur on both sides of the counter values out of sync.
2) Accumulate and reset the counts from perf_event and emulated_counter into pmc->counter when pausing the PMC 3) Pause and reprogram the PMC on writes (instead of the current approach of blindly updating the sample period)
Updating the sample period is the only interface for KVM to configure hw behaviour on hw-ctr. I note that perf_event_set_count() will be proposed, and I'm pessimistic about this change.
4) Pause the counter when stopping the perf_event to ensure pmc->counter is fresh (instead of manually updating pmc->counter) IMO, that yields more intuitive logic, and makes it easier to reason about correctness since the behavior is easily define: pmc->counter holds the counts that have been gathered and processed, perf_event and emulated_counter hold outstanding counts on top. E.g. on a WRMSR to the counter, both the emulated counter and the hardware counter are reset, because whatever counts existed previously are irrelevant.
If we take the hardware view, a counter, emulated or not, just increments and overflows at the threshold. The missing logic here is when the counter is truncated when writing high bit-width values, and how to deal with the value of pmc->prev_counter was before pmc->counter was truncated.
Pausing the counter_might_ make WRMSR slower, but we need to get this all functionally correct before worrying too much about performance.
Performance, security and correctness should all be considered at the beginning.
Diff below for what I'm thinking (needs to be split into multiple patches). It's *very* lightly tested.
It saddens me that no one has come up with an actual low-level counter-test for this issue.
I'm about to disappear for a week, I'll pick this back up when I get return. In the meantime, any testing and/or input would be much appreciated!
How about accepting Roman's original fix and then exercising the rewriting genius ?