Re: [PATCH] KVM: x86: vPMU: truncate counter value to allowed width

Like Xu <like.xu.linux@xxxxxxxxx> · Tue, 22 Aug 2023 17:29:56 +0800

On 1/7/2023 5:26 am, Sean Christopherson wrote:
Ugh, yeah, de0f619564f4 created a bit of a mess.  The underlying issue that it
was solving is that perf_event_read_value() and friends might sleep (yay mutex),
and so can't be called from KVM's fastpath (IRQs disabled).
Updating pmu counters for emulated instructions cause troubles.

However, detecting overflow requires reading perf_event_read_value() to gather
the accumulated count from the hardware event in order to add it to the emulated
count from software.  E.g. if pmc->counter is X and the perf event counter is Y,
KVM needs to factor in Y because X+Y+1 might overflow even if X+1 does not.

Trying to snapshot the previous counter value is a bit of a mess.  It could probably
made to work, but it's hard to reason about what the snapshot actually contains
and when it should be cleared, especially when factoring in the wrapping logic.

Rather than snapshot the previous counter, I think it makes sense to:

   1) Track the number of emulated counter events

If events are counted separately, the challenge here is to correctly time
the emulation of counter overflows, which can occur on both sides of the
counter values out of sync.

   2) Accumulate and reset the counts from perf_event and emulated_counter into
      pmc->counter when pausing the PMC
   3) Pause and reprogram the PMC on writes (instead of the current approach of
      blindly updating the sample period)

Updating the sample period is the only interface for KVM to configure hw
behaviour on hw-ctr. I note that perf_event_set_count() will be proposed,
and I'm pessimistic about this change.

   4) Pause the counter when stopping the perf_event to ensure pmc->counter is
      fresh (instead of manually updating pmc->counter)

IMO, that yields more intuitive logic, and makes it easier to reason about
correctness since the behavior is easily define: pmc->counter holds the counts
that have been gathered and processed, perf_event and emulated_counter hold
outstanding counts on top.  E.g. on a WRMSR to the counter, both the emulated
counter and the hardware counter are reset, because whatever counts existed
previously are irrelevant.

If we take the hardware view, a counter, emulated or not, just increments
and overflows at the threshold. The missing logic here is when the counter
is truncated when writing high bit-width values, and how to deal with the
value of pmc->prev_counter was before pmc->counter was truncated.

Pausing the counter_might_  make WRMSR slower, but we need to get this all
functionally correct before worrying too much about performance.

Performance, security and correctness should all be considered at the beginning.

Diff below for what I'm thinking (needs to be split into multiple patches).  It's
*very*  lightly tested.

It saddens me that no one has come up with an actual low-level counter-test for 
this issue.

I'm about to disappear for a week, I'll pick this back up when I get return.  In
the meantime, any testing and/or input would be much appreciated!

How about accepting Roman's original fix and then exercising the rewriting genius ?