On Fri, Dec 6, 2024 at 8:34 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Wed, Dec 04, 2024, Jim Mattson wrote:
> > Wherever the context-switching happens, I contend that there is no
> > "clean" virtualization of APERF. If it comes down to just a question
> > of VM-entry/VM-exit or vcpu_load()/vcpu_put(), we can collect some
> > performance numbers and try to come to a consensus, but if you're
> > fundamentally opposed to virtualizing APERF, because it's messy, then
> > I don't see any point in pursuing this further.
>
> I'm not fundamentally opposed to virtualizing the feature. My complaints with
> the series are that it doesn't provide sufficient information to make it feasible
> for reviewers to provide useful feedback. The history you provided is a great
> start, but that's still largely just background information. For a feature as
> messy and subjective as APERF/MPERF, I think we need at least the following:
>
> 1. What use cases are being targeted (e.g. because targeting only SoH would
>    allow for a different implementation).
> 2. The exact requirements, especially with respect to host usage. And the
>    motivation behind those requirements.
> 3. The high level design choices, and what, if any, alternatives were considered.
> 4. Basic rules of thumb for what is/isn't accounted in APERF/MPERF, so that it's
>    feasible to actually maintain support long-term.
>
> E.g. does the host need to retain access to APERF/MPERF at all times? If so, why?
> Do we care about host kernel accesses, e.g. in the scheduler, or just userspace
> accesses, e.g. turbostat?
>
> What information is the host intended to see? E.g. should APERF and MPERF stop
> when transitioning to the guest? If not, what are the intended semantics for the
> host's view when running VMs with HLT-exiting disabled? If the host's view of
> APERF and MPERF account guest time, how does that mesh with upcoming mediated PMU,
> where the host is disallowed from observing the guest?
>
> Is there a plan for supporting VMs with a different TSC frequency than the host?
> How will live migration work, without generating too much slop/skew between MPERF
> and GUEST_TSC?
>
> I don't expect the series to answer every possible question upfront, but the RFC
> provided _nothing_, just a "here's what we implemented, please review".

Sean,

Thanks for the thoughtful pushback. You're right: the RFC needs more context about our design choices and rationale.

My response has been delayed because I've been researching the reads of IA32_APERF and IA32_MPERF on every scheduler tick. Thanks for pointing me to commit 1567c3e3467c ("x86, sched: Add support for frequency invariance") off-list. Frequency-invariant scheduling was, in fact, the original source of these RDMSRs. The code in that commit would have allowed us to virtualize APERFMPERF without triggering these frequent MSR reads, because arch_scale_freq_tick() had an early return if arch_scale_freq_invariant() was false, and since KVM does not virtualize MSR_TURBO_RATIO_LIMIT, arch_scale_freq_invariant() would be false in a VM. However, in commit bb6e89df9028 ("x86/aperfmperf: Make parts of the frequency invariance code unconditional"), the early return was weakened to bail out only if cpu_feature_enabled(X86_FEATURE_APERFMPERF) is false. Hence, even if frequency-invariant scheduling is disabled, Linux will read the MSRs on every scheduler tick whenever APERFMPERF is available.
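To make the difference concrete, here is a rough paraphrase of the guard in arch_scale_freq_tick() before and after that change (simplified for illustration; not the exact upstream code):

    /* Roughly the shape introduced by 1567c3e3467c: */
    void arch_scale_freq_tick(void)
    {
            u64 aperf, mperf;

            if (!arch_scale_freq_invariant())
                    return;         /* no MSR reads unless freq invariance is in use */

            rdmsrl(MSR_IA32_APERF, aperf);
            rdmsrl(MSR_IA32_MPERF, mperf);
            /* ... update the per-CPU frequency scale from the deltas ... */
    }

    /* Roughly the shape after bb6e89df9028: */
    void arch_scale_freq_tick(void)
    {
            u64 aperf, mperf;

            if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
                    return;         /* reads happen whenever APERF/MPERF exist at all */

            rdmsrl(MSR_IA32_APERF, aperf);
            rdmsrl(MSR_IA32_MPERF, mperf);
            /* ... samples kept for /proc/cpuinfo and, if enabled, freq invariance ... */
    }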
As we discussed off-list, it appears that the primary motivation for this change was to minimize the crosscalls executed when examining /proc/cpuinfo. I don't really think that use case justifies reading these MSRs *every scheduler tick*, but I'm admittedly biased.

1. Guest Requirements

Unlike vPMU, which is primarily a development tool, our customers want APERFMPERF enabled on their production VMs, and they are unwilling to trade any amount of performance for the feature. They don't want frequency-invariant scheduling; they just want to observe the effective frequency (possibly via /proc/cpuinfo). These requests are not limited to slice-of-hardware VMs.

No one can tell me what customers expect with respect to KVM "steal time," but it seems to me that it would be disingenuous to ignore it. By analogy with HDC, the effective frequency should drop to zero when the vCPU is "forced idle."

2. Host Requirements

The host needs reliable APERF/MPERF access for:
- Frequency-invariant scheduling
- Monitoring through /proc/cpuinfo
- Turbostat, maybe?

Our goal was for host APERFMPERF to work as it always has, counting both host cycles and guest cycles. We lose cycles on every WRMSR, but most of the time, the loss should be very small relative to the measurement. To be honest, we had not even considered breaking APERF/MPERF on the host; we didn't think such an approach would have any chance of upstream acceptance.

3. Design Choices

We evaluated three approaches:

a) Userspace virtualization via MSR filtering

This approach was implemented before we knew about frequency-invariant scheduling. Because of the frequent guest reads, we observed a 10-16% performance hit, depending on vCPU count. The performance impact was exacerbated by contention for a legacy PIC mutex on KVM_RUN, but even if the mutex were replaced with a reader/writer lock, the performance impact would still be too high. Hence, we abandoned this approach.

b) KVM intercepts RDMSR of APERF/MPERF

This approach was ruled out by a back-of-the-envelope calculation (recapped below). We're not going to be able to provide this feature for free, but we could argue that 0.01% overhead is negligible. On a 2 GHz processor, that gives us a budget of 200,000 cycles per second. With a 250 Hz guest tick generating 500 RDMSR intercepts per second, we have a budget of just 400 cycles per intercept. That's likely to be insufficient for most platforms, and a guest with CONFIG_HZ_1000 would drop the budget to just 100 cycles per intercept, which is unachievable. We should have a discussion about just how much overhead is negligible; that may open the door to other implementation options.

c) Current RDMSR pass-through approach

The biggest downside is the loss of cycles on every WRMSR. An NMI or SMI in the critical region could result in millions of lost cycles. However, the damage only persists until all in-progress measurements are completed.

We had considered context-switching the host and guest values on VM-entry and VM-exit. This would have kept everything within KVM, as long as the host doesn't access the MSRs during an NMI or SMI. However, 4 additional RDMSRs and 4 additional WRMSRs on a VM-enter/VM-exit round-trip would have blown the budget. Even without APERFMPERF, an active guest vCPU takes a minimum of two VM-exits per timer tick, so we have even less budget per VM-enter/VM-exit round-trip than we had per RDMSR intercept in (b).
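For reference, the arithmetic behind the budget figures in (b), assuming a 2 GHz guest, a 0.01% overhead target, and one read of each of IA32_APERF and IA32_MPERF per guest tick (my assumption about where the 500 intercepts per second come from):

    2e9 cycles/sec * 0.0001                   = 200,000 cycles/sec of budget
    250 ticks/sec  * 2 RDMSR intercepts/tick  = 500 intercepts/sec    => 400 cycles/intercept
    1000 ticks/sec * 2 RDMSR intercepts/tick  = 2,000 intercepts/sec  => 100 cycles/intercept

By the same arithmetic, in (c) the 200,000 cycles/sec would have to cover 4 RDMSRs and 4 WRMSRs on every VM-enter/VM-exit round-trip, of which there are at least 500 per second for an active vCPU with a 250 Hz guest tick.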
Internally, we have already moved the mediated vPMU context switch from VM-entry/VM-exit to the KVM_RUN loop boundaries, so it seemed natural to do the same for APERFMPERF. I don't have a back-of-the-envelope calculation for this overhead, but I have run Virtuozzo's cpuid_rate benchmark in a guest with and without APERFMPERF, 100 times for each configuration, and a Student's t-test showed no statistically significant difference between the means of the two datasets.

4. APERF/MPERF Accounting

Virtual MPERF cycles are easy to define: they accumulate at the virtual TSC frequency as long as the vCPU is in C0. There are only a few ways the vCPU can leave C0. If HLT or MWAIT exiting is disabled, then the vCPU can leave C0 in VMX non-root operation (or AMD guest mode). If HLT exiting is enabled, then the vCPU will leave C0 when a HLT instruction is intercepted, and it will reenter C0 when it receives an interrupt (or a PV kick) and starts running again.

Virtual APERF cycles are more ambiguous, especially in VMX root operation (or AMD host mode). I think we can all agree that they should accumulate at some non-zero rate as long as the code being executed on the logical processor contributes in some way to guest vCPU progress, but should the virtual APERF accumulate cycles at the same frequency as the physical APERF? Probably not. Ultimately, the decision was pragmatic: virtual APERF accumulates at the same rate as physical APERF while the guest context is live in the MSR. Doing anything else would have been too expensive.

5. Live Migration

The IA32_MPERF MSR is serialized independently of the IA32_TIME_STAMP_COUNTER MSR. Yes, this means that the two MSRs do not advance in lock step across live migration, but this is no different from a general purpose vPMU counter programmed to count "unhalted reference cycles." In general, our implementation of guest IA32_MPERF is far superior to the vPMU implementation of "unhalted reference cycles."

6. Guest TSC Scaling

It is not possible to support TSC scaling with IA32_MPERF RDMSR pass-through on Intel CPUs, because reads of IA32_MPERF in VMX non-root operation are not scaled by the hardware. It is possible to support TSC scaling with IA32_MPERF RDMSR pass-through on AMD CPUs, but the implementation is left as an exercise for the reader.
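To tie (c) and (4) together, here is a minimal, hypothetical sketch of what swapping host and guest APERF/MPERF at the KVM_RUN loop boundaries could look like. The struct and function names are invented for illustration; this is not the code in the series, and it ignores TSC scaling entirely:

    #include <linux/types.h>
    #include <asm/msr.h>

    /*
     * Hypothetical sketch only.  Guest values live in the physical MSRs
     * while the vCPU is inside the KVM_RUN loop; on the way out, the
     * host values are restored and credited with the cycles that elapsed
     * in guest context, so the host view keeps counting both host and
     * guest cycles.
     */
    struct aperfmperf_ctx {
    	u64 guest_aperf, guest_mperf;   /* accumulated guest counts */
    	u64 host_aperf, host_mperf;     /* host snapshot taken at load */
    };

    static void aperfmperf_load_guest(struct aperfmperf_ctx *ctx)
    {
    	rdmsrl(MSR_IA32_APERF, ctx->host_aperf);
    	rdmsrl(MSR_IA32_MPERF, ctx->host_mperf);
    	wrmsrl(MSR_IA32_APERF, ctx->guest_aperf);
    	wrmsrl(MSR_IA32_MPERF, ctx->guest_mperf);
    	/* Cycles elapsing between each RDMSR/WRMSR pair are lost. */
    }

    static void aperfmperf_put_guest(struct aperfmperf_ctx *ctx)
    {
    	u64 aperf, mperf, acnt, mcnt;

    	rdmsrl(MSR_IA32_APERF, aperf);
    	rdmsrl(MSR_IA32_MPERF, mperf);
    	acnt = aperf - ctx->guest_aperf;    /* cycles accumulated while guest context was live */
    	mcnt = mperf - ctx->guest_mperf;
    	ctx->guest_aperf = aperf;
    	ctx->guest_mperf = mperf;
    	/* Credit the guest's cycles to the host view as well. */
    	wrmsrl(MSR_IA32_APERF, ctx->host_aperf + acnt);
    	wrmsrl(MSR_IA32_MPERF, ctx->host_mperf + mcnt);
    }

The WRMSRs are where cycles are lost; an NMI or SMI landing between a RDMSR/WRMSR pair widens that window, which is the downside called out in (c).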