On Wed, Dec 18, 2024, Jim Mattson wrote:
> On Fri, Dec 6, 2024 at 8:34 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> As we discussed off-list, it appears that the primary motivation for
> this change was to minimize the crosscalls executed when examining
> /proc/cpuinfo. I don't really think that use case justifies reading
> these MSRs *every scheduler tick*, but I'm admittedly biased.

Heh, yeah, we missed that boat by ~2 years. Or maybe KVM's "slow" emulation
would only have further angered the x86 maintainers :-)

> 1. Guest Requirements
>
> Unlike vPMU, which is primarily a development tool, our customers want
> APERFMPERF enabled on their production VMs, and they are unwilling to
> trade any amount of performance for the feature. They don't want
> frequency-invariant scheduling; they just want to observe the
> effective frequency (possibly via /proc/cpuinfo).
>
> These requests are not limited to slice-of-hardware VMs. No one can
> tell me what customers expect with respect to KVM "steal time," but it
> seems to me that it would be disingenuous to ignore "steal time." By
> analogy with HDC, the effective frequency should drop to zero when the
> vCPU is "forced idle."
>
> 2. Host Requirements
>
> The host needs reliable APERF/MPERF access for:
> - Frequency-invariant scheduling
> - Monitoring through /proc/cpuinfo
> - Turbostat, maybe?
>
> Our goal was for host APERFMPERF to work as it always has, counting
> both host cycles and guest cycles. We lose cycles on every WRMSR, but
> most of the time, the loss should be very small relative to the
> measurement.
>
> To be honest, we had not even considered breaking APERF/MPERF on the
> host. We didn't think such an approach would have any chance of
> upstream acceptance.

FWIW, my stance on gifting features to KVM guests is that it's a-ok so long
as it requires an explicit opt-in from the system admin, and that it's
decoupled from KVM. E.g.
add a flag (or Kconfig) to disable APERF/MPERF usage, at which point there's
no good reason to prevent KVM from virtualizing the feature. Unfortunately,
my idea of hiding a feature from the kernel has never panned out, because
apparently there's no feature that Linux can't squeeze some amount of
usefulness out of. :-)

> 3. Design Choices
>
> We evaluated three approaches:
>
> a) Userspace virtualization via MSR filtering
>
> This approach was implemented before we knew about
> frequency-invariant scheduling. Because of the frequent guest
> reads, we observed a 10-16% performance hit, depending on vCPU
> count. The performance impact was exacerbated by contention for a
> legacy PIC mutex on KVM_RUN, but even if the mutex were replaced
> with a reader/writer lock, the performance impact would still be too
> high. Hence, we abandoned this approach.
>
> b) KVM intercepts RDMSR of APERF/MPERF
>
> This approach was ruled out by a back-of-the-envelope
> calculation. We're not going to be able to provide this feature for
> free, but we could argue that 0.01% overhead is negligible. On a 2
> GHz processor, that gives us a budget of 200,000 cycles per
> second. With a 250 Hz guest tick generating 500 RDMSR intercepts
> per second, we have a budget of just 400 cycles per
> intercept. That's likely to be insufficient for most platforms. A
> guest with CONFIG_HZ_1000 would drop the budget to just 100 cycles
> per intercept. That's unachievable.

I think we'd actually have a bit more headroom. The overhead would be
relative to bare metal, not absolute. RDMSR is typically ~80 cycles, so
even if we are super duper strict in how that 0.01% overhead is accounted,
KVM would have more like 150+ cycles? But I'm mostly just being pedantic;
I'm pretty sure AMD CPUs can't achieve 400-cycle roundtrips, i.e. hardware
alone would exhaust the budget.

> We should have a discussion about just how much overhead is
> negligible, and that may open the door to other implementation
> options.
> c) Current RDMSR pass-through approach
>
> The biggest downside is the loss of cycles on every WRMSR. An NMI
> or SMI in the critical region could result in millions of lost
> cycles. However, the damage only persists until all in-progress
> measurements are completed.

FWIW, the NMI problem is solvable, e.g. by bumping a sequence counter if
the CPU takes an NMI in the critical section, and then retrying until there
are no NMIs (or maybe retrying a very limited number of times to avoid
creating a set of problems that could be worse than the loss in accuracy).

> We had considered context-switching host and guest values on
> VM-entry and VM-exit. This would have kept everything within KVM,
> as long as the host doesn't access the MSRs during an NMI or
> SMI. However, 4 additional RDMSRs and 4 additional WRMSRs on a
> VM-enter/VM-exit round-trip would have blown the budget. Even
> without APERFMPERF, an active guest vCPU takes a minimum of two
> VM-exits per timer tick, so we have even less budget per
> VM-enter/VM-exit round-trip than we had per RDMSR intercept in (b).
>
> Internally, we have already moved the mediated vPMU context switch
> from VM-entry/VM-exit to the KVM_RUN loop boundaries, so it seemed
> natural to do the same for APERFMPERF. I don't have a
> back-of-the-envelope calculation for this overhead, but I have run
> Virtuozzo's cpuid_rate benchmark in a guest with and without
> APERFMPERF, 100 times for each configuration, and a Student's
> t-test showed that there is no statistically significant difference
> between the means of the two datasets.
>
> 4. APERF/MPERF Accounting
>
> Virtual MPERF cycles are easy to define. They accumulate at the
> virtual TSC frequency as long as the vCPU is in C0. There are only
> a few ways the vCPU can leave C0. If HLT or MWAIT exiting is
> disabled, then the vCPU can leave C0 in VMX non-root operation (or
> AMD guest mode).
> If HLT exiting is not disabled, then the vCPU will leave C0 when a
> HLT instruction is intercepted, and it will re-enter C0 when it
> receives an interrupt (or a PV kick) and starts running again.
>
> Virtual APERF cycles are more ambiguous, especially in VMX root
> operation (or AMD host mode). I think we can all agree that they
> should accumulate at some non-zero rate as long as the code being
> executed on the logical processor contributes in some way to guest
> vCPU progress, but should the virtual APERF accumulate cycles at
> the same frequency as the physical APERF? Probably not. Ultimately,
> the decision was pragmatic. Virtual APERF accumulates at the same
> rate as physical APERF while the guest context is live in the
> MSR. Doing anything else would have been too expensive.

Hmm, I'm ok stopping virtual APERF while the vCPU task is in userspace, and
the more I poke at it, the more I agree it's the only sane approach.
However, I most definitely want to document the various gotchas with the
alternative.

At first glance, keeping KVM's preempt notifier registered on exits to
userspace would be very doable, but there are lurking complexities that
make it very unpalatable when digging deeper. E.g. handling the case where
userspace invokes KVM_RUN on a different task+CPU would likely require a
per-CPU spinlock, which is all kinds of gross. And userspace would need a
way to disassociate a task from a vCPU.

Maybe this would be a good candidate for Paolo's idea of using the merge
commit to capture information that doesn't belong in Documentation, but
that is too specific/detailed for a single commit's changelog.

> 5. Live Migration
>
> The IA32_MPERF MSR is serialized independently of the
> IA32_TIME_STAMP_COUNTER MSR. Yes, this means that the two MSRs do
> not advance in lock step across live migration, but this is no
> different from a general purpose vPMU counter programmed to count
> "unhalted reference cycles."
> In general, our implementation of
> guest IA32_MPERF is far superior to the vPMU implementation of
> "unhalted reference cycles."

Aha!  The SDM gives us an out:

  Only the IA32_APERF/IA32_MPERF ratio is architecturally defined;
  software should not attach meaning to the content of the individual
  IA32_APERF or IA32_MPERF MSRs.

While the SDM kinda sorta implies that MPERF and TSC will operate in
lock-step, the above gives me confidence that some amount of drift is
tolerable.

Off-list you floated the idea of tying save/restore to TSC as an offset,
but I think that's unnecessary complexity on two fronts. First, the writes
to TSC and MPERF must happen separately, so even if KVM does back-to-back
WRMSRs, some amount of drift is inevitable. Second, because virtual TSC
doesn't stop on vcpu_{load,put}, there will be non-trivial drift
irrespective of migration (and it might even be worse?).

> 6. Guest TSC Scaling
>
> It is not possible to support TSC scaling with IA32_MPERF
> RDMSR pass-through on Intel CPUs, because reads of IA32_MPERF in VMX
> non-root operation are not scaled by the hardware. It is possible
> to support TSC scaling with IA32_MPERF RDMSR pass-through on AMD
> CPUs, but the implementation is left as an exercise for the reader.

So, what's the proposed solution? Either the limitation needs to be
documented as a KVM erratum, or KVM needs to actively prevent APERF/MPERF
virtualization if TSC scaling is in effect. I can't think of a third option
off the top of my head.

I'm not sure how I feel about taking an erratum for this one. The SDM
explicitly states, in multiple places, that MPERF counts at a fixed
frequency, e.g.

  IA32_MPERF MSR (E7H) increments in proportion to a fixed frequency,
  which is configured when the processor is booted.

Drift between TSC and MPERF is one thing; having MPERF suddenly count at a
different frequency is problematic on a different level.