Here is the sordid history behind the proposed APERFMPERF implementation.

First of all, I have never considered this feature for only "slice-of-hardware" VMs, for two reasons:
(1) The original feature request was opened in 2015, six months before I joined Google, and long before Google Cloud had anything remotely resembling a slice-of-hardware offering.
(2) One could argue that Google Cloud still has no slice-of-hardware offering today.
Hence, an implementation that only works for "slice-of-hardware" is essentially a science fair project. We might learn a lot, but there is no ROI.

I dragged my feet on this for a long time, because:
(1) Without actual guest C-state control, it seemed essentially pointless (though I probably didn't give sufficient weight to the warm fuzzy feeling it might give customers).
(2) It's one of those things that's impossible to virtualize with precision, and I can be a real pedant sometimes.
(3) I didn't want to expose a power side-channel that could be used to spy on other tenants.

In 2019, Google Cloud launched the C2 VM family, with MWAIT-exiting disabled for whole-socket shapes. Though MWAIT hints aren't full C-state control, and we still had the 1 kHz host timer tick that would probably keep the guest out of deep C-states, my first objection started to collapse. As I softened in my old age, the second objection seemed indefensible, especially after I finally caved on nested posted interrupt processing, which truly is unvirtualizable. But, especially after the whole Meltdown/Spectre debacle, I was holding firm to my third objection, despite counter-arguments that the same information could be obtained without APERFMPERF. I guess I'm still smarting from being proven completely wrong about RowHammer.

Finally, in December 2021, I thought I had a reasonable solution: we could implement APERFMPERF in userspace, and the low fidelity would make me comfortable enough about my third objection. "How would userspace get this information?" you may ask. Well, Google Cloud has been carrying local patches to log guest {APERF, MPERF, TSC} deltas since Ben Serebrin added them in 2017. Though the design document only stipulated that the MSRs be sampled at KVM_RUN entry and exit, the original code actually sampled at VM-entry and VM-exit as well, rate-limited so that samples were taken at those additional points only if at least 500 microseconds had elapsed since the previous samples. Ben calculated the effective frequency at each sample point to populate a histogram, but that's not really relevant to APERFMPERF virtualization; I mention it only to explain why those VM-entry/VM-exit sample points were there. This code accounted for everything between vcpu_load() and vcpu_put() when accumulating "guest" APERF and MPERF, and this data eventually formed the basis of our userspace implementation of APERFMPERF virtualization.

In 2022, Riley Gamson implemented APERFMPERF virtualization in userspace, using KVM_X86_SET_MSR_FILTER to intercept guest accesses to the MSRs and using Ben's "turbostat" data to derive the values to be returned. The APERF delta could be used as-is, but I was insistent that MPERF needed to track guest TSC cycles while the vCPU was not halted. My reasoning was this:
(1) The specification says so. Okay, it actually says that MPERF "[i]ncrements at fixed interval (relative to TSC freq.) when the logical processor is in C0," but even turbostat makes the architecturally prohibited assumption that MPERF and TSC tick at the same frequency.
(2) It would be disingenuous to claim the effective frequency *while running* for a duty-cycle-limited f1-micro or g2-small VM, or for overcommitted VMs that are forced to timeshare with other tenants.

APERF is admittedly tricky to virtualize. For instance, how many virtual "core clock counts at the coordinated clock frequency" should we count while KVM is emulating CPUID? That's unclear. We're certainly not trying to *emulate* APERF, so the number of cycles the physical CPU would take to execute the instruction isn't relevant. The virtual CPU just happens to take a few thousand cycles to execute CPUID; consider it a quirk. Similarly, in the case of zswap, some memory accesses take a *really* long time. Or consider time spent in KVM as the equivalent of SMM time on physical hardware: SMM cycles are accumulated by APERF. It may seem like a memory access just took 60 *milliseconds*, but most of that time was spent in SMM. (That's taken from a recent real-world example.) As much as I hate SMM, it provides a convenient rug to sweep virtualization holes under.

At this point, I should note that Aaron Lewis eliminated the rate-limited "turbostat" sampling at VM-entry/VM-exit early this year, because it was too expensive. Admittedly, most of the cost was attributed to reading MSR_C6_CORE_RESIDENCY, which Drew Schmitt had added to Ben's sampling in 2018. But this did factor into my thinking regarding cost.

The target for the feature was the C3 VM family, which is not "slice-of-hardware," and which, ironically, does not disable MWAIT-exiting even for full-socket shapes (because we realized after launching C2 that that was a huge mistake). However, the userspace approach was abandoned before C3 launch because of performance issues. You may laugh, but when we advertised APERFMPERF to Linux guests, we were surprised to find that every vCPU started sampling these MSRs on every timer tick. I still haven't looked into why. I assume it has something to do with a perception of "fairness" in scheduling, and I just hope that it doesn't give power-hungry instruction mixes like AVX-512 and AMX an even greater fraction of CPU time because their effective frequency is low.

In any case, we were seeing 10% to 16% performance degradation when APERFMPERF was advertised to Linux guests, and that was a non-starter. If that seems excessive, it is. A large part of it was due to contention on an unfortunate exclusive lock on the legacy PIC that our userspace grabs and releases for each KVM_RUN ioctl. That could be fixed with a reader/writer lock, but the point is that we were seeing KVM exits at a much higher rate than ever before. I accept full responsibility for this debacle. I thought maybe these MSRs would get sampled once a second while running turbostat; I had no idea that the Linux kernel was so enamored of them. A back-of-the-envelope calculation based on a 250 Hz guest tick and 50,000 cycles per KVM exit (250 ticks/sec x 2 MSR reads/tick x 50,000 cycles = 25 million cycles per second per vCPU, or roughly 1% of a 2.5 GHz core) shows that this implementation was always going to cost 1% or more in guest performance. We certainly couldn't enable it by default, but maybe we could enable it for the specific customers who had been clamoring for the feature. However, when I asked Product Management how much performance customers were willing to trade for this feature, the answer was "none."

Okay. How do we achieve that? The obvious approach is to intercept reads of these MSRs and do some math in KVM, roughly along the lines of the sketch below. I found that really unpalatable, though. For most of our VM families, the dominant source of consistent background VM-exits is the host timer tick.
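To make that strawman concrete, here is a minimal sketch (illustrative only, not what this series does) of what the "math in KVM" could look like, assuming hypothetical per-vCPU fields (load_aperf, guest_aperf, guest_mperf) and the same vcpu_load()/vcpu_put() accounting window described above; maintaining guest_mperf from guest TSC cycles while the vCPU is in C0 is not shown.

    /*
     * Kernel-side sketch; assumes <asm/msr.h> and hypothetical fields in
     * struct kvm_vcpu_arch.  Host APERF cycles are credited to the guest
     * only while the vCPU is loaded, and intercepted RDMSRs just return
     * the accumulated virtual counters.
     */
    static void vaperfmperf_load(struct kvm_vcpu *vcpu)
    {
            rdmsrl(MSR_IA32_APERF, vcpu->arch.load_aperf);
    }

    static void vaperfmperf_put(struct kvm_vcpu *vcpu)
    {
            u64 aperf;

            rdmsrl(MSR_IA32_APERF, aperf);
            vcpu->arch.guest_aperf += aperf - vcpu->arch.load_aperf;
    }

    static bool vaperfmperf_rdmsr(struct kvm_vcpu *vcpu, u32 index, u64 *data)
    {
            switch (index) {
            case MSR_IA32_APERF:
                    *data = vcpu->arch.guest_aperf;
                    return true;
            case MSR_IA32_MPERF:
                    /* Guest TSC cycles accumulated while not halted. */
                    *data = vcpu->arch.guest_mperf;
                    return true;
            }
            return false;
    }

The math is the easy part; the cost is that every guest read of either MSR becomes a VM-exit, on top of the background exits I was just enumerating.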
The second-highest source of background exits is the guest timer tick. With the host timer tick finally going away on the C4 VM family, the guest timer tick now dominates. On Intel parts, where we take advantage of hardware EOI virtualization, we now have two VM-exits per guest timer tick (one for writing the x2APIC initial count MSR, and one for the VMX-preemption timer). I couldn't defend doubling that with intercepted reads of APERF and MPERF.

Well, what about the simple hack of passing through the host values? I had considered this, despite the fact that it would only work for slice-of-hardware. I even coerced Josh Don into "fixing" our scheduler so that it wouldn't allow two vCPU threads (a virtual hyperthread pair) to flip-flop between the hyperthreads of their assigned physical core. However, I eventually dismissed this as (1) too much of a hack, (2) broken with live migration, and (3) disingenuous when multiple tasks are running on the logical processor. Yes, (3) does happen, even with our C4 VM family: during copyless migration, two vCPU threads share a logical processor; during live migration, I believe the live migration threads compete with the vCPU threads; and there is still at least one kworker thread competing for cycles.

Actually writing the guest values into the MSRs was initially abhorrent to me, because of the inherent lossage on every write. But I eventually got over it, and partially assuaged my revulsion by writing the MSRs infrequently. I would much have preferred APERF and MPERF equivalents of IA32_TSC_ADJUST, but I don't have the patience to wait for the CPU vendors. As an aside, just how is AMD's scaling of MPERF by the TSC_RATIO MSR even remotely useful without an offset?

One requirement I refuse to budge on is that host usage of APERFMPERF must continue to work exactly as before, modulo some very small loss of precision. Most of the damage could be contained within KVM, if you're willing to accept the constraint that these MSRs cannot be accessed within an NMI handler (on Intel CPUs), but then you have to swap the guest and host values on every VM-entry/VM-exit. That approach increases both the performance overhead (for which our budget is "none") and the loss of precision relative to the current approach. Given the amount of whining on this list over writing just one MSR on every VM-entry/VM-exit (yes, IA32_SPEC_CTRL, I'm looking at you), I didn't think it would be very popular to add two. And, to be honest, I remembered that rate-limited *reads* of the "turbostat" MSRs were too expensive, but I had forgotten that the real culprit there was the egregiously slow MSR_C6_CORE_RESIDENCY.

I do recall the hullabaloo regarding KVM's usurpation of IA32_TSC_AUX, an affront that went unnoticed until the advent of RDPID. Honestly, that's where I expected pushback on this series. Initially, I tried to keep the changes outwith KVM to the minimum possible, replacing only explicit reads of APERF or MPERF with the new accessor functions. I wasn't going to touch the /dev/cpu/*/msr/* interface; after all, none of the other KVM userspace-return MSRs do anything special there. But then I discovered that turbostat on the host uses that interface, and I really wanted that tool to continue to work as expected. So the rdmsr crosscalls picked up an ugly wart. Frankly, that was the specific patch that I expected to unleash vitriol. As an aside, why does turbostat have to use that interface for its own independent reads of these MSRs, when the kernel is already reading them every scheduler tick?!?
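(For anyone who hasn't poked at it, the msr driver interface that turbostat uses boils down to a pread() on /dev/cpu/<n>/msr at the MSR index. Below is a minimal sketch, not turbostat's actual code, that reads MPERF (0xe7) and APERF (0xe8) for CPU 0 that way and derives an effective frequency over one second; the 2.2 GHz base frequency is an assumption, and error handling is mostly omitted.)

    /* Requires the msr driver (CONFIG_X86_MSR) and root. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    static uint64_t rdmsr0(int fd, off_t msr)
    {
            uint64_t val = 0;

            pread(fd, &val, sizeof(val), msr);      /* offset selects the MSR */
            return val;
    }

    int main(void)
    {
            const double base_hz = 2.2e9;           /* assumed base frequency */
            int fd = open("/dev/cpu/0/msr", O_RDONLY);

            if (fd < 0)
                    return 1;

            uint64_t m0 = rdmsr0(fd, 0xe7), a0 = rdmsr0(fd, 0xe8);
            sleep(1);
            uint64_t m1 = rdmsr0(fd, 0xe7), a1 = rdmsr0(fd, 0xe8);

            /* turbostat-style effective frequency: base * dAPERF / dMPERF */
            printf("effective frequency: %.0f MHz\n",
                   base_hz * (double)(a1 - a0) / (double)(m1 - m0) / 1e6);
            close(fd);
            return 0;
    }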
Sure, for tickless kernels, maybe, but I digress.

Wherever the context switching happens, I contend that there is no "clean" virtualization of APERF. If it comes down to just a question of VM-entry/VM-exit or vcpu_load()/vcpu_put(), we can collect some performance numbers and try to come to a consensus, but if you're fundamentally opposed to virtualizing APERF because it's messy, then I don't see any point in pursuing this further.

Thanks,

--jim