On Wed, Dec 18, 2024, Jim Mattson wrote:
> On Fri, Dec 6, 2024 at 8:34 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> As we discussed off-list, it appears that the primary motivation for
> this change was to minimize the crosscalls executed when examining
> /proc/cpuinfo. I don't really think that use case justifies reading
> these MSRs *every scheduler tick*, but I'm admittedly biased.

Heh, yeah, we missed that boat by ~2 years. Or maybe KVM's "slow" emulation
would only have further angered the x86 maintainers :-)

> 1. Guest Requirements
>
> Unlike vPMU, which is primarily a development tool, our customers want
> APERFMPERF enabled on their production VMs, and they are unwilling to
> trade any amount of performance for the feature. They don't want
> frequency-invariant scheduling; they just want to observe the
> effective frequency (possibly via /proc/cpuinfo).
>
> These requests are not limited to slice-of-hardware VMs. No one can
> tell me what customers expect with respect to KVM "steal time," but it
> seems to me that it would be disingenuous to ignore "steal time." By
> analogy with HDC, the effective frequency should drop to zero when the
> vCPU is "forced idle."
>
> 2. Host Requirements
>
> The host needs reliable APERF/MPERF access for:
> - Frequency-invariant scheduling
> - Monitoring through /proc/cpuinfo
> - Turbostat, maybe?
>
> Our goal was for host APERFMPERF to work as it always has, counting
> both host cycles and guest cycles. We lose cycles on every WRMSR, but
> most of the time, the loss should be very small relative to the
> measurement.
>
> To be honest, we had not even considered breaking APERF/MPERF on the
> host. We didn't think such an approach would have any chance of
> upstream acceptance.

FWIW, my stance on gifting features to KVM guests is that it's a-ok so long
as it requires an explicit opt-in from the system admin, and that it's
decoupled from KVM. E.g.
add a flag (or Kconfig) to disable APERF/MPERF usage, at which point there's
no good reason to prevent KVM from virtualizing the feature. Unfortunately,
my idea of hiding a feature from the kernel has never panned out, because
apparently there's no feature that Linux can't squeeze some amount of
usefulness out of. :-)

> 3. Design Choices
>
> We evaluated three approaches:
>
> a) Userspace virtualization via MSR filtering
>
> This approach was implemented before we knew about
> frequency-invariant scheduling. Because of the frequent guest
> reads, we observed a 10-16% performance hit, depending on vCPU
> count. The performance impact was exacerbated by contention for a
> legacy PIC mutex on KVM_RUN, but even if the mutex were replaced
> with a reader/writer lock, the performance impact would still be too
> high. Hence, we abandoned this approach.
>
> b) KVM intercepts RDMSR of APERF/MPERF
>
> This approach was ruled out by a back-of-the-envelope
> calculation. We're not going to be able to provide this feature for
> free, but we could argue that 0.01% overhead is negligible. On a 2
> GHz processor, that gives us a budget of 200,000 cycles per
> second. With a 250 Hz guest tick generating 500 RDMSR intercepts
> per second, we have a budget of just 400 cycles per
> intercept. That's likely to be insufficient for most platforms. A
> guest with CONFIG_HZ_1000 would drop the budget to just 100 cycles
> per intercept. That's unachievable.

I think we'd actually have a bit more headroom. The overhead would be
relative to bare metal, not absolute. RDMSR is typically ~80 cycles, so
even if we are super duper strict in how that 0.01% overhead is accounted,
KVM would have more like 150+ cycles? But I'm mostly just being pedantic;
I'm pretty sure AMD CPUs can't achieve 400-cycle roundtrips, i.e. hardware
alone would exhaust the budget.

> We should have a discussion about just how much overhead is
> negligible, and that may open the door to other implementation
> options.
> c) Current RDMSR pass-through approach
>
> The biggest downside is the loss of cycles on every WRMSR. An NMI
> or SMI in the critical region could result in millions of lost
> cycles. However, the damage only persists until all in-progress
> measurements are completed.

FWIW, the NMI problem is solvable, e.g. by bumping a sequence counter if
the CPU takes an NMI in the critical section, and then retrying until there
are no NMIs (or maybe retrying a very limited number of times to avoid
creating a set of problems that could be worse than the loss in accuracy).

> We had considered context-switching host and guest values on
> VM-entry and VM-exit. This would have kept everything within KVM,
> as long as the host doesn't access the MSRs during an NMI or
> SMI. However, 4 additional RDMSRs and 4 additional WRMSRs on a
> VM-enter/VM-exit round-trip would have blown the budget. Even
> without APERFMPERF, an active guest vCPU takes a minimum of two
> VM-exits per timer tick, so we have even less budget per
> VM-enter/VM-exit round-trip than we had per RDMSR intercept in (b).
>
> Internally, we have already moved the mediated vPMU context switch
> from VM-entry/VM-exit to the KVM_RUN loop boundaries, so it seemed
> natural to do the same for APERFMPERF. I don't have a
> back-of-the-envelope calculation for this overhead, but I have run
> Virtuozzo's cpuid_rate benchmark in a guest with and without
> APERFMPERF, 100 times for each configuration, and a Student's
> t-test showed that there is no statistically significant difference
> between the means of the two datasets.
>
> 4. APERF/MPERF Accounting
>
> Virtual MPERF cycles are easy to define. They accumulate at the
> virtual TSC frequency as long as the vCPU is in C0. There are only
> a few ways the vCPU can leave C0. If HLT or MWAIT exiting is
> disabled, then the vCPU can leave C0 in VMX non-root operation (or
> AMD guest mode).
> If HLT exiting is not disabled, then the vCPU will leave C0 when a
> HLT instruction is intercepted, and it will re-enter C0 when it
> receives an interrupt (or a PV kick) and starts running again.
>
> Virtual APERF cycles are more ambiguous, especially in VMX root
> operation (or AMD host mode). I think we can all agree that they
> should accumulate at some non-zero rate as long as the code being
> executed on the logical processor contributes in some way to guest
> vCPU progress, but should the virtual APERF accumulate cycles at
> the same frequency as the physical APERF? Probably not. Ultimately,
> the decision was pragmatic. Virtual APERF accumulates at the same
> rate as physical APERF while the guest context is live in the
> MSR. Doing anything else would have been too expensive.

Hmm, I'm ok stopping virtual APERF while the vCPU task is in userspace, and
the more I poke at it, the more I agree it's the only sane approach.
However, I most definitely want to document the various gotchas with the
alternative.

At first glance, keeping KVM's preempt notifier registered on exits to
userspace would be very doable, but there are lurking complexities that
make it very unpalatable when digging deeper. E.g. handling the case where
userspace invokes KVM_RUN on a different task+CPU would likely require a
per-CPU spinlock, which is all kinds of gross. And userspace would need a
way to disassociate a task from a vCPU.

Maybe this would be a good candidate for Paolo's idea of using the merge
commit to capture information that doesn't belong in Documentation, but
that is too specific/detailed for a single commit's changelog.

> 5. Live Migration
>
> The IA32_MPERF MSR is serialized independently of the
> IA32_TIME_STAMP_COUNTER MSR. Yes, this means that the two MSRs do
> not advance in lock step across live migration, but this is no
> different from a general purpose vPMU counter programmed to count
> "unhalted reference cycles."
> In general, our implementation of
> guest IA32_MPERF is far superior to the vPMU implementation of
> "unhalted reference cycles."

Aha!  The SDM gives us an out:

  Only the IA32_APERF/IA32_MPERF ratio is architecturally defined;
  software should not attach meaning to the content of the individual
  IA32_APERF or IA32_MPERF MSRs.

While the SDM kinda sorta implies that MPERF and TSC will operate in
lock-step, the above gives me confidence that some amount of drift is
tolerable.

Off-list you floated the idea of tying save/restore to TSC as an offset,
but I think that's unnecessary complexity on two fronts. First, the writes
to TSC and MPERF must happen separately, so even if KVM does back-to-back
WRMSRs, some amount of drift is inevitable. Second, because virtual TSC
doesn't stop on vcpu_{load,put}, there will be non-trivial drift
irrespective of migration (and it might even be worse?).

> 6. Guest TSC Scaling
>
> It is not possible to support TSC scaling with IA32_MPERF
> RDMSR pass-through on Intel CPUs, because reads of IA32_MPERF in VMX
> non-root operation are not scaled by the hardware. It is possible
> to support TSC scaling with IA32_MPERF RDMSR pass-through on AMD
> CPUs, but the implementation is left as an exercise for the reader.

So, what's the proposed solution? Either the limitation needs to be
documented as a KVM erratum, or KVM needs to actively prevent APERF/MPERF
virtualization if TSC scaling is in effect. I can't think of a third option
off the top of my head.

I'm not sure how I feel about taking an erratum for this one. The SDM
explicitly states, in multiple places, that MPERF counts at a fixed
frequency, e.g.

  IA32_MPERF MSR (E7H) increments in proportion to a fixed frequency,
  which is configured when the processor is booted.

Drift between TSC and MPERF is one thing; having MPERF suddenly count at a
different frequency is problematic on a different level.