On Fri, Dec 6, 2024 at 8:34 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Wed, Dec 04, 2024, Jim Mattson wrote:
> > Wherever the context-switching happens, I contend that there is no
> > "clean" virtualization of APERF. If it comes down to just a question
> > of VM-entry/VM-exit or vcpu_load()/vcpu_put(), we can collect some
> > performance numbers and try to come to a consensus, but if you're
> > fundamentally opposed to virtualizing APERF, because it's messy, then
> > I don't see any point in pursuing this further.
>
> I'm not fundamentally opposed to virtualizing the feature. My complaints with
> the series are that it doesn't provide sufficient information to make it feasible
> for reviewers to provide useful feedback. The history you provided is a great
> start, but that's still largely just background information. For a feature as
> messy and subjective as APERF/MPERF, I think we need at least the following:
>
> 1. What use cases are being targeted (e.g. because targeting only SoH would
>    allow for a different implementation).
> 2. The exact requirements, especially with respect to host usage. And the
>    motivation behind those requirements.
> 3. The high level design choices, and what, if any, alternatives were considered.
> 4. Basic rules of thumb for what is/isn't accounted in APERF/MPERF, so that it's
>    feasible to actually maintain support long-term.
>
> E.g. does the host need to retain access to APERF/MPERF at all times? If so, why?
> Do we care about host kernel accesses, e.g. in the scheduler, or just userspace
> accesses, e.g. turbostat?
>
> What information is the host intended to see? E.g. should APERF and MPERF stop
> when transitioning to the guest? If not, what are the intended semantics for the
> host's view when running VMs with HLT-exiting disabled? If the host's view of
> APERF and MPERF account guest time, how does that mesh with upcoming mediated PMU,
> where the host is disallowed from observing the guest?
>
> Is there a plan for supporting VMs with a different TSC frequency than the host?
> How will live migration work, without generating too much slop/skew between MPERF
> and GUEST_TSC?
>
> I don't expect the series to answer every possible question upfront, but the RFC
> provided _nothing_, just a "here's what we implemented, please review".

Sean,

Thanks for the thoughtful pushback. You're right: the RFC needs more context about our design choices and rationale.

My response has been delayed because I've been researching the reads of IA32_APERF and IA32_MPERF on every scheduler tick. Thanks for pointing me to commit 1567c3e3467c ("x86, sched: Add support for frequency invariance") off-list. Frequency-invariant scheduling was, in fact, the original source of these RDMSRs. The code in that commit would have allowed us to virtualize APERFMPERF without triggering these frequent MSR reads, because arch_scale_freq_tick() had an early return if arch_scale_freq_invariant() was false, and since KVM does not virtualize MSR_TURBO_RATIO_LIMIT, arch_scale_freq_invariant() would be false in a VM. However, in commit bb6e89df9028 ("x86/aperfmperf: Make parts of the frequency invariance code unconditional"), the early return was weakened to bail out only if cpu_feature_enabled(X86_FEATURE_APERFMPERF) is false. Hence, even if frequency-invariant scheduling is disabled, Linux will read the MSRs on every scheduler tick whenever APERFMPERF is available.
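To make the difference concrete, here is a rough paraphrase of the guard in arch_scale_freq_tick() before and after that change (simplified for illustration; not the exact upstream code):

    /* Roughly the shape introduced by 1567c3e3467c: */
    void arch_scale_freq_tick(void)
    {
            u64 aperf, mperf;

            if (!arch_scale_freq_invariant())
                    return;         /* no MSR reads unless freq invariance is in use */

            rdmsrl(MSR_IA32_APERF, aperf);
            rdmsrl(MSR_IA32_MPERF, mperf);
            /* ... update the per-CPU frequency scale from the deltas ... */
    }

    /* Roughly the shape after bb6e89df9028: */
    void arch_scale_freq_tick(void)
    {
            u64 aperf, mperf;

            if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
                    return;         /* reads happen whenever APERF/MPERF exist at all */

            rdmsrl(MSR_IA32_APERF, aperf);
            rdmsrl(MSR_IA32_MPERF, mperf);
            /* ... samples kept for /proc/cpuinfo and, if enabled, freq invariance ... */
    }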
As we discussed off-list, it appears that the primary motivation for this change was to minimize the crosscalls executed when examining /proc/cpuinfo. I don't really think that use case justifies reading these MSRs *every scheduler tick*, but I'm admittedly biased.

1. Guest Requirements

Unlike vPMU, which is primarily a development tool, our customers want APERFMPERF enabled on their production VMs, and they are unwilling to trade any amount of performance for the feature. They don't want frequency-invariant scheduling; they just want to observe the effective frequency (possibly via /proc/cpuinfo). These requests are not limited to slice-of-hardware VMs.

No one can tell me what customers expect with respect to KVM "steal time," but it seems to me that it would be disingenuous to ignore it. By analogy with HDC, the effective frequency should drop to zero when the vCPU is "forced idle."

2. Host Requirements

The host needs reliable APERF/MPERF access for:
- Frequency-invariant scheduling
- Monitoring through /proc/cpuinfo
- Turbostat, maybe?

Our goal was for host APERFMPERF to work as it always has, counting both host cycles and guest cycles. We lose cycles on every WRMSR, but most of the time, the loss should be very small relative to the measurement. To be honest, we had not even considered breaking APERF/MPERF on the host; we didn't think such an approach would have any chance of upstream acceptance.

3. Design Choices

We evaluated three approaches:

a) Userspace virtualization via MSR filtering

This approach was implemented before we knew about frequency-invariant scheduling. Because of the frequent guest reads, we observed a 10-16% performance hit, depending on vCPU count. The performance impact was exacerbated by contention for a legacy PIC mutex on KVM_RUN, but even if the mutex were replaced with a reader/writer lock, the performance impact would still be too high. Hence, we abandoned this approach.

b) KVM intercepts RDMSR of APERF/MPERF

This approach was ruled out by a back-of-the-envelope calculation (recapped below). We're not going to be able to provide this feature for free, but we could argue that 0.01% overhead is negligible. On a 2 GHz processor, that gives us a budget of 200,000 cycles per second. With a 250 Hz guest tick generating 500 RDMSR intercepts per second, we have a budget of just 400 cycles per intercept. That's likely to be insufficient for most platforms, and a guest with CONFIG_HZ_1000 would drop the budget to just 100 cycles per intercept, which is unachievable. We should have a discussion about just how much overhead is negligible; that may open the door to other implementation options.

c) Current RDMSR pass-through approach

The biggest downside is the loss of cycles on every WRMSR. An NMI or SMI in the critical region could result in millions of lost cycles. However, the damage only persists until all in-progress measurements are completed.

We had considered context-switching the host and guest values on VM-entry and VM-exit. This would have kept everything within KVM, as long as the host doesn't access the MSRs during an NMI or SMI. However, 4 additional RDMSRs and 4 additional WRMSRs on a VM-enter/VM-exit round-trip would have blown the budget. Even without APERFMPERF, an active guest vCPU takes a minimum of two VM-exits per timer tick, so we have even less budget per VM-enter/VM-exit round-trip than we had per RDMSR intercept in (b).
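For reference, the arithmetic behind the budget figures in (b), assuming a 2 GHz guest, a 0.01% overhead target, and one read of each of IA32_APERF and IA32_MPERF per guest tick (my assumption about where the 500 intercepts per second come from):

    2e9 cycles/sec * 0.0001                   = 200,000 cycles/sec of budget
    250 ticks/sec  * 2 RDMSR intercepts/tick  = 500 intercepts/sec    => 400 cycles/intercept
    1000 ticks/sec * 2 RDMSR intercepts/tick  = 2,000 intercepts/sec  => 100 cycles/intercept

By the same arithmetic, in (c) the 200,000 cycles/sec would have to cover 4 RDMSRs and 4 WRMSRs on every VM-enter/VM-exit round-trip, of which there are at least 500 per second for an active vCPU with a 250 Hz guest tick.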
Internally, we have already moved the mediated vPMU context switch from VM-entry/VM-exit to the KVM_RUN loop boundaries, so it seemed natural to do the same for APERFMPERF. I don't have a back-of-the-envelope calculation for this overhead, but I have run Virtuozzo's cpuid_rate benchmark in a guest with and without APERFMPERF, 100 times for each configuration, and a Student's t-test showed no statistically significant difference between the means of the two datasets.

4. APERF/MPERF Accounting

Virtual MPERF cycles are easy to define: they accumulate at the virtual TSC frequency as long as the vCPU is in C0. There are only a few ways the vCPU can leave C0. If HLT or MWAIT exiting is disabled, then the vCPU can leave C0 in VMX non-root operation (or AMD guest mode). If HLT exiting is enabled, then the vCPU will leave C0 when a HLT instruction is intercepted, and it will reenter C0 when it receives an interrupt (or a PV kick) and starts running again.

Virtual APERF cycles are more ambiguous, especially in VMX root operation (or AMD host mode). I think we can all agree that they should accumulate at some non-zero rate as long as the code being executed on the logical processor contributes in some way to guest vCPU progress, but should the virtual APERF accumulate cycles at the same frequency as the physical APERF? Probably not. Ultimately, the decision was pragmatic: virtual APERF accumulates at the same rate as physical APERF while the guest context is live in the MSR. Doing anything else would have been too expensive.

5. Live Migration

The IA32_MPERF MSR is serialized independently of the IA32_TIME_STAMP_COUNTER MSR. Yes, this means that the two MSRs do not advance in lock step across live migration, but this is no different from a general purpose vPMU counter programmed to count "unhalted reference cycles." In general, our implementation of guest IA32_MPERF is far superior to the vPMU implementation of "unhalted reference cycles."

6. Guest TSC Scaling

It is not possible to support TSC scaling with IA32_MPERF RDMSR pass-through on Intel CPUs, because reads of IA32_MPERF in VMX non-root operation are not scaled by the hardware. It is possible to support TSC scaling with IA32_MPERF RDMSR pass-through on AMD CPUs, but the implementation is left as an exercise for the reader.
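To tie (c) and (4) together, here is a minimal, hypothetical sketch of what swapping host and guest APERF/MPERF at the KVM_RUN loop boundaries could look like. The struct and function names are invented for illustration; this is not the code in the series, and it ignores TSC scaling entirely:

    #include <linux/types.h>
    #include <asm/msr.h>

    /*
     * Hypothetical sketch only.  Guest values live in the physical MSRs
     * while the vCPU is inside the KVM_RUN loop; on the way out, the
     * host values are restored and credited with the cycles that elapsed
     * in guest context, so the host view keeps counting both host and
     * guest cycles.
     */
    struct aperfmperf_ctx {
    	u64 guest_aperf, guest_mperf;   /* accumulated guest counts */
    	u64 host_aperf, host_mperf;     /* host snapshot taken at load */
    };

    static void aperfmperf_load_guest(struct aperfmperf_ctx *ctx)
    {
    	rdmsrl(MSR_IA32_APERF, ctx->host_aperf);
    	rdmsrl(MSR_IA32_MPERF, ctx->host_mperf);
    	wrmsrl(MSR_IA32_APERF, ctx->guest_aperf);
    	wrmsrl(MSR_IA32_MPERF, ctx->guest_mperf);
    	/* Cycles elapsing between each RDMSR/WRMSR pair are lost. */
    }

    static void aperfmperf_put_guest(struct aperfmperf_ctx *ctx)
    {
    	u64 aperf, mperf, acnt, mcnt;

    	rdmsrl(MSR_IA32_APERF, aperf);
    	rdmsrl(MSR_IA32_MPERF, mperf);
    	acnt = aperf - ctx->guest_aperf;    /* cycles accumulated while guest context was live */
    	mcnt = mperf - ctx->guest_mperf;
    	ctx->guest_aperf = aperf;
    	ctx->guest_mperf = mperf;
    	/* Credit the guest's cycles to the host view as well. */
    	wrmsrl(MSR_IA32_APERF, ctx->host_aperf + acnt);
    	wrmsrl(MSR_IA32_MPERF, ctx->host_mperf + mcnt);
    }

The WRMSRs are where cycles are lost; an NMI or SMI landing between a RDMSR/WRMSR pair widens that window, which is the downside called out in (c).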