On Tue, Oct 03, 2023, Mingwei Zhang wrote:
> On Mon, Oct 2, 2023 at 5:56 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > The "when" is what's important.  If KVM took a literal interpretation of
> > "exclude guest" for pass-through MSRs, then KVM would context switch all
> > those MSRs twice for every VM-Exit=>VM-Enter roundtrip, even when the
> > VM-Exit isn't a reschedule IRQ to schedule in a different task (or vCPU).
> > The overhead to save all the host/guest MSRs and load all of the
> > guest/host MSRs *twice* for every VM-Exit would be a non-starter.  E.g.
> > simple VM-Exits are completely handled in <1500 cycles, and "fastpath"
> > exits are something like half that.  Switching all the MSRs is likely
> > 1000+ cycles, if not double that.
>
> Hi Sean,
>
> Sorry, I have no intention to interrupt the conversation, but this is
> slightly confusing to me.
>
> I remember that when doing AMX, we added a gigantic 8KB buffer to the FPU
> context switch.  That works well in Linux today.  Why can't we do the same
> for the PMU, i.e. context switch all the counters, selectors and global
> state there?

That's what we (Google folks) are proposing.  However, there are significant
side effects if KVM context switches the PMU outside of vcpu_run(), whereas
the FPU doesn't suffer the same problems.

Keeping the guest FPU resident for the duration of vcpu_run() is, in terms of
functionality, completely transparent to the rest of the kernel.  From the
kernel's perspective, the guest FPU is just a variation of a userspace FPU,
and the kernel is already designed to save/restore userspace/guest FPU state
when the kernel wants to use the FPU for whatever reason.  And crucially,
kernel FPU usage is explicit and contained, e.g. see kernel_fpu_{begin,end}(),
and comes with mechanisms for KVM to detect when the guest FPU needs to be
reloaded (see TIF_NEED_FPU_LOAD); a rough sketch of that pattern is appended
at the end of this mail.

The PMU is a completely different story.  PMU usage, a.k.a. perf, by design is
"always running".  KVM can't transparently stop host usage of the PMU, as
disabling host PMU usage stops perf events from counting/profiling whatever it
is they're supposed to profile.

Today, KVM minimizes the "downtime" of host PMU usage by context switching PMU
state at VM-Enter and VM-Exit, or at least as close to VM-Enter/VM-Exit as
possible, e.g. for LBRs and Intel PT.

What we are proposing would *significantly* increase that downtime, to the
point where it would be almost unbounded in some paths, e.g. if KVM faults in
a page, gup() could go swap in memory from disk, install PTEs, and so on and
so forth.  If the host is trying to profile something related to swap or
memory management, they're out of luck.
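
For reference, a minimal sketch of the "explicit and contained" kernel FPU
usage pattern mentioned above (the helper name and body are made up for
illustration; kernel_fpu_begin()/kernel_fpu_end() and TIF_NEED_FPU_LOAD are
the real mechanisms).  Every in-kernel FPU use is bracketed like this, which
is exactly what gives KVM a well-defined signal that guest FPU state needs to
be reloaded before the next VM-Enter.  Perf has no equivalent bracketing, so
the PMU can't be handled the same way.

  #include <asm/fpu/api.h>

  /* Hypothetical example of an in-kernel SIMD user. */
  static void my_simd_helper(void)
  {
	/*
	 * Saves the current (userspace or guest) FPU state if it is live in
	 * registers, sets TIF_NEED_FPU_LOAD, and disables preemption, so the
	 * kernel's own FPU usage can't corrupt anyone else's state.
	 */
	kernel_fpu_begin();

	/* ... use SSE/AVX registers here ... */

	/*
	 * Closes the bracketed section; the saved userspace/guest state is
	 * reloaded lazily, e.g. KVM checks TIF_NEED_FPU_LOAD before the next
	 * VM-Enter.
	 */
	kernel_fpu_end();
  }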