Fix a hilarious/revolting performance regression (relative to older CPU
generations) in xstate_required_size() that pops up due to CPUID _in the
host_ taking 3x-4x longer on Emerald Rapids than Skylake.

The issue rears its head on nested virtualization transitions, as KVM
(unnecessarily) performs runtime CPUID updates, including XSAVE sizes,
multiple times per transition.  And calculating XSAVE sizes, especially
for vCPUs with a decent number of supported XSAVE features and compacted
format support, can add up to thousands of cycles.

To fix the immediate issue, cache the CPUID output at kvm.ko load.  The
information is static for a given CPU, i.e. doesn't need to be re-read
from hardware every time.  That's patch 1, and it eliminates pretty much
all of the meaningful overhead.

Patch 2 is a minor cleanup to try and make the code easier to read.

Patch 3 fixes a wart in CPUID emulation where KVM does a moderately
expensive (though cheap compared to CPUID, lol) MSR lookup that is
likely unnecessary for the vast majority of VMs.

Patches 4 and 5 address the problem of KVM doing runtime CPUID updates
multiple times for each nested VM-Enter and VM-Exit, at least half of
which are completely unnecessary (CPUID is a mandatory intercept on both
Intel and AMD, so ensuring dynamic CPUID bits are up-to-date while
running L2 is pointless).  The idea is fairly simple: lazily do the
CPUID updates by deferring them until something might actually consume
the relevant guest bits.

This applies on top of the cpu_caps overhaul[*], as patches 3-5 would
otherwise conflict, and I didn't want to think about how safe patch 5 is
without the rework.  That said, patch 1, which is the most important and
tagged for stable, applies cleanly on 6.1, 6.6, and 6.12 (and the
backport for 5.15 and earlier shouldn't be too horrific).

Side topic, I can't help but wonder if the CPUID latency on EMR is a CPU
or ucode bug.  For a number of leaves, KVM can emulate CPUID faster than
the CPU can execute the instruction.  I.e. the entire VM-Exit => emulate
=> VM-Enter sequence takes less time than executing CPUID on bare metal.
Which seems absolutely insane.  But, it shouldn't impact guest
performance, so that's someone else's problem, at least for now.

[*] https://lore.kernel.org/all/20241128013424.4096668-1-seanjc@xxxxxxxxxx

Sean Christopherson (5):
  KVM: x86: Cache CPUID.0xD XSTATE offsets+sizes during module init
  KVM: x86: Use for-loop to iterate over XSTATE size entries
  KVM: x86: Apply TSX_CTRL_CPUID_CLEAR if and only if the vCPU has RTM or HLE
  KVM: x86: Query X86_FEATURE_MWAIT iff userspace owns the CPUID feature bit
  KVM: x86: Defer runtime updates of dynamic CPUID bits until CPUID emulation

 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/cpuid.c            | 63 ++++++++++++++++++++++++---------
 arch/x86/kvm/cpuid.h            | 10 +++++-
 arch/x86/kvm/lapic.c            |  2 +-
 arch/x86/kvm/smm.c              |  2 +-
 arch/x86/kvm/svm/sev.c          |  2 +-
 arch/x86/kvm/svm/svm.c          |  2 +-
 arch/x86/kvm/vmx/vmx.c          |  2 +-
 arch/x86/kvm/x86.c              | 22 +++++++++---
 9 files changed, 78 insertions(+), 28 deletions(-)


base-commit: 06a4919e729761be751366716c00fb8c3f51d37e
--
2.47.0.338.g60cca15819-goog
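
For the curious, the gist of the patch 1 approach as a minimal userspace
sketch, not the actual KVM code: snapshot CPUID.0xD's sub-leaves once up
front, then have xstate_required_size() compute sizes purely from the
cache.  The names (xstate_cache, cache_xstate_cpuid) and the
XFEATURE_MAX bound here are illustrative; the size math follows the
architectural CPUID.0xD semantics (EAX = component size, EBX =
non-compacted offset, ECX[1] = 64-byte alignment in compacted format).

  #include <cpuid.h>      /* __cpuid_count(), GCC/Clang builtin header */
  #include <stdint.h>
  #include <stdio.h>

  /* Illustrative bound; the kernel sizes this off its XFEATURE list. */
  #define XFEATURE_MAX 19

  /* CPUID.0xD.<n> output, read once "at module load". */
  static struct { uint32_t eax, ebx, ecx, edx; } xstate_cache[XFEATURE_MAX];

  static void cache_xstate_cpuid(void)
  {
          for (unsigned int i = 0; i < XFEATURE_MAX; i++)
                  __cpuid_count(0xD, i, xstate_cache[i].eax, xstate_cache[i].ebx,
                                xstate_cache[i].ecx, xstate_cache[i].edx);
  }

  /* XSAVE size for @xcr0, without executing CPUID on the hot path. */
  static uint32_t xstate_required_size(uint64_t xcr0, int compacted)
  {
          uint32_t ret = 512 + 64;        /* legacy region + XSAVE header */

          for (unsigned int i = 2; i < XFEATURE_MAX; i++) {
                  uint32_t offset;

                  if (!(xcr0 & (1ULL << i)))
                          continue;
                  if (compacted)
                          /* ECX[1]: component is 64-byte aligned when compacted. */
                          offset = (xstate_cache[i].ecx & 2) ? (ret + 63) & ~63u : ret;
                  else
                          offset = xstate_cache[i].ebx;
                  if (offset + xstate_cache[i].eax > ret)
                          ret = offset + xstate_cache[i].eax;
          }
          return ret;
  }

  int main(void)
  {
          cache_xstate_cpuid();
          printf("XSAVE size for XCR0=0x7: %u\n", xstate_required_size(0x7, 0));
          return 0;
  }

The data is immutable once read, so there's no invalidation to worry
about; the hot path simply never executes CPUID.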
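
And the shape of the patches 4-5 change, again as a sketch rather than
the actual KVM code (the dirty-flag field and helper names below are
made up): paths that can toggle dynamic CPUID bits only mark them stale,
and the expensive recompute runs at CPUID emulation, the one place the
guest can observe the bits given that CPUID is a mandatory intercept.

  #include <stdbool.h>
  #include <stdio.h>

  struct vcpu {
          /* Illustrative flag: are the dynamic CPUID bits stale? */
          bool cpuid_dynamic_bits_dirty;
  };

  /* Expensive: walks CPUID entries, recomputes XSAVE sizes, etc. */
  static void update_dynamic_cpuid_bits(struct vcpu *vcpu)
  {
          (void)vcpu;
          printf("runtime CPUID update\n");
  }

  /*
   * Called from every path that can change a dynamic bit (XSETBV, MSR
   * writes, nested VM-Enter/VM-Exit, ...).  Cheap: no CPUID walk here.
   */
  static void mark_cpuid_dirty(struct vcpu *vcpu)
  {
          vcpu->cpuid_dynamic_bits_dirty = true;
  }

  static void emulate_cpuid(struct vcpu *vcpu)
  {
          if (vcpu->cpuid_dynamic_bits_dirty) {
                  vcpu->cpuid_dynamic_bits_dirty = false;
                  update_dynamic_cpuid_bits(vcpu);
          }
          /* ... resolve and return the leaf as usual ... */
  }

  int main(void)
  {
          struct vcpu v = { 0 };

          /* Multiple "nested transitions" now cost a flag write apiece... */
          mark_cpuid_dirty(&v);
          mark_cpuid_dirty(&v);
          /* ...and the expensive update runs once, on actual consumption. */
          emulate_cpuid(&v);
          return 0;
  }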