We have detected a significant performance drop in our atomic test, which
measures the rate of CPUID instructions inside an L1 VM on an AMD node.

Investigation led to two mainstream patches which introduced extra event
accounting:

   018d70ffcfec ("KVM: x86: Update vPMCs when retiring branch instructions")
   9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions")

On an AMD Zen 3 CPU these resulted in an immediate 43% drop in the CPUID
rate.

With the latest mainstream kernel the difference is much smaller but still
quite noticeable: 13.4%, and it shows up on AMD CPUs only.

It looks like the iteration over all PMCs in kvm_pmu_trigger_event() is
cheap on Intel and expensive on AMD CPUs.

So the idea behind this patch is to skip the iteration over PMCs entirely
when the PMU is disabled for a VM completely, or when the PMU is enabled
for a VM but there are no active PMCs at all.

Unfortunately
 * current kernel code does not differentiate whether the PMU is globally
   enabled for a VM or not (pmu->version is always 1)
 * AMD CPUs older than Zen 4 do not support PMU v2, and thus an efficient
   check for enabled PMCs is not possible

=> the patch speeds up vmexit for AMD Zen 4 CPUs only, this is sad.
   But the patch does not hurt other CPUs - and this is fortunate!

I have no access to a node with an AMD Zen 4 CPU, so I had to test on an
AMD Zen 3 CPU, and I hope my expectations are right for AMD Zen 4.

I would appreciate it if anyone could run the test on a real AMD Zen 4 node.

AMD performance results:
CPU: AMD Zen 3 (three!): AMD EPYC 7443P 24-Core Processor

 * The test binary is run inside an AlmaLinux 9 VM with its stock kernel
   5.14.0-284.11.1.el9_2.x86_64.
 * The test binary measures the CPUID instruction rate (instructions per
   second).
 * Default VM config (PMU is off, pmu->version is reported as 1).
 * The host runs the kernel under test.

 # for i in 1 2 3 4 5 ; do ./at_cpu_cpuid.pub ; done | \
   awk -e '{print $4;}' | \
   cut -f1 --delimiter='.' | \
   ./avg.sh

Measurements:
1. Host runs the stock latest mainstream kernel, commit 305230142ae0.
2. Host runs the same mainstream kernel + current patch.
3. Host runs the same mainstream kernel + current patch + force
   guest_pmu_is_enabled() to always return "false" using the following
   change:

   -       if (pmu->version >= 2 && !(pmu->global_ctrl & ~pmu->global_ctrl_mask))
   +       if (pmu->version == 1 && !(pmu->global_ctrl & ~pmu->global_ctrl_mask))

 -----------------------------------------
 | Kernels | CPUID rate                  |
 -----------------------------------------
 | 1.      | 1360250                     |
 | 2.      | 1365536 (+ 0.4%)            |
 | 3.      | 1541850 (+13.4%)            |
 -----------------------------------------

Measurement (2) differs from (1) only by some fluctuation: performance does
not increase because the test ran on a Zen 3 CPU, where the fast check for
active PMCs cannot be used.

Measurement (3) shows the performance boost expected on a Zen 4 CPU under
the same test.

The test used:
# cat at_cpu_cpuid.pub.cpp
/*
 * The test executes CPUID instruction in a loop and reports the calls rate.
 */

#include <stdio.h>
#include <time.h>

/* #define CPUID_EAX		0x80000002 */
#define CPUID_EAX		0x29a
#define CPUID_ECX		0

#define TEST_EXEC_SECS		30	// in seconds
#define LOOPS_APPROX_RATE	1000000

static inline void cpuid(unsigned int _eax, unsigned int _ecx)
{
	unsigned int regs[4] = {_eax, 0, _ecx, 0};

	asm __volatile__(
		"cpuid"
		: "=a" (regs[0]), "=b" (regs[1]), "=c" (regs[2]), "=d" (regs[3])
		:  "0" (regs[0]),  "1" (regs[1]),  "2" (regs[2]),  "3" (regs[3])
		: "memory");
}

double cpuid_rate_loops(int loops_num)
{
	int i;
	clock_t start_time, end_time;
	double spent_time, rate;

	start_time = clock();

	for (i = 0; i < loops_num; i++)
		cpuid((unsigned int)CPUID_EAX, (unsigned int)CPUID_ECX);

	end_time = clock();
	spent_time = (double)(end_time - start_time) / CLOCKS_PER_SEC;

	rate = (double)loops_num / spent_time;

	return rate;
}

int main(int argc, char* argv[])
{
	double approx_rate, rate;
	int loops;

	/* First we detect the approximate CPUID rate. */
	approx_rate = cpuid_rate_loops(LOOPS_APPROX_RATE);

	/*
	 * How many loops there should be in order to run the test for
	 * TEST_EXEC_SECS seconds?
	 */
	loops = (int)(approx_rate * TEST_EXEC_SECS);

	/* Get the precise instructions rate. */
	rate = cpuid_rate_loops(loops);

	printf("CPUID instructions rate: %f instructions/second\n", rate);

	return 0;
}

Konstantin Khorenko (1):
  KVM: x86/vPMU: Check PMU is enabled for vCPU before searching for PMC

 arch/x86/kvm/pmu.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

-- 
2.39.3
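
For illustration only, the early-exit condition quoted in measurement (3)
can be modelled in user space roughly as below. This is a sketch, not the
code from the patch: struct pmu_model and pmu_model_may_have_active_pmcs()
are hypothetical names, and the comment on global_ctrl_mask (set bits are
reserved) is an assumption about how that field is used in the quoted
condition.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the kernel's per-vCPU PMU state. */
struct pmu_model {
	uint8_t  version;          /* vPMU version exposed to the guest */
	uint64_t global_ctrl;      /* global counter enable bits */
	uint64_t global_ctrl_mask; /* assumption: set bits are reserved */
};

/*
 * Returns false only when we can cheaply prove that no counter is enabled:
 * the vPMU is v2 or newer and no non-reserved bit is set in global_ctrl.
 * With a v1 vPMU there is no global control register to consult, so the
 * caller still has to walk all PMCs.
 */
static bool pmu_model_may_have_active_pmcs(const struct pmu_model *pmu)
{
	if (pmu->version >= 2 &&
	    !(pmu->global_ctrl & ~pmu->global_ctrl_mask))
		return false;

	return true;
}
```

A caller modelled on kvm_pmu_trigger_event() would return early when this
helper reports false and only otherwise iterate over the PMCs. With
version == 1, as on the Zen 3 host above, the helper always reports true,
which matches the unchanged result of measurement (2).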