Forking this off to try and avoid confusion... On Wed, Jun 12, 2024, Reinette Chatre wrote: > +/* > + * Possible TDCR values with matching divide count. Used to modify APIC > + * timer frequency. > + */ > +static struct { > + uint32_t tdcr; > + uint32_t divide_count; > +} tdcrs[] = { > + {0x0, 2}, > + {0x1, 4}, > + {0x2, 8}, > + {0x3, 16}, > + {0x8, 32}, > + {0x9, 64}, > + {0xa, 128}, > + {0xb, 1}, > +}; ... > +static void apic_guest_code(void) > +{ > + uint64_t tsc_hz = (uint64_t)tsc_khz * 1000; > + const uint32_t tmict = ~0u; > + uint64_t tsc0, tsc1, freq; > + uint32_t tmcct; > + int i; > + > + apic_enable(); > + > + /* > + * Setup one-shot timer. The vector does not matter because the > + * interrupt should not fire. > + */ > + apic_write_reg(APIC_LVTT, APIC_LVT_TIMER_ONESHOT | APIC_LVT_MASKED); > + > + for (i = 0; i < ARRAY_SIZE(tdcrs); i++) { > + > + apic_write_reg(APIC_TDCR, tdcrs[i].tdcr); > + apic_write_reg(APIC_TMICT, tmict); > + > + tsc0 = rdtsc(); > + udelay(delay_ms * 1000); > + tmcct = apic_read_reg(APIC_TMCCT); > + tsc1 = rdtsc(); > + > + /* > + * Stop the timer _after_ reading the current, final count, as > + * writing the initial counter also modifies the current count. > + */ > + apic_write_reg(APIC_TMICT, 0); > + > + freq = (tmict - tmcct) * tdcrs[i].divide_count * tsc_hz / (tsc1 - tsc0); > + /* Check if measured frequency is within 1% of configured frequency. */ > + GUEST_ASSERT(freq < apic_hz * 101 / 100); > + GUEST_ASSERT(freq > apic_hz * 99 / 100); > + } This test fails on our SKX, CLX, and ICX systems due to what appears to be a CPU bug. It looks like something APICv related is clobbering internal VMX timer state? Or maybe there's a tearing or truncation issue? As mentioned ad nauseum at this point, I'm offline all of next week, so hopefully there's enough info here to get a root cause... A spurious VM-Exit will occur after programming a vmcs.PREEMPTION_TIMER_VALUE that shouldn't exit. Every observed failure occurs when bits 27:16 are zero, with a small value in bits 15:0, e.g. VM-Enter with a timer value of 0xe0003bf7 or 0xa0006db6 will cause a near-immediate VM-Exit. Weirdly, it doesn't always happen, e.g. I've observed rollover from 0xa000xxxx to 0x9fffxxxx without failure. However, the *test* failure is 100% reproducible, i.e. it's not _that_ rare of a failure. So maybe there's state build-up required? E.g. in the failing cases, there are 10s of entries with a slightly larger timer value, with no preemption timer exit (the host's tick IRQ interrupts the guest, and then KVM reprograms the VMX timer). Even more sketchy, the failure only occurs if APICv is enabled. Turning off APICv makes the problem go away (I initially suspected KVM was somehow botching the TMCCT emulation). I am 99.9% certain there is no KVM APICv bug that is causing problems. Our IVB servers don't fail (even with APICv enabled), nor does my Raptor Lake client box (with APICv enabled). And I confirmed that the VMX timer is still getting programmed with the same sequence that fails when APICv is enabled. (Before I realized the pattern of values), I sanity checked the VMCS field just before VM-Enter, and again after VM-Exit (KVM runs without the CPU save vmcs.PREEMPTION_TIMER_VALUE on exit). I also verified the CPU thinks the timer has expired by enabling "save on exit" and verifying vmcs.PREEMPTION_TIMER_VALUE is indeed '0', and that KVM really did get a PREEMPTION_TIMER exit. Attached is the debug patch I used to get the below data (sort of; one of the post-exit prints is without saving vmcs.PREEMPTION_TIMER_VALUE on exit). In kvm_hypercall (ignore the name, I piggybacked a tracepoint because trying to log to dmesg was too slow, guest literally couldn't make forward progress), a1 is the VMX timer value programmed by KVM (0xe0003bf7). apic_bus_clock_-11419 [056] d..2. 146.771179: kvm_hypercall: nr 0x2c8a9e9cc6703c a0 0x2c8b0e9ce46c37 a1 0xe0003bf7 a2 0xe0003bf7 a3 0x7 apic_bus_clock_-11419 [056] d..2. 146.771242: kvm_exit: vcpu 0 reason PREEMPTION_TIMER rip 0x402065 info1 0x0000000000000000 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000 VMX timer exit, VMCS = e0003bf7, delta = e0003690 and the post-exit print with the attached patch: kvm_intel: VMX timer exit, EXIT_REASON = 34, VMCS = 0, delta = e00037e9 And I've been fiddling with the below hack to coerce KVM into programming VMX timer values. Had I more time, I would have booted kernels with the ability to properly fuzz the timer values. Note, with TMICT=-1, only divide_count=1 will fail, because larger divide counts result in a timeout that doesn't fit into the 32-bit VMX timer field (don't ask me how long it took me to realize the divide count affects the time frequency, not the actual count, *sigh*). diff --git a/tools/testing/selftests/kvm/x86_64/apic_bus_clock_test.c b/tools/testing/selftests/kvm/x86_64/apic_bus_clock_test.c index f8916bb34405..4eb49e20ff9c 100644 --- a/tools/testing/selftests/kvm/x86_64/apic_bus_clock_test.c +++ b/tools/testing/selftests/kvm/x86_64/apic_bus_clock_test.c @@ -22,13 +22,13 @@ static const struct { const uint32_t tdcr; const uint32_t divide_count; } tdcrs[] = { - {0x0, 2}, - {0x1, 4}, - {0x2, 8}, - {0x3, 16}, - {0x8, 32}, - {0x9, 64}, - {0xa, 128}, + // {0x0, 2}, + // {0x1, 4}, + // {0x2, 8}, + // {0x3, 16}, + // {0x8, 32}, + // {0x9, 64}, + // {0xa, 128}, {0xb, 1}, }; @@ -55,10 +55,11 @@ static void apic_write_reg(unsigned int reg, uint32_t val) xapic_write_reg(reg, val); } +uint32_t tmict = ~0u; + static void apic_guest_code(uint64_t apic_hz, uint64_t delay_ms) { uint64_t tsc_hz = guest_tsc_khz * 1000; - const uint32_t tmict = ~0u; uint64_t tsc0, tsc1, freq; uint32_t tmcct; int i; @@ -133,6 +134,7 @@ static void run_apic_bus_clock_test(uint64_t apic_hz, uint64_t delay_ms, vm = vm_create(1); sync_global_to_guest(vm, is_x2apic); + sync_global_to_guest(vm, tmict); vm_enable_cap(vm, KVM_CAP_X86_APIC_BUS_CYCLES_NS, NSEC_PER_SEC / apic_hz); @@ -174,7 +176,7 @@ int main(int argc, char *argv[]) TEST_REQUIRE(kvm_has_cap(KVM_CAP_X86_APIC_BUS_CYCLES_NS)); - while ((opt = getopt(argc, argv, "d:f:h")) != -1) { + while ((opt = getopt(argc, argv, "d:f:i:h")) != -1) { switch (opt) { case 'f': apic_hz = atoi_positive("APIC bus frequency", optarg) * 1000 * 1000; @@ -182,6 +184,9 @@ int main(int argc, char *argv[]) case 'd': delay_ms = atoi_positive("Delay in milliseconds", optarg); break; + case 'i': + tmict = ~0u - atoi_positive("offset", optarg); + break; case 'h': default: help(argv[0]);