Re: [linux-next:master] [KVM] 7803339fa9: kernel-selftests.kvm.pmu_counters_test.fail

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



+Dapeng

On Tue, Jan 14, 2025, kernel test robot wrote:
> we fould the test failed on a Cooper Lake, not sure if this is expected.
> below full report FYI.
> 
> 
> kernel test robot noticed "kernel-selftests.kvm.pmu_counters_test.fail" on:
> 
> commit: 7803339fa929387bbc66479532afbaf8cbebb41b ("KVM: selftests: Use data load to trigger LLC references/misses in Intel PMU")
> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> 
> [test failed on linux-next/master 37136bf5c3a6f6b686d74f41837a6406bec6b7bc]
> 
> in testcase: kernel-selftests
> version: kernel-selftests-x86_64-7503345ac5f5-1_20241208
> with following parameters:
> 
> 	group: kvm
> 
> config: x86_64-rhel-9.4-kselftests
> compiler: gcc-12
> test machine: 224 threads 4 sockets Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz (Cooper Lake) with 192G memory

*sigh*

This fails on our Skylake and Cascade Lake systems, but I only tested an Emerald
Rapids.

> # Testing fixed counters, PMU version 0, perf_caps = 2000
> # Testing arch events, PMU version 1, perf_caps = 0
> # ==== Test Assertion Failure ====
> #   x86/pmu_counters_test.c:129: count >= (10 * 4 + 5)
> #   pid=6278 tid=6278 errno=4 - Interrupted system call
> #      1	0x0000000000411281: assert_on_unhandled_exception at processor.c:625
> #      2	0x00000000004075d4: _vcpu_run at kvm_util.c:1652
> #      3	 (inlined by) vcpu_run at kvm_util.c:1663
> #      4	0x0000000000402c5e: run_vcpu at pmu_counters_test.c:62
> #      5	0x0000000000402e4d: test_arch_events at pmu_counters_test.c:315
> #      6	0x0000000000402663: test_arch_events at pmu_counters_test.c:304
> #      7	 (inlined by) test_intel_counters at pmu_counters_test.c:609
> #      8	 (inlined by) main at pmu_counters_test.c:642
> #      9	0x00007f3b134f9249: ?? ??:0
> #     10	0x00007f3b134f9304: ?? ??:0
> #     11	0x0000000000402900: _start at ??:?
> #   count >= NUM_INSNS_RETIRED

The failure is on top-down slots.  I modified the assert to actually print the
count (I'll make sure to post a patch regardless of where this goes), and based
on the count for failing vs. passing, I'm pretty sure the issue is not the extra
instruction, but instead is due to changing the target of the CLFUSH from the
address of the code to the address of kvm_pmu_version.

However, I think the blame lies with the assertion itself, i.e. with commit
4a447b135e45 ("KVM: selftests: Test top-down slots event in x86's pmu_counters_test").
Either that or top-down slots is broken on the Lakes.

By my rudimentary measurements, tying the number of available slots to the number
of instructions *retired* is fundamentally flawed.  E.g. on the Lakes (SKX is more
or less identical to CLX), omitting the CLFLUSHOPT entirely results in *more*
slots being available throughout the lifetime of the measured section.

My best guess is that flushing the cache line use for the data load causes the
backend to saturate its slots with prefetching data, and as a result the number
of slots that are available goes down.

        CLFLUSHOPT .    | CLFLUSHOPT [%m]       | NOP
CLX     350-100         | 20-60[*]              | 135-150  
SPR     49000-57000     | 32500-41000           | 6760-6830

[*] CLX had a few outliers in the 200-400 range, but the majority of runs were
    in the 20-60 range.

Reading through more (and more and more) of the TMA documentation, I don't think
we can assume anything about the number of available slots, beyond a very basic
assertion that it's practically impossible for there to never be an available
slot.  IIUC, retiring an instruction does NOT require an available slot, rather
it requires the opposite: an occupied slot for the uop(s).

I'm mildly curious as to why the counts for SPR are orders of magnitude higher
that CLX (simple accounting differences?), but I don't think it changes anything
in the test itself.

Unless someone has a better idea, my plan is to post a patch to assert that the
top-down slots count is non-zero, not that it's >= instructions retired.  E.g.

diff --git a/tools/testing/selftests/kvm/x86/pmu_counters_test.c b/tools/testing/selftests/kvm/x86/pmu_counters_test.c
index accd7ecd3e5f..21acedcd46cd 100644
--- a/tools/testing/selftests/kvm/x86/pmu_counters_test.c
+++ b/tools/testing/selftests/kvm/x86/pmu_counters_test.c
@@ -123,10 +123,8 @@ static void guest_assert_event_count(uint8_t idx,
                fallthrough;
        case INTEL_ARCH_CPU_CYCLES_INDEX:
        case INTEL_ARCH_REFERENCE_CYCLES_INDEX:
-               GUEST_ASSERT_NE(count, 0);
-               break;
        case INTEL_ARCH_TOPDOWN_SLOTS_INDEX:
-               GUEST_ASSERT(count >= NUM_INSNS_RETIRED);
+               GUEST_ASSERT_NE(count, 0);
                break;
        default:
                break;





[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux