Re: [linux-next:master] [KVM] 7803339fa9: kernel-selftests.kvm.pmu_counters_test.fail

On 1/15/2025 3:47 AM, Sean Christopherson wrote:
> +Dapeng
>
> On Tue, Jan 14, 2025, kernel test robot wrote:
>> we found the test fails on a Cooper Lake, not sure if this is expected.
>> full report below FYI.
>>
>>
>> kernel test robot noticed "kernel-selftests.kvm.pmu_counters_test.fail" on:
>>
>> commit: 7803339fa929387bbc66479532afbaf8cbebb41b ("KVM: selftests: Use data load to trigger LLC references/misses in Intel PMU")
>> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>>
>> [test failed on linux-next/master 37136bf5c3a6f6b686d74f41837a6406bec6b7bc]
>>
>> in testcase: kernel-selftests
>> version: kernel-selftests-x86_64-7503345ac5f5-1_20241208
>> with following parameters:
>>
>> 	group: kvm
>>
>> config: x86_64-rhel-9.4-kselftests
>> compiler: gcc-12
>> test machine: 224 threads 4 sockets Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz (Cooper Lake) with 192G memory
> *sigh*
>
> This fails on our Skylake and Cascade Lake systems, but I only tested an Emerald
> Rapids.
>
>> # Testing fixed counters, PMU version 0, perf_caps = 2000
>> # Testing arch events, PMU version 1, perf_caps = 0
>> # ==== Test Assertion Failure ====
>> #   x86/pmu_counters_test.c:129: count >= (10 * 4 + 5)
>> #   pid=6278 tid=6278 errno=4 - Interrupted system call
>> #      1	0x0000000000411281: assert_on_unhandled_exception at processor.c:625
>> #      2	0x00000000004075d4: _vcpu_run at kvm_util.c:1652
>> #      3	 (inlined by) vcpu_run at kvm_util.c:1663
>> #      4	0x0000000000402c5e: run_vcpu at pmu_counters_test.c:62
>> #      5	0x0000000000402e4d: test_arch_events at pmu_counters_test.c:315
>> #      6	0x0000000000402663: test_arch_events at pmu_counters_test.c:304
>> #      7	 (inlined by) test_intel_counters at pmu_counters_test.c:609
>> #      8	 (inlined by) main at pmu_counters_test.c:642
>> #      9	0x00007f3b134f9249: ?? ??:0
>> #     10	0x00007f3b134f9304: ?? ??:0
>> #     11	0x0000000000402900: _start at ??:?
>> #   count >= NUM_INSNS_RETIRED
> The failure is on top-down slots.  I modified the assert to actually print the
> count (I'll make sure to post a patch regardless of where this goes), and based
> on the count for failing vs. passing, I'm pretty sure the issue is not the extra
> instruction, but instead is due to changing the target of the CLFLUSH from the
> address of the code to the address of kvm_pmu_version.
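
For reference, a minimal sketch of the kind of debug tweak described above,
using the selftests' __GUEST_ASSERT() so the observed count shows up in the
failure message (the exact message text here is made up, not the posted patch):

	__GUEST_ASSERT(count >= NUM_INSNS_RETIRED,
		       "Expected top-down slots >= %u, got %lu",
		       NUM_INSNS_RETIRED, count);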
>
> However, I think the blame lies with the assertion itself, i.e. with commit
> 4a447b135e45 ("KVM: selftests: Test top-down slots event in x86's pmu_counters_test").
> Either that or top-down slots is broken on the Lakes.
>
> By my rudimentary measurements, tying the number of available slots to the number
> of instructions *retired* is fundamentally flawed.  E.g. on the Lakes (SKX is more
> or less identical to CLX), omitting the CLFLUSHOPT entirely results in *more*
> slots being available throughout the lifetime of the measured section.
>
> My best guess is that flushing the cache line used for the data load causes the
> backend to saturate its slots with prefetching data, and as a result the number
> of slots that are available goes down.
>
>         CLFLUSHOPT .    | CLFLUSHOPT [%m]       | NOP
> CLX     350-100         | 20-60[*]              | 135-150  
> SPR     49000-57000     | 32500-41000           | 6760-6830
>
> [*] CLX had a few outliers in the 200-400 range, but the majority of runs were
>     in the 20-60 range.
>
> Reading through more (and more and more) of the TMA documentation, I don't think
> we can assume anything about the number of available slots, beyond a very basic
> assertion that it's practically impossible for there to never be an available
> slot.  IIUC, retiring an instruction does NOT require an available slot, rather
> it requires the opposite: an occupied slot for the uop(s).

I'm not quite sure about this. IIUC, retiring an instruction may not need a
full cycle, but it does need at least one slot, unless the instruction is
macro-fused. Anyway, let me double-check with our micro-architecture and perf
experts.
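
To spell out the arithmetic both readings hinge on: with an assumed issue
width W, an unhalted interval of C cycles exposes W * C slots in total, so a
"total available slots" reading satisfies count >= NUM_INSNS_RETIRED whenever
IPC <= W (modulo macro-fusion), while an "unoccupied slots" reading can
legitimately approach zero when the backend is saturated. A trivial sketch,
with the width as a placeholder:

	#include <stdint.h>

	#define PIPELINE_WIDTH	4	/* assumed; varies by uarch */

	/* Total TMA slots exposed across an unhalted interval. */
	static inline uint64_t total_slots(uint64_t unhalted_cycles)
	{
		return PIPELINE_WIDTH * unhalted_cycles;
	}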


>
> I'm mildly curious as to why the counts for SPR are orders of magnitude higher
> than CLX (simple accounting differences?), but I don't think it changes anything
> in the test itself.
>
> Unless someone has a better idea, my plan is to post a patch to assert that the
> top-down slots count is non-zero, not that it's >= instructions retired.  E.g.
>
> diff --git a/tools/testing/selftests/kvm/x86/pmu_counters_test.c b/tools/testing/selftests/kvm/x86/pmu_counters_test.c
> index accd7ecd3e5f..21acedcd46cd 100644
> --- a/tools/testing/selftests/kvm/x86/pmu_counters_test.c
> +++ b/tools/testing/selftests/kvm/x86/pmu_counters_test.c
> @@ -123,10 +123,8 @@ static void guest_assert_event_count(uint8_t idx,
>                 fallthrough;
>         case INTEL_ARCH_CPU_CYCLES_INDEX:
>         case INTEL_ARCH_REFERENCE_CYCLES_INDEX:
> -               GUEST_ASSERT_NE(count, 0);
> -               break;
>         case INTEL_ARCH_TOPDOWN_SLOTS_INDEX:
> -               GUEST_ASSERT(count >= NUM_INSNS_RETIRED);
> +               GUEST_ASSERT_NE(count, 0);
>                 break;
>         default:
>                 break;
>
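
Applied, the whole group would then collapse to a single non-zero check, i.e.
something like:

	case INTEL_ARCH_CPU_CYCLES_INDEX:
	case INTEL_ARCH_REFERENCE_CYCLES_INDEX:
	case INTEL_ARCH_TOPDOWN_SLOTS_INDEX:
		GUEST_ASSERT_NE(count, 0);
		break;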



