On 1/15/2025 10:44 AM, Mi, Dapeng wrote:
> On 1/15/2025 3:47 AM, Sean Christopherson wrote:
>> +Dapeng
>>
>> On Tue, Jan 14, 2025, kernel test robot wrote:
>>> we found the test failed on a Cooper Lake, not sure if this is expected.
>>> below full report FYI.
>>>
>>>
>>> kernel test robot noticed "kernel-selftests.kvm.pmu_counters_test.fail" on:
>>>
>>> commit: 7803339fa929387bbc66479532afbaf8cbebb41b ("KVM: selftests: Use data load to trigger LLC references/misses in Intel PMU")
>>> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>>>
>>> [test failed on linux-next/master 37136bf5c3a6f6b686d74f41837a6406bec6b7bc]
>>>
>>> in testcase: kernel-selftests
>>> version: kernel-selftests-x86_64-7503345ac5f5-1_20241208
>>> with following parameters:
>>>
>>> 	group: kvm
>>>
>>> config: x86_64-rhel-9.4-kselftests
>>> compiler: gcc-12
>>> test machine: 224 threads 4 sockets Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz (Cooper Lake) with 192G memory
>>
>> *sigh*
>>
>> This fails on our Skylake and Cascade Lake systems, but I only tested an Emerald
>> Rapids.
>>
>>> # Testing fixed counters, PMU version 0, perf_caps = 2000
>>> # Testing arch events, PMU version 1, perf_caps = 0
>>> # ==== Test Assertion Failure ====
>>> #   x86/pmu_counters_test.c:129: count >= (10 * 4 + 5)
>>> #   pid=6278 tid=6278 errno=4 - Interrupted system call
>>> #      1	0x0000000000411281: assert_on_unhandled_exception at processor.c:625
>>> #      2	0x00000000004075d4: _vcpu_run at kvm_util.c:1652
>>> #      3	 (inlined by) vcpu_run at kvm_util.c:1663
>>> #      4	0x0000000000402c5e: run_vcpu at pmu_counters_test.c:62
>>> #      5	0x0000000000402e4d: test_arch_events at pmu_counters_test.c:315
>>> #      6	0x0000000000402663: test_arch_events at pmu_counters_test.c:304
>>> #      7	 (inlined by) test_intel_counters at pmu_counters_test.c:609
>>> #      8	 (inlined by) main at pmu_counters_test.c:642
>>> #      9	0x00007f3b134f9249: ?? ??:0
>>> #     10	0x00007f3b134f9304: ?? ??:0
>>> #     11	0x0000000000402900: _start at ??:?
>>> #   count >= NUM_INSNS_RETIRED
>>
>> The failure is on top-down slots.  I modified the assert to actually print the
>> count (I'll make sure to post a patch regardless of where this goes), and based
>> on the count for failing vs. passing, I'm pretty sure the issue is not the extra
>> instruction, but instead is due to changing the target of the CLFLUSH from the
>> address of the code to the address of kvm_pmu_version.
>>
>> However, I think the blame lies with the assertion itself, i.e. with commit
>> 4a447b135e45 ("KVM: selftests: Test top-down slots event in x86's pmu_counters_test").
>> Either that or top-down slots is broken on the Lakes.
>>
>> By my rudimentary measurements, tying the number of available slots to the number
>> of instructions *retired* is fundamentally flawed.  E.g. on the Lakes (SKX is more
>> or less identical to CLX), omitting the CLFLUSHOPT entirely results in *more*
>> slots being available throughout the lifetime of the measured section.
>>
>> My best guess is that flushing the cache line used for the data load causes the
>> backend to saturate its slots with prefetching data, and as a result the number
>> of slots that are available goes down.
>>
>>       CLFLUSHOPT .    | CLFLUSHOPT [%m] | NOP
>>  CLX  350-100         | 20-60[*]        | 135-150
>>  SPR  49000-57000     | 32500-41000     | 6760-6830
>>
>> [*] CLX had a few outliers in the 200-400 range, but the majority of runs were
>>     in the 20-60 range.
>>
>> Reading through more (and more and more) of the TMA documentation, I don't think
>> we can assume anything about the number of available slots, beyond a very basic
>> assertion that it's practically impossible for there to never be an available
>> slot.  IIUC, retiring an instruction does NOT require an available slot, rather
>> it requires the opposite: an occupied slot for the uop(s).
>
> I'm not quite sure about this. IIUC, retiring an instruction may not need a
> cycle, but it needs at least one slot, unless the instruction is macro-fused.
> Anyway, let me double check with our micro-architecture and perf experts.

Hi Sean,

I just double checked with our perf experts, and the understanding that
"retiring an instruction needs at least one slot unless the instruction is
macro-fused" is correct.

The reason for this error is that the architectural top-down slots event was
only introduced with the Ice Lake platform; it is not supported on Skylake and
Cascade Lake. On these earlier platforms, event 0x01a4 is a different event
that counts something other than top-down slots, and the slots event is
instead derived from the 0x3c event.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/events/intel/core.c#n466

I don't understand why the current pmu_counters_test code validates an
architectural event that KVM (HW) doesn't support; it's not reasonable and
could be misleading.

	/*
	 * Force iterating over known arch events regardless of whether or not
	 * KVM/hardware supports a given event.
	 */
	nr_arch_events = max_t(typeof(nr_arch_events), nr_arch_events,
			       NR_INTEL_ARCH_EVENTS);

I will provide a patch to fix this.

BTW, KVM currently doesn't check whether user space sets a valid PMU version
in kvm_check_cpuid(). User space can set a PMU version larger than the maximum
PMU version KVM supports, just as pmu_counters_test currently does. This is
not correct. I originally intended to fix this as part of the mediated vPMU
patch series, but it looks like we can send those patches together with this
fix, so the issue gets fixed earlier.

Thanks,
Dapeng Mi

>
>
>> I'm mildly curious as to why the counts for SPR are orders of magnitude higher
>> than CLX (simple accounting differences?), but I don't think it changes anything
>> in the test itself.
>>
>> Unless someone has a better idea, my plan is to post a patch to assert that the
>> top-down slots count is non-zero, not that it's >= instructions retired.  E.g.
>>
>> diff --git a/tools/testing/selftests/kvm/x86/pmu_counters_test.c b/tools/testing/selftests/kvm/x86/pmu_counters_test.c
>> index accd7ecd3e5f..21acedcd46cd 100644
>> --- a/tools/testing/selftests/kvm/x86/pmu_counters_test.c
>> +++ b/tools/testing/selftests/kvm/x86/pmu_counters_test.c
>> @@ -123,10 +123,8 @@ static void guest_assert_event_count(uint8_t idx,
>>  		fallthrough;
>>  	case INTEL_ARCH_CPU_CYCLES_INDEX:
>>  	case INTEL_ARCH_REFERENCE_CYCLES_INDEX:
>> -		GUEST_ASSERT_NE(count, 0);
>> -		break;
>>  	case INTEL_ARCH_TOPDOWN_SLOTS_INDEX:
>> -		GUEST_ASSERT(count >= NUM_INSNS_RETIRED);
>> +		GUEST_ASSERT_NE(count, 0);
>>  		break;
>>  	default:
>>  		break;
>>