On 1/18/2025 1:11 AM, Sean Christopherson wrote:
> On Fri, Jan 17, 2025, Dapeng Mi wrote:
>> On 1/15/2025 10:44 AM, Mi, Dapeng wrote:
>>> On 1/15/2025 3:47 AM, Sean Christopherson wrote:
>>>>> # Testing fixed counters, PMU version 0, perf_caps = 2000
>>>>> # Testing arch events, PMU version 1, perf_caps = 0
>>>>> # ==== Test Assertion Failure ====
>>>>> #   x86/pmu_counters_test.c:129: count >= (10 * 4 + 5)
>>>>> #   pid=6278 tid=6278 errno=4 - Interrupted system call
>>>>> #    1	0x0000000000411281: assert_on_unhandled_exception at processor.c:625
>>>>> #    2	0x00000000004075d4: _vcpu_run at kvm_util.c:1652
>>>>> #    3	 (inlined by) vcpu_run at kvm_util.c:1663
>>>>> #    4	0x0000000000402c5e: run_vcpu at pmu_counters_test.c:62
>>>>> #    5	0x0000000000402e4d: test_arch_events at pmu_counters_test.c:315
>>>>> #    6	0x0000000000402663: test_arch_events at pmu_counters_test.c:304
>>>>> #    7	 (inlined by) test_intel_counters at pmu_counters_test.c:609
>>>>> #    8	 (inlined by) main at pmu_counters_test.c:642
>>>>> #    9	0x00007f3b134f9249: ?? ??:0
>>>>> #   10	0x00007f3b134f9304: ?? ??:0
>>>>> #   11	0x0000000000402900: _start at ??:?
>>>>> # count >= NUM_INSNS_RETIRED
>>>> The failure is on top-down slots. I modified the assert to actually print the
>>>> count (I'll make sure to post a patch regardless of where this goes), and based
>>>> on the count for failing vs. passing, I'm pretty sure the issue is not the extra
>>>> instruction, but instead is due to changing the target of the CLFLUSH from the
>>>> address of the code to the address of kvm_pmu_version.
>>>>
>>>> However, I think the blame lies with the assertion itself, i.e. with commit
>>>> 4a447b135e45 ("KVM: selftests: Test top-down slots event in x86's pmu_counters_test").
>>>> Either that or top-down slots is broken on the Lakes.
>>>>
>>>> By my rudimentary measurements, tying the number of available slots to the number
>>>> of instructions *retired* is fundamentally flawed. E.g. on the Lakes (SKX is more
>>>> or less identical to CLX), omitting the CLFLUSHOPT entirely results in *more*
>>>> slots being available throughout the lifetime of the measured section.
>>>>
>>>> My best guess is that flushing the cache line used for the data load causes the
>>>> backend to saturate its slots with prefetching data, and as a result the number
>>>> of slots that are available goes down.
>>>>
>>>>       CLFLUSHOPT .    | CLFLUSHOPT [%m] | NOP
>>>> CLX   350-100         | 20-60[*]        | 135-150
>>>> SPR   49000-57000     | 32500-41000     | 6760-6830
>>>>
>>>> [*] CLX had a few outliers in the 200-400 range, but the majority of runs were
>>>>     in the 20-60 range.
>>>>
>>>> Reading through more (and more and more) of the TMA documentation, I don't think
>>>> we can assume anything about the number of available slots, beyond a very basic
>>>> assertion that it's practically impossible for there to never be an available
>>>> slot. IIUC, retiring an instruction does NOT require an available slot, rather
>>>> it requires the opposite: an occupied slot for the uop(s).
>>> I'm not quite sure about this. IIUC, retiring an instruction may not need a
>>> cycle, but it needs at least a slot unless the instruction is macro-fused.
>>> Anyway, let me double check with our micro-architecture and perf experts.
>> Hi Sean,
>>
>> Just double-checked with our perf experts: the understanding that "retiring
>> an instruction needs at least a slot unless the instruction is macro-fused"
>> is correct.
>>
>> The reason for this error is that the architectural topdown slots event was
>> only introduced with the Ice Lake platform; it's not supported on the Skylake
>> and Cascade Lake platforms. On these earlier platforms, event 0x01a4 is a
>> different event which counts something other than topdown slots, and the
>> slots event is instead derived from the 0x3c event.
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/events/intel/core.c#n466
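
(Adding a bit of detail here for anyone reproducing this: whether an architectural
event is enumerated can be read straight from CPUID.0AH, where EAX[31:24] gives the
length of the EBX bit vector and a set EBX bit means the event is *not* available.
The snippet below is just a rough user-space sketch for illustration, not something
from the selftest, and the macro name is made up.)

#include <cpuid.h>
#include <stdio.h>

/* Index of the architectural top-down slots event in the CPUID.0AH EBX bit vector. */
#define ARCH_EVENT_TOPDOWN_SLOTS_IDX	7

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid_count(0xa, 0, &eax, &ebx, &ecx, &edx))
		return 1;

	unsigned int version = eax & 0xff;
	unsigned int vec_len = (eax >> 24) & 0xff;

	/*
	 * An event is enumerated iff its index is covered by the EAX[31:24]
	 * bit-vector length and its EBX bit is clear (set bit == unavailable).
	 */
	int enumerated = ARCH_EVENT_TOPDOWN_SLOTS_IDX < vec_len &&
			 !(ebx & (1u << ARCH_EVENT_TOPDOWN_SLOTS_IDX));

	printf("arch PMU version %u, top-down slots %s enumerated\n",
	       version, enumerated ? "is" : "is NOT");
	return 0;
}

On Ice Lake and later this should report the event as enumerated; on Skylake/Cascade
Lake the EBX bit vector shouldn't even cover index 7, even though programming event
select 0xa4 / umask 0x1 on a GP counter is still legal, which is exactly what the
test wants to keep exercising.
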
>>
>>
>> I don't understand why the current pmu_counters_test code validates an
>> architectural event which KVM (HW) doesn't support; it's not reasonable and
>> could be misleading.
>>
>> 	/*
>> 	 * Force iterating over known arch events regardless of whether or not
>> 	 * KVM/hardware supports a given event.
>> 	 */
>> 	nr_arch_events = max_t(typeof(nr_arch_events), nr_arch_events,
>> 			       NR_INTEL_ARCH_EVENTS);
> /facepalm
>
> That's hilariously obvious in hindsight.
>
>> I would provide a patch to fix this.
> Testing "unsupported" arch events is intentional. The idea is to validate that
> KVM programs the requested event selector irrespective of whether or not the
> architectural event is *enumerated* to the guest (old KVM incorrectly "filtered"
> such events).
>
> The flaw in the test is that it doesn't check if the architectural event is
> supported in *hardware* when validating the count. But it's still desirable to
> program the event selector in that case, i.e. only the validation of the count
> should be skipped. Diff at the bottom to address this (needs to be spread
> over multiple patches).

I see. Thanks for explaining the history.

>
>> BTW, currently KVM doesn't check whether user space sets a valid PMU version
>> in kvm_check_cpuid(). User space could set a PMU version which is larger than
>> KVM's maximum supported PMU version, just like pmu_counters_test currently
>> does. This is not correct.
> It's "correct" in the sense that KVM typically doesn't restrict what userspace
> enumerates to the guest through CPUID. KVM needs to protect the host against
> doing bad things based on a funky guest CPUID, e.g. KVM needs t
>
>> I originally intended to fix this with the mediated vPMU patch series, but it
>> looks like we can send those patches together with this fix, so the issue can
>> be fixed earlier.
> I suspect that if there's a flaw in KVM, it only affects the mediated PMU. Because
> perf manages MSRs/hardware with the current PMU, advertising a bogus PMU version
> to the guest is "fine" because even if KVM thinks it's legal for the *guest* to
> write MSRs that don't exist in hardware, KVM/perf will never try to propagate the
> guest values to non-existent hardware.

Hmm, yeah, I think you're right. In theory, it should only impact the mediated
vPMU, since the perf-based vPMU doesn't manipulate HW directly. Thanks.

>
> diff --git a/tools/testing/selftests/kvm/x86/pmu_counters_test.c b/tools/testing/selftests/kvm/x86/pmu_counters_test.c
> index accd7ecd3e5f..124051ea50be 100644
> --- a/tools/testing/selftests/kvm/x86/pmu_counters_test.c
> +++ b/tools/testing/selftests/kvm/x86/pmu_counters_test.c
> @@ -29,10 +29,60 @@
>  /* Total number of instructions retired within the measured section. */
>  #define NUM_INSNS_RETIRED	(NUM_LOOPS * NUM_INSNS_PER_LOOP + NUM_EXTRA_INSNS)
>  
> +/* Track which architectural events are supported by hardware. */
> +static uint32_t hardware_pmu_arch_events;
>  
>  static uint8_t kvm_pmu_version;
>  static bool kvm_has_perf_caps;
>  
> +
> +#define X86_PMU_FEATURE_NULL						\
> +({									\
> +	struct kvm_x86_pmu_feature feature = {};			\
> +									\
> +	feature;							\
> +})
> +
> +static bool pmu_is_null_feature(struct kvm_x86_pmu_feature event)
> +{
> +	return !(*(u64 *)&event);
> +}
> +
> +struct kvm_intel_pmu_event {
> +	struct kvm_x86_pmu_feature gp_event;
> +	struct kvm_x86_pmu_feature fixed_event;
> +};
> +
> +/*
> + * Wrap the array to appease the compiler, as the macros used to construct each
> + * kvm_x86_pmu_feature use syntax that's only valid in function scope, and the
> + * compiler often thinks the feature definitions aren't compile-time constants.
> + */
> +static struct kvm_intel_pmu_event intel_event_to_feature(uint8_t idx)
> +{
> +	const struct kvm_intel_pmu_event __intel_event_to_feature[] = {
> +		[INTEL_ARCH_CPU_CYCLES_INDEX]		 = { X86_PMU_FEATURE_CPU_CYCLES, X86_PMU_FEATURE_CPU_CYCLES_FIXED },
> +		[INTEL_ARCH_INSTRUCTIONS_RETIRED_INDEX]	 = { X86_PMU_FEATURE_INSNS_RETIRED, X86_PMU_FEATURE_INSNS_RETIRED_FIXED },
> +		/*
> +		 * Note, the fixed counter for reference cycles is NOT the same as the
> +		 * general purpose architectural event. The fixed counter explicitly
> +		 * counts at the same frequency as the TSC, whereas the GP event counts
> +		 * at a fixed, but uarch specific, frequency. Bundle them here for
> +		 * simplicity.
> +		 */
> +		[INTEL_ARCH_REFERENCE_CYCLES_INDEX]	 = { X86_PMU_FEATURE_REFERENCE_CYCLES, X86_PMU_FEATURE_REFERENCE_TSC_CYCLES_FIXED },
> +		[INTEL_ARCH_LLC_REFERENCES_INDEX]	 = { X86_PMU_FEATURE_LLC_REFERENCES, X86_PMU_FEATURE_NULL },
> +		[INTEL_ARCH_LLC_MISSES_INDEX]		 = { X86_PMU_FEATURE_LLC_MISSES, X86_PMU_FEATURE_NULL },
> +		[INTEL_ARCH_BRANCHES_RETIRED_INDEX]	 = { X86_PMU_FEATURE_BRANCH_INSNS_RETIRED, X86_PMU_FEATURE_NULL },
> +		[INTEL_ARCH_BRANCHES_MISPREDICTED_INDEX] = { X86_PMU_FEATURE_BRANCHES_MISPREDICTED, X86_PMU_FEATURE_NULL },
> +		[INTEL_ARCH_TOPDOWN_SLOTS_INDEX]	 = { X86_PMU_FEATURE_TOPDOWN_SLOTS, X86_PMU_FEATURE_TOPDOWN_SLOTS_FIXED },
> +	};
> +
> +	kvm_static_assert(ARRAY_SIZE(__intel_event_to_feature) == NR_INTEL_ARCH_EVENTS);
> +
> +	return __intel_event_to_feature[idx];
> +}
> +
>  static struct kvm_vm *pmu_vm_create_with_one_vcpu(struct kvm_vcpu **vcpu,
>  						  void *guest_code,
>  						  uint8_t pmu_version,
> @@ -42,6 +92,7 @@ static struct kvm_vm *pmu_vm_create_with_one_vcpu(struct kvm_vcpu **vcpu,
>  
>  	vm = vm_create_with_one_vcpu(vcpu, guest_code);
>  	sync_global_to_guest(vm, kvm_pmu_version);
> +	sync_global_to_guest(vm, hardware_pmu_arch_events);
>  
>  	/*
>  	 * Set PERF_CAPABILITIES before PMU version as KVM disallows enabling
> @@ -98,14 +149,12 @@ static uint8_t guest_get_pmu_version(void)
>   * Sanity check that in all cases, the event doesn't count when it's disabled,
>   * and that KVM correctly emulates the write of an arbitrary value.
>   */
> -static void guest_assert_event_count(uint8_t idx,
> -				     struct kvm_x86_pmu_feature event,
> -				     uint32_t pmc, uint32_t pmc_msr)
> +static void guest_assert_event_count(uint8_t idx, uint32_t pmc, uint32_t pmc_msr)
>  {
>  	uint64_t count;
>  
>  	count = _rdpmc(pmc);
> -	if (!this_pmu_has(event))
> +	if (!(hardware_pmu_arch_events & BIT(idx)))
>  		goto sanity_checks;
>  
>  	switch (idx) {
> @@ -126,7 +175,9 @@ static void guest_assert_event_count(uint8_t idx,
>  		GUEST_ASSERT_NE(count, 0);
>  		break;
>  	case INTEL_ARCH_TOPDOWN_SLOTS_INDEX:
> -		GUEST_ASSERT(count >= NUM_INSNS_RETIRED);
> +		__GUEST_ASSERT(count < NUM_INSNS_RETIRED,

Shouldn't this be "__GUEST_ASSERT(count >= NUM_INSNS_RETIRED," ?

> +			       "Expected top-down slots >= %u, got count = %lu",
> +			       NUM_INSNS_RETIRED, count);
>  		break;
>  	default:
>  		break;
> @@ -173,7 +224,7 @@ do {										\
>  	);									\
>  } while (0)
>  
> -#define GUEST_TEST_EVENT(_idx, _event, _pmc, _pmc_msr, _ctrl_msr, _value, FEP)	\
> +#define GUEST_TEST_EVENT(_idx, _pmc, _pmc_msr, _ctrl_msr, _value, FEP)		\
>  do {										\
>  	wrmsr(_pmc_msr, 0);							\
>  										\
> @@ -184,54 +235,20 @@ do {										\
>  	else									\
>  		GUEST_MEASURE_EVENT(_ctrl_msr, _value, "nop", FEP);		\
>  										\
> -	guest_assert_event_count(_idx, _event, _pmc, _pmc_msr);		\
> +	guest_assert_event_count(_idx, _pmc, _pmc_msr);				\
>  } while (0)
>  
> -static void __guest_test_arch_event(uint8_t idx, struct kvm_x86_pmu_feature event,
> -				    uint32_t pmc, uint32_t pmc_msr,
> +static void __guest_test_arch_event(uint8_t idx, uint32_t pmc, uint32_t pmc_msr,
>  				    uint32_t ctrl_msr, uint64_t ctrl_msr_value)
>  {
> -	GUEST_TEST_EVENT(idx, event, pmc, pmc_msr, ctrl_msr, ctrl_msr_value, "");
> +	GUEST_TEST_EVENT(idx, pmc, pmc_msr, ctrl_msr, ctrl_msr_value, "");
>  
>  	if (is_forced_emulation_enabled)
> -		GUEST_TEST_EVENT(idx, event, pmc, pmc_msr, ctrl_msr, ctrl_msr_value, KVM_FEP);
> -}
> -
> -#define X86_PMU_FEATURE_NULL						\
> -({									\
> -	struct kvm_x86_pmu_feature feature = {};			\
> -									\
> -	feature;							\
> -})
> -
> -static bool pmu_is_null_feature(struct kvm_x86_pmu_feature event)
> -{
> -	return !(*(u64 *)&event);
> +		GUEST_TEST_EVENT(idx, pmc, pmc_msr, ctrl_msr, ctrl_msr_value, KVM_FEP);
>  }
>  
>  static void guest_test_arch_event(uint8_t idx)
>  {
> -	const struct {
> -		struct kvm_x86_pmu_feature gp_event;
> -		struct kvm_x86_pmu_feature fixed_event;
> -	} intel_event_to_feature[] = {
> -		[INTEL_ARCH_CPU_CYCLES_INDEX]		 = { X86_PMU_FEATURE_CPU_CYCLES, X86_PMU_FEATURE_CPU_CYCLES_FIXED },
> -		[INTEL_ARCH_INSTRUCTIONS_RETIRED_INDEX]	 = { X86_PMU_FEATURE_INSNS_RETIRED, X86_PMU_FEATURE_INSNS_RETIRED_FIXED },
> -		/*
> -		 * Note, the fixed counter for reference cycles is NOT the same
> -		 * as the general purpose architectural event. The fixed counter
> -		 * explicitly counts at the same frequency as the TSC, whereas
> -		 * the GP event counts at a fixed, but uarch specific, frequency.
> -		 * Bundle them here for simplicity.
> -		 */
> -		[INTEL_ARCH_REFERENCE_CYCLES_INDEX]	 = { X86_PMU_FEATURE_REFERENCE_CYCLES, X86_PMU_FEATURE_REFERENCE_TSC_CYCLES_FIXED },
> -		[INTEL_ARCH_LLC_REFERENCES_INDEX]	 = { X86_PMU_FEATURE_LLC_REFERENCES, X86_PMU_FEATURE_NULL },
> -		[INTEL_ARCH_LLC_MISSES_INDEX]		 = { X86_PMU_FEATURE_LLC_MISSES, X86_PMU_FEATURE_NULL },
> -		[INTEL_ARCH_BRANCHES_RETIRED_INDEX]	 = { X86_PMU_FEATURE_BRANCH_INSNS_RETIRED, X86_PMU_FEATURE_NULL },
> -		[INTEL_ARCH_BRANCHES_MISPREDICTED_INDEX] = { X86_PMU_FEATURE_BRANCHES_MISPREDICTED, X86_PMU_FEATURE_NULL },
> -		[INTEL_ARCH_TOPDOWN_SLOTS_INDEX]	 = { X86_PMU_FEATURE_TOPDOWN_SLOTS, X86_PMU_FEATURE_TOPDOWN_SLOTS_FIXED },
> -	};
> -
>  	uint32_t nr_gp_counters = this_cpu_property(X86_PROPERTY_PMU_NR_GP_COUNTERS);
>  	uint32_t pmu_version = guest_get_pmu_version();
>  	/* PERF_GLOBAL_CTRL exists only for Architectural PMU Version 2+. */
> @@ -249,7 +266,7 @@ static void guest_test_arch_event(uint8_t idx)
>  	else
>  		base_pmc_msr = MSR_IA32_PERFCTR0;
>  
> -	gp_event = intel_event_to_feature[idx].gp_event;
> +	gp_event = intel_event_to_feature(idx).gp_event;
>  	GUEST_ASSERT_EQ(idx, gp_event.f.bit);
>  
>  	GUEST_ASSERT(nr_gp_counters);
> @@ -263,14 +280,14 @@ static void guest_test_arch_event(uint8_t idx)
>  		if (guest_has_perf_global_ctrl)
>  			wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, BIT_ULL(i));
>  
> -		__guest_test_arch_event(idx, gp_event, i, base_pmc_msr + i,
> +		__guest_test_arch_event(idx, i, base_pmc_msr + i,
>  					MSR_P6_EVNTSEL0 + i, eventsel);
>  	}
>  
>  	if (!guest_has_perf_global_ctrl)
>  		return;
>  
> -	fixed_event = intel_event_to_feature[idx].fixed_event;
> +	fixed_event = intel_event_to_feature(idx).fixed_event;
>  	if (pmu_is_null_feature(fixed_event) || !this_pmu_has(fixed_event))
>  		return;
>  
> @@ -278,7 +295,7 @@ static void guest_test_arch_event(uint8_t idx)
>  
>  	wrmsr(MSR_CORE_PERF_FIXED_CTR_CTRL, FIXED_PMC_CTRL(i, FIXED_PMC_KERNEL));
>  
> -	__guest_test_arch_event(idx, fixed_event, i | INTEL_RDPMC_FIXED,
> +	__guest_test_arch_event(idx, i | INTEL_RDPMC_FIXED,
>  				MSR_CORE_PERF_FIXED_CTR0 + i,
>  				MSR_CORE_PERF_GLOBAL_CTRL,
>  				FIXED_PMC_GLOBAL_CTRL_ENABLE(i));
> @@ -546,7 +563,6 @@ static void test_fixed_counters(uint8_t pmu_version, uint64_t perf_capabilities,
>  
>  static void test_intel_counters(void)
>  {
> -	uint8_t nr_arch_events = kvm_cpu_property(X86_PROPERTY_PMU_EBX_BIT_VECTOR_LENGTH);
>  	uint8_t nr_fixed_counters = kvm_cpu_property(X86_PROPERTY_PMU_NR_FIXED_COUNTERS);
>  	uint8_t nr_gp_counters = kvm_cpu_property(X86_PROPERTY_PMU_NR_GP_COUNTERS);
>  	uint8_t pmu_version = kvm_cpu_property(X86_PROPERTY_PMU_VERSION);
> @@ -568,18 +584,26 @@ static void test_intel_counters(void)
>  
>  	/*
>  	 * Detect the existence of events that aren't supported by selftests.
> -	 * This will (obviously) fail any time the kernel adds support for a
> -	 * new event, but it's worth paying that price to keep the test fresh.
> +	 * This will (obviously) fail any time hardware adds support for a new
> +	 * event, but it's worth paying that price to keep the test fresh.
>  	 */
> -	TEST_ASSERT(nr_arch_events <= NR_INTEL_ARCH_EVENTS,
> +	TEST_ASSERT(this_cpu_property(X86_PROPERTY_PMU_EBX_BIT_VECTOR_LENGTH) <= NR_INTEL_ARCH_EVENTS,
>  		    "New architectural event(s) detected; please update this test (length = %u, mask = %x)",
> -		    nr_arch_events, kvm_cpu_property(X86_PROPERTY_PMU_EVENTS_MASK));
> +		    this_cpu_property(X86_PROPERTY_PMU_EBX_BIT_VECTOR_LENGTH),
> +		    this_cpu_property(X86_PROPERTY_PMU_EVENTS_MASK));
>  
>  	/*
> -	 * Force iterating over known arch events regardless of whether or not
> -	 * KVM/hardware supports a given event.
> +	 * Iterate over known arch events irrespective of KVM/hardware support
> +	 * to verify that KVM doesn't reject programming of events just because
> +	 * the *architectural* encoding is unsupported. Track which events are
> +	 * supported in hardware; the guest side will validate supported events
> +	 * count correctly, even if *enumeration* of the event is unsupported
> +	 * by KVM and/or isn't exposed to the guest.
>  	 */
> -	nr_arch_events = max_t(typeof(nr_arch_events), nr_arch_events, NR_INTEL_ARCH_EVENTS);
> +	for (i = 0; i < NR_INTEL_ARCH_EVENTS; i++) {
> +		if (this_pmu_has(intel_event_to_feature(i).gp_event))
> +			hardware_pmu_arch_events |= BIT(i);
> +	}
>  
>  	for (v = 0; v <= max_pmu_version; v++) {
>  		for (i = 0; i < ARRAY_SIZE(perf_caps); i++) {
> @@ -595,8 +619,8 @@ static void test_intel_counters(void)
>  			 * vector length.
>  			 */
>  			if (v == pmu_version) {
> -				for (k = 1; k < (BIT(nr_arch_events) - 1); k++)
> -					test_arch_events(v, perf_caps[i], nr_arch_events, k);
> +				for (k = 1; k < (BIT(NR_INTEL_ARCH_EVENTS) - 1); k++)
> +					test_arch_events(v, perf_caps[i], NR_INTEL_ARCH_EVENTS, k);
>  			}
>  			/*
>  			 * Test single bits for all PMU version and lengths up
> @@ -605,11 +629,11 @@ static void test_intel_counters(void)
>  			 * host length). Explicitly test a mask of '0' and all
>  			 * ones i.e. all events being available and unavailable.
>  			 */
> -			for (j = 0; j <= nr_arch_events + 1; j++) {
> +			for (j = 0; j <= NR_INTEL_ARCH_EVENTS + 1; j++) {
>  				test_arch_events(v, perf_caps[i], j, 0);
>  				test_arch_events(v, perf_caps[i], j, 0xff);
>  
> -				for (k = 0; k < nr_arch_events; k++)
> +				for (k = 0; k < NR_INTEL_ARCH_EVENTS; k++)
>  					test_arch_events(v, perf_caps[i], j, BIT(k));
>  			}
>
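
One more note on the (length, mask) loops at the end, mostly for readers less
familiar with this test: the pairs passed to test_arch_events() correspond to what
the guest sees in CPUID.0AH, i.e. EAX[31:24] is the length of the EBX bit vector
and a set EBX bit means the event is unavailable. A purely illustrative helper
(hypothetical, not something I'm proposing to add) that captures the rule the
guest side applies, which if I read the selftest helpers right is also what
this_pmu_has() checks:

#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical, for illustration only: an architectural event is advertised
 * to the guest iff its index falls within the CPUID.0AH EAX[31:24] bit-vector
 * length AND its EBX "unavailable" bit is clear.
 */
static inline bool arch_event_advertised(uint8_t length, uint8_t unavail_mask,
					 uint8_t idx)
{
	return idx < length && !(unavail_mask & (1u << idx));
}

So mask = 0xff advertises nothing regardless of length, and mask = 0 with a length
past the event's index advertises everything, which matches the "all events being
available and unavailable" cases the loop explicitly covers.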