On 9/6/2022 7:05 PM, Like Xu wrote: > On 6/9/2022 4:16 pm, Sandipan Das wrote: >> Hi Like, >> >> On 8/19/2022 4:39 PM, Like Xu wrote: >>> From: Like Xu <likexu@xxxxxxxxxxx> >>> >>> For most unit tests, the basic framework and use cases which test >>> any PMU counter do not require any changes, except for two things: >>> >>> - No access to registers introduced only in PMU version 2 and above; >>> - Expanded tolerance for testing counter overflows >>> due to the loss of uniform control of the gloabl_ctrl register >>> >>> Adding some pmu_version() return value checks can seamlessly support >>> Intel Arch PMU Version 1, while opening the door for AMD PMUs tests. >>> >>> Signed-off-by: Like Xu <likexu@xxxxxxxxxxx> >>> --- >>> x86/pmu.c | 64 +++++++++++++++++++++++++++++++++++++------------------ >>> 1 file changed, 43 insertions(+), 21 deletions(-) >>> >>> [...] >>> @@ -327,13 +335,21 @@ static void check_counter_overflow(void) >>> cnt.config &= ~EVNTSEL_INT; >>> idx = event_to_global_idx(&cnt); >>> __measure(&cnt, cnt.count); >>> - report(cnt.count == 1, "cntr-%d", i); >>> + >>> + report(check_irq() == (i % 2), "irq-%d", i); >>> + if (pmu_version() > 1) >>> + report(cnt.count == 1, "cntr-%d", i); >>> + else >>> + report(cnt.count < 4, "cntr-%d", i); >>> + >>> [...] >> >> Sorry I missed this in the previous response. With an upper bound of >> 4, I see this test failing some times for at least one of the six >> counters (with NMI watchdog disabled on the host) on a Milan (Zen 3) >> system. Increasing it further does reduce the probability but I still >> see failures. Do you see the same behaviour on systems with Zen 3 and >> older processors? > > A hundred runs on my machine did not report a failure. > Was this on a Zen 4 system? > But I'm not surprised by this, because some AMD platforms do > have hw PMU errata which requires bios or ucode fixes. > > Please help find the right upper bound for all your available AMD boxes. > Even after updating the microcode, the tests failed just as often in an overnight loop. However, upon closer inspection, the reason for failure was different. The variance is well within the bounds now but sometimes, is_the_count_reproducible() is true. Since this selects the original verification criteria (cnt.count == 1), the tests fail. > What makes me most nervous is that AMD's core hardware events run > repeatedly against the same workload, and their count results are erratic. > With that in mind, should we consider having the following change? diff --git a/x86/pmu.c b/x86/pmu.c index bb16b3c..39979b8 100644 --- a/x86/pmu.c +++ b/x86/pmu.c @@ -352,7 +352,7 @@ static void check_counter_overflow(void) .ctr = gp_counter_base, .config = EVNTSEL_OS | EVNTSEL_USR | (*gp_events)[1].unit_sel /* instructions */, }; - bool precise_event = is_the_count_reproducible(&cnt); + bool precise_event = is_intel() ? is_the_count_reproducible(&cnt) : false; __measure(&cnt, 0); count = cnt.count; With this, the tests always pass. I will run another overnight loop and report back if I see any errors. > You may check is_the_count_reproducible() in the test case: > [1]https://lore.kernel.org/kvm/20220905123946.95223-7-likexu@xxxxxxxxxxx/ On Zen 4 systems, this is always false and the overflow tests always pass irrespective of whether PerfMonV2 is enabled for the guest or not. - Sandipan