On Mon, Jul 15, 2024 at 1:05 PM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote: > > Hi Ian, > > On 2024-07-10 12:59 a.m., kernel test robot wrote: > > > > > > Hello, > > > > kernel test robot noticed "perf-sanity-tests.perf_all_PMU_test.fail" on: > > > > commit: e2641db83f18782f57a0e107c50d2d1731960fb8 ("perf vendor events: Add/update skylake events/metrics") > > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master > > > > [test failed on linux-next/master 82d01fe6ee52086035b201cfa1410a3b04384257] > > > > in testcase: perf-sanity-tests > > version: > > with following parameters: > > > > perf_compiler: gcc > > > > > > > > compiler: gcc-13 > > test machine: 16 threads 1 sockets Intel(R) Xeon(R) E-2278G CPU @ 3.40GHz (Coffee Lake) with 32G memory > > > > (please refer to attached dmesg/kmsg for entire log/backtrace) > > > > > > we also observed two cases which also failed on parent can pass on this commit. > > FYI. > > > > > > caccae3ce7b988b6 e2641db83f18782f57a0e107c50 > > ---------------- --------------------------- > > fail:runs %reproduction fail:runs > > | | | > > :6 100% 6:6 perf-sanity-tests.perf_all_PMU_test.fail > > :6 100% 6:6 perf-sanity-tests.perf_all_metricgroups_test.pass > > :6 100% 6:6 perf-sanity-tests.perf_all_metrics_test.pass > > > > > > > > > > > > If you fix the issue in a separate patch/commit (i.e. not just a new version of > > the same patch/commit), kindly add following tags > > | Reported-by: kernel test robot <oliver.sang@xxxxxxxxx> > > | Closes: https://lore.kernel.org/oe-lkp/202407101021.2c8baddb-oliver.sang@xxxxxxxxx > > > > > > > > 2024-07-09 07:09:53 sudo /usr/src/linux-perf-x86_64-rhel-8.3-bpf-e2641db83f18782f57a0e107c50d2d1731960fb8/tools/perf/perf test 105 > > 105: perf all metricgroups test : Ok > > 2024-07-09 07:10:11 sudo /usr/src/linux-perf-x86_64-rhel-8.3-bpf-e2641db83f18782f57a0e107c50d2d1731960fb8/tools/perf/perf test 106 > > 106: perf all metrics test : Ok > > 2024-07-09 07:10:23 sudo /usr/src/linux-perf-x86_64-rhel-8.3-bpf-e2641db83f18782f57a0e107c50d2d1731960fb8/tools/perf/perf test 107 > > 107: perf all libpfm4 events test : Ok > > 2024-07-09 07:10:47 sudo /usr/src/linux-perf-x86_64-rhel-8.3-bpf-e2641db83f18782f57a0e107c50d2d1731960fb8/tools/perf/perf test 108 > > 108: perf all PMU test : FAILED! > > > > The failure is caused by the below change in the e2641db83f18. > > + { > + "BriefDescription": "This 48-bit fixed counter counts the UCLK > cycles", > + "Counter": "FIXED", > + "EventCode": "0xff", > + "EventName": "UNC_CLOCK.SOCKET", > + "PerPkg": "1", > + "PublicDescription": "This 48-bit fixed counter counts the UCLK > cycles.", > + "Unit": "cbox_0" > } > > The other cbox events have the unit name "CBOX", while the fixed counter > has a unit name "cbox_0". So the events_table will maintain separate > entries for cbox and cbox_0. > > The perf_pmus__print_pmu_events() calculate the total number of events, > allocate an aliases buffer, store all the events into the buffer, sort, > and print all the aliases one by one. > > The problem is that the calculated total number of events doesn't match > the stored events on the SKL machine. > > The perf_pmu__num_events() is used to calculate the number of events. It > invokes the pmu_events_table__num_events() to go through the entire > events_table to find all events. Because of the > pmu_uncore_alias_match(), the suffix of uncore PMU will be ignored. So > the events for cbox and cbox_0 are all counted. > > When storing events into the aliases buffer, the > perf_pmu__for_each_event() only process the events for cbox. > > Since a bigger buffer was allocated, the last entry are all 0. > When printing all the aliases, null will be outputed. > > $ perf list pmu > > List of pre-defined events (to be used in -e or -M): > > (null) [Kernel PMU event] > branch-instructions OR cpu/branch-instructions/ [Kernel PMU event] > branch-misses OR cpu/branch-misses/ [Kernel PMU event] > > > I'm thinking of two ways to address it. > One is to only print all the stored events. The below patch can fix it. > > diff --git a/tools/perf/util/pmus.c b/tools/perf/util/pmus.c > index 3fcabfd8fca1..2b2f5117ff84 100644 > --- a/tools/perf/util/pmus.c > +++ b/tools/perf/util/pmus.c > @@ -485,6 +485,7 @@ void perf_pmus__print_pmu_events(const struct > print_callbacks *print_cb, void *p > perf_pmu__for_each_event(pmu, skip_duplicate_pmus, &state, > perf_pmus__print_pmu_events__callback); > } > + len = state.index; > qsort(aliases, len, sizeof(struct sevent), cmp_sevent); > for (int j = 0; j < len; j++) { > /* Skip duplicates */ > > The only drawback is that perf list will not show the new cbox_0 event. > (But the event name still works. Users can still apply perf stat -e > unc_clock.socket.) > > Since the cbox_0 event is only available on old machines (SKL and > earlier), people should already use the equivalent kernel event. It > doesn't sounds a big issue for me. I prefer this simple fix. > > I think the other way would be to modify the perf_pmu__for_each_event() > to go through all the possible PMUs. > It seems complicated and may impact others ARCHs (e.g., S390). I haven't > tried it yet. > > What do you think? > Do you see any other ways to address the issue? Ugh. It seems the sizing and then iterating approach is just prone to keep breaking. Perhaps we can switch to realloc-ed arrays to avoid the need for perf_pmu__num_events, which seems to be the source of the problems. Thanks, Ian