Hi Maciej, On 10/29/24 6:00 AM, Maciej Wieczor-Retman wrote: > Resctrl selftest prints a message on test failure that Sub-Numa > Clustering (SNC) could be enabled and points the user to check their BIOS > settings. No actual check is performed before printing that message so > it is not very accurate in pinpointing a problem. > > Figuring out if SNC is enabled is only one part of the problem, the > others being whether the detected SNC mode is reliable and whether the > kernel supports SNC in resctrl. Starting to sound like previous patch ... > > When there is SNC support for kernel's resctrl subsystem and SNC is > enabled then sub node files are created for each node in the resctrlfs. > The sub node files exist in each regular node's L3 monitoring directory. > The reliable path to check for existence of sub node files is > /sys/fs/resctrl/mon_data/mon_L3_00/mon_sub_L3_00. ... this is about previous patch ... > > To check if SNC detection is reliable one can check the > /sys/devices/system/cpu/offline file. If it's empty, it means all cores > are operational and the ratio should be calculated correctly. If it has > any contents, it means the detected SNC mode can't be trusted and should > be disabled. ... also about previous patch ... > > Add helpers for all operations mentioned above. Not done in this patch. > > Detect SNC mode once and let other tests inherit that information. Not done in this patch. > > Add messages to alert the user when SNC detection could return incorrect > results. Correct old messages to account for kernel support of SNC in > resctrl. Sounds like description of what earlier version of this patch did ... > > Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@xxxxxxxxx> > --- > Changelog v5: > - Move all resctrlfs.c code from this patch to 1/2. (Reinette) Please update the changelog to match the changes made to the patch. > - Remove kernel support check and error message from CAT since it can't > be happen. > - Remove snc checks in CAT since snc doesn't affect it here. > - Skip MBM, MBA and CMT tests if snc is unreliable. > > Changelog v4: > - Change messages at the end of tests and at the start of > run_single_test. (Reinette) > - Add messages at the end of CAT since it can also fail due to enabled > SNC + lack of kernel support. > - Remove snc_mode global variable. (Reinette) > - Fix wrong description of snc_kernel_support(). (Reinette) > - Move call to cpus_offline_empty() into snc_nodes_per_l3_cache() so the > whole detection flow is in one place as discussed. (Reinette) > > Changelog v3: > - Change snc_ways() to snc_nodes_per_l3_cache(). (Reinette) > - Add printing the discovered SNC mode. (Reinette) > - Change method of kernel support discovery from cache sizes to > existance of sub node files. > - Check if SNC detection is unreliable. > - Move SNC detection to only the first run_single_test() instead on > error at the end of test runs. > - Add global value to remind user at the end of relevant tests if SNC > detection was found to be unreliable. > - Redo the patch message after the changes. > > Changelog v2: > - Move snc_ways() checks from individual tests into > snc_kernel_support(). > - Write better comment for snc_kernel_support(). > > tools/testing/selftests/resctrl/cmt_test.c | 8 ++++++-- > tools/testing/selftests/resctrl/mba_test.c | 8 +++++++- > tools/testing/selftests/resctrl/mbm_test.c | 10 +++++++--- > tools/testing/selftests/resctrl/resctrl.h | 3 +++ > 4 files changed, 23 insertions(+), 6 deletions(-) > > diff --git a/tools/testing/selftests/resctrl/cmt_test.c b/tools/testing/selftests/resctrl/cmt_test.c > index 0c045080d808..1470bd64d158 100644 > --- a/tools/testing/selftests/resctrl/cmt_test.c > +++ b/tools/testing/selftests/resctrl/cmt_test.c > @@ -133,6 +133,10 @@ static int cmt_run_test(const struct resctrl_test *test, const struct user_param > ret = get_cache_size(uparams->cpu, "L3", &cache_total_size); > if (ret) > return ret; > + > + if ((get_vendor() == ARCH_INTEL) && snc_unreliable) > + ksft_exit_skip("Sub-NUMA Clustering could not be detected properly. Skipping...\n"); > + This (inconsistent) duplication can be moved to run_single_test(). > ksft_print_msg("Cache size :%lu\n", cache_total_size); > > count_of_bits = count_bits(long_mask); > @@ -175,8 +179,8 @@ static int cmt_run_test(const struct resctrl_test *test, const struct user_param > goto out; > > ret = check_results(¶m, span, n); > - if (ret && (get_vendor() == ARCH_INTEL)) > - ksft_print_msg("Intel CMT may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.\n"); > + if (ret && (get_vendor() == ARCH_INTEL) && !snc_kernel_support()) > + ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled on the system.\n"); > > out: > free(span_str); > diff --git a/tools/testing/selftests/resctrl/mba_test.c b/tools/testing/selftests/resctrl/mba_test.c > index ab8496a4925b..8f4e198da047 100644 > --- a/tools/testing/selftests/resctrl/mba_test.c > +++ b/tools/testing/selftests/resctrl/mba_test.c > @@ -170,15 +170,21 @@ static int mba_run_test(const struct resctrl_test *test, const struct user_param > .setup = mba_setup, > .measure = mba_measure, > }; > - int ret; > + int ret, snc_support; > > remove(RESULT_FILE_NAME); > > + snc_support = snc_kernel_support(); > + if ((get_vendor() == ARCH_INTEL) && snc_unreliable) > + ksft_exit_skip("Sub-NUMA Clustering could not be detected properly. Skipping...\n"); > + > ret = resctrl_val(test, uparams, uparams->benchmark_cmd, ¶m); > if (ret) > return ret; > > ret = check_results(); > + if (ret && (get_vendor() == ARCH_INTEL) && !snc_support) > + ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled on the system.\n"); > > return ret; > } > diff --git a/tools/testing/selftests/resctrl/mbm_test.c b/tools/testing/selftests/resctrl/mbm_test.c > index 6b5a3b52d861..a68f70589b91 100644 > --- a/tools/testing/selftests/resctrl/mbm_test.c > +++ b/tools/testing/selftests/resctrl/mbm_test.c > @@ -138,17 +138,21 @@ static int mbm_run_test(const struct resctrl_test *test, const struct user_param > .setup = mbm_setup, > .measure = mbm_measure, > }; > - int ret; > + int ret, snc_support; > > remove(RESULT_FILE_NAME); > > + snc_support = snc_kernel_support(); > + if ((get_vendor() == ARCH_INTEL) && snc_unreliable) > + ksft_exit_skip("Sub-NUMA Clustering could not be detected properly. Skipping...\n"); > + > ret = resctrl_val(test, uparams, uparams->benchmark_cmd, ¶m); > if (ret) > return ret; > > ret = check_results(DEFAULT_SPAN); > - if (ret && (get_vendor() == ARCH_INTEL)) > - ksft_print_msg("Intel MBM may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.\n"); > + if (ret && (get_vendor() == ARCH_INTEL) && !snc_support) > + ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled on the system.\n"); > > return ret; > } > diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h > index 851b37c9c38a..488bdca01e4f 100644 > --- a/tools/testing/selftests/resctrl/resctrl.h > +++ b/tools/testing/selftests/resctrl/resctrl.h > @@ -121,6 +121,8 @@ struct perf_event_read { > */ > extern volatile int *value_sink; > > +extern int snc_unreliable; > + > extern char llc_occup_path[1024]; > > int snc_nodes_per_l3_cache(void); > @@ -167,6 +169,7 @@ void ctrlc_handler(int signum, siginfo_t *info, void *ptr); > int signal_handler_register(const struct resctrl_test *test); > void signal_handler_unregister(void); > unsigned int count_bits(unsigned long n); > +int snc_kernel_support(void); > > void perf_event_attr_initialize(struct perf_event_attr *pea, __u64 config); > void perf_event_initialize_read_format(struct perf_event_read *pe_read); Reinette