Re: [PATCH v4 2/2] selftests/resctrl: Adjust SNC support messages

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Maciej,

On 7/12/24 2:04 AM, Maciej Wieczor-Retman wrote:
Resctrl selftest prints a message on test failure that Sub-Numa
Clustering (SNC) could be enabled and points the user to check their BIOS
settings. No actual check is performed before printing that message so
it is not very accurate in pinpointing a problem.

Figuring out if SNC is enabled is only one part of the problem, the
others being whether the detected SNC mode is reliable and whether the
kernel supports SNC in resctrl.

When there is SNC support for kernel's resctrl subsystem and SNC is
enabled then sub node files are created for each node in the resctrlfs.
The sub node files exist in each regular node's L3 monitoring directory.
The reliable path to check for existence of sub node files is
/sys/fs/resctrl/mon_data/mon_L3_00/mon_sub_L3_00.

To check if SNC detection is reliable one can check the
/sys/devices/system/cpu/offline file. If it's empty, it means all cores
are operational and the ratio should be calculated correctly. If it has
any contents, it means the detected SNC mode can't be trusted and should
be disabled.


This belongs in previous patch.

Add helpers for all operations mentioned above.

Detect SNC mode once and let other tests inherit that information.

This belongs to previous patch.


Add messages to alert the user when SNC detection could return incorrect
results. Correct old messages to account for kernel support of SNC in
resctrl.

Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@xxxxxxxxx>
---
Changelog v4:
- Change messages at the end of tests and at the start of
   run_single_test. (Reinette)
- Add messages at the end of CAT since it can also fail due to enabled
   SNC + lack of kernel support.
- Remove snc_mode global variable. (Reinette)
- Fix wrong description of snc_kernel_support(). (Reinette)
- Move call to cpus_offline_empty() into snc_nodes_per_l3_cache() so the
   whole detection flow is in one place as discussed. (Reinette)

Changelog v3:
- Change snc_ways() to snc_nodes_per_l3_cache(). (Reinette)
- Add printing the discovered SNC mode. (Reinette)
- Change method of kernel support discovery from cache sizes to
   existance of sub node files.
- Check if SNC detection is unreliable.
- Move SNC detection to only the first run_single_test() instead on
   error at the end of test runs.
- Add global value to remind user at the end of relevant tests if SNC
   detection was found to be unreliable.
- Redo the patch message after the changes.

Changelog v2:
- Move snc_ways() checks from individual tests into
   snc_kernel_support().
- Write better comment for snc_kernel_support().

  tools/testing/selftests/resctrl/cat_test.c    |  8 +++
  tools/testing/selftests/resctrl/cmt_test.c    | 10 +++-
  tools/testing/selftests/resctrl/mba_test.c    |  7 +++
  tools/testing/selftests/resctrl/mbm_test.c    |  9 ++-
  tools/testing/selftests/resctrl/resctrl.h     |  3 +
  .../testing/selftests/resctrl/resctrl_tests.c |  8 ++-
  tools/testing/selftests/resctrl/resctrlfs.c   | 57 +++++++++++++++++++
  7 files changed, 97 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/resctrl/cat_test.c b/tools/testing/selftests/resctrl/cat_test.c
index d4dffc934bc3..a8bb49f56755 100644
--- a/tools/testing/selftests/resctrl/cat_test.c
+++ b/tools/testing/selftests/resctrl/cat_test.c
@@ -285,6 +285,14 @@ static int cat_run_test(const struct resctrl_test *test, const struct user_param
ret = check_results(&param, test->resource,
  			    cache_total_size, full_cache_mask, start_mask);
+	if (ret && (get_vendor() == ARCH_INTEL) && !snc_kernel_support())
+		ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled on the system.\n");

The kernel support only applies to monitoring, there is no kernel support/changes related to CAT when SNC
is enabled.

+
+	if ((get_vendor() == ARCH_INTEL) && snc_unreliable) {
+		ksft_print_msg("Sub-NUMA Clustering could not be detected properly (see earlier messages for details).\n");
+		ksft_print_msg("Intel CAT may be inaccurate.\n");
+	}

This is still relevant but unclear why previous message checked "ret" but above does not.

+
  	return ret;
  }
diff --git a/tools/testing/selftests/resctrl/cmt_test.c b/tools/testing/selftests/resctrl/cmt_test.c
index 0c045080d808..471e134face0 100644
--- a/tools/testing/selftests/resctrl/cmt_test.c
+++ b/tools/testing/selftests/resctrl/cmt_test.c
@@ -175,8 +175,14 @@ static int cmt_run_test(const struct resctrl_test *test, const struct user_param
  		goto out;
ret = check_results(&param, span, n);
-	if (ret && (get_vendor() == ARCH_INTEL))
-		ksft_print_msg("Intel CMT may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.\n");
+	if (ret && (get_vendor() == ARCH_INTEL) && !snc_kernel_support())
+		ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled on the system.\n");
+
+	if ((get_vendor() == ARCH_INTEL) && snc_unreliable) {
+		ksft_print_msg("Sub-NUMA Clustering could not be detected properly (see earlier messages for details).\n");
+		ksft_print_msg("Intel CMT may be inaccurate.\n");
+	}
+

CMT may be inaccurate in both scenarios (no kernel support or unreliable detection). Why only
check "ret" in case there is no kernel support?

out:
  	free(span_str);
diff --git a/tools/testing/selftests/resctrl/mba_test.c b/tools/testing/selftests/resctrl/mba_test.c
index ab8496a4925b..a805c14fe04b 100644
--- a/tools/testing/selftests/resctrl/mba_test.c
+++ b/tools/testing/selftests/resctrl/mba_test.c
@@ -179,6 +179,13 @@ static int mba_run_test(const struct resctrl_test *test, const struct user_param
  		return ret;
ret = check_results();
+	if (ret && (get_vendor() == ARCH_INTEL) && !snc_kernel_support())
+		ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled on the system.\n");
+
+	if ((get_vendor() == ARCH_INTEL) && snc_unreliable) {
+		ksft_print_msg("Sub-NUMA Clustering could not be detected properly (see earlier messages for details).\n");
+		ksft_print_msg("Intel MBA may be inaccurate.\n");
+	}

As I understand there is no change to MBA when SNC is enabled. These additions thus seem unnecessary.

return ret;
  }
diff --git a/tools/testing/selftests/resctrl/mbm_test.c b/tools/testing/selftests/resctrl/mbm_test.c
index 6b5a3b52d861..ce3c86989f8b 100644
--- a/tools/testing/selftests/resctrl/mbm_test.c
+++ b/tools/testing/selftests/resctrl/mbm_test.c
@@ -147,8 +147,13 @@ static int mbm_run_test(const struct resctrl_test *test, const struct user_param
  		return ret;
ret = check_results(DEFAULT_SPAN);
-	if (ret && (get_vendor() == ARCH_INTEL))
-		ksft_print_msg("Intel MBM may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.\n");
+	if (ret && (get_vendor() == ARCH_INTEL) && !snc_kernel_support())
+		ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled on the system.\n");
+
+	if ((get_vendor() == ARCH_INTEL) && snc_unreliable) {
+		ksft_print_msg("Sub-NUMA Clustering could not be detected properly (see earlier messages for details).\n");
+		ksft_print_msg("Intel MBM may be inaccurate.\n");
+	}
return ret;
  }
diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h
index 851b37c9c38a..488bdca01e4f 100644
--- a/tools/testing/selftests/resctrl/resctrl.h
+++ b/tools/testing/selftests/resctrl/resctrl.h
@@ -121,6 +121,8 @@ struct perf_event_read {
   */
  extern volatile int *value_sink;
+extern int snc_unreliable;
+
  extern char llc_occup_path[1024];
int snc_nodes_per_l3_cache(void);
@@ -167,6 +169,7 @@ void ctrlc_handler(int signum, siginfo_t *info, void *ptr);
  int signal_handler_register(const struct resctrl_test *test);
  void signal_handler_unregister(void);
  unsigned int count_bits(unsigned long n);
+int snc_kernel_support(void);
void perf_event_attr_initialize(struct perf_event_attr *pea, __u64 config);
  void perf_event_initialize_read_format(struct perf_event_read *pe_read);
diff --git a/tools/testing/selftests/resctrl/resctrl_tests.c b/tools/testing/selftests/resctrl/resctrl_tests.c
index ecbb7605a981..4b84d6199a36 100644
--- a/tools/testing/selftests/resctrl/resctrl_tests.c
+++ b/tools/testing/selftests/resctrl/resctrl_tests.c
@@ -118,11 +118,17 @@ static bool test_vendor_specific_check(const struct resctrl_test *test)
static void run_single_test(const struct resctrl_test *test, const struct user_params *uparams)
  {
-	int ret;
+	int ret, snc_mode;
if (test->disabled)
  		return;
+ snc_mode = snc_nodes_per_l3_cache();
+	if (snc_mode > 1)
+		ksft_print_msg("SNC-%d mode discovered.\n", snc_mode);
+	else if (snc_unreliable)
+		ksft_print_msg("SNC detection unreliable due to offline CPUs. Test results may not be accurate if SNC enabled.\n");
+
  	if (!test_vendor_specific_check(test)) {
  		ksft_test_result_skip("Hardware does not support %s\n", test->name);
  		return;
diff --git a/tools/testing/selftests/resctrl/resctrlfs.c b/tools/testing/selftests/resctrl/resctrlfs.c
index 803dd415984c..4d0dbb332b8f 100644
--- a/tools/testing/selftests/resctrl/resctrlfs.c
+++ b/tools/testing/selftests/resctrl/resctrlfs.c
@@ -13,6 +13,8 @@
#include "resctrl.h" +int snc_unreliable;
+
  static int find_resctrl_mount(char *buffer)
  {
  	FILE *mounts;
@@ -186,6 +188,25 @@ static unsigned int count_sys_bitmap_bits(char *name)
  	return count;
  }
+static bool cpus_offline_empty(void)
+{
+	char offline_cpus_str[64];
+	FILE *fp;
+
+	fp = fopen("/sys/devices/system/cpu/offline", "r");
+	if (fscanf(fp, "%s", offline_cpus_str) < 0) {
+		if (!errno) {
+			fclose(fp);
+			return 1;
+		}
+		ksft_perror("Could not read offline CPUs file!");

No need to scream.

I think it will be more useful to replace "offline CPUs file" with
specific "/sys/devices/system/cpu/offline".

+	}
+
+	fclose(fp);
+
+	return 0;
+}
+
  /*
   * Detect SNC by comparing #CPUs in node0 with #CPUs sharing LLC with CPU0.
   * If some CPUs are offline the numbers may not be exact multiples of each
@@ -199,6 +220,13 @@ int snc_nodes_per_l3_cache(void)
  	static int snc_mode;
if (!snc_mode) {
+		if (!cpus_offline_empty()) {
+			ksft_print_msg("Runtime SNC detection unreliable due to offline CPUs.\n");
+			ksft_print_msg("Setting SNC mode to disabled.\n");
+			snc_mode = 1;
+			snc_unreliable = 1;
+			return snc_mode;
+		}

This can be moved to previous patch and that for loop simplified/removed.

  		node_cpus = count_sys_bitmap_bits("/sys/devices/system/node/node0/cpumap");
  		cache_cpus = count_sys_bitmap_bits("/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_map");
@@ -942,3 +970,32 @@ unsigned int count_bits(unsigned long n) return count;
  }
+
+/**
+ * snc_kernel_support - Check for existence of mon_sub_L3_00 file that indicates
+ * SNC resctrl support on the kernel side.
+ *
+ * Return: 0 if not supported, 1 if SNC is disabled or SNC is both enabled and
+ * supported.
+ */
+int snc_kernel_support(void)
+{
+	char node_path[PATH_MAX];
+	struct stat statbuf;
+	int ret;
+
+	ret = snc_nodes_per_l3_cache();
+	/*
+	 * If SNC is disabled then its kernel support isn't important.

Please expand comment that this does not take unreliable SNC detection into account and
needs to be done separately.

+	 */
+	if (ret == 1)
+		return ret;
+
+	snprintf(node_path, sizeof(node_path), "%s/%s/%s", RESCTRL_PATH, "mon_data",
+		 "mon_L3_00/mon_sub_L3_00");
+
+	if (!stat(node_path, &statbuf))
+		return 1;
+
+	return 0;
+}

Reinette




[Index of Archives]     [Linux Wireless]     [Linux Kernel]     [ATH6KL]     [Linux Bluetooth]     [Linux Netdev]     [Kernel Newbies]     [Share Photos]     [IDE]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux ATA RAID]     [Samba]     [Device Mapper]

  Powered by Linux