Re: [External Mail] [RFC PATCH v2] Weighted interleave auto-tuning

On 2024-12-20 4:18 AM, Joshua Hahn wrote:
On machines with multiple memory nodes, interleaving page allocations
across nodes allows for better utilization of each node's bandwidth.
Previous work by Gregory Price [1] introduced weighted interleave, which
allowed for pages to be allocated across NUMA nodes according to
user-set ratios.

Ideally, these weights should be proportional to their bandwidth, so
that under bandwidth pressure, each node uses its maximal efficient
bandwidth and prevents latency from increasing exponentially.

At the same time, we want these weights to be as small as possible.
Having ratios that involve large co-prime numbers like 7639:1345:7 leads
to awkward and inefficient allocations, since the node with weight 7
will remain mostly unused (and despite being proportional to bandwidth,
will not aid in relieving the pressure present in the other two nodes).

This patch introduces an auto-configuration for the interleave weights
that aims to balance the two goals of setting node weights to be
proportional to their bandwidths and keeping the weight values low.
This balance is controlled by a value "weightiness", which defines the
interleaving aggression. Higher values lead to less interleaving
(255:1), while lower values lead to more interleaving (1:1).

Large weightiness values generally lead to increased weight-bandwidth
proportionality, but can lead to underutilized nodes (think worst-case
scenario, which is 1:max_node_weight). Lower weightiness reduces the
effects of underutilized nodes, but may lead to improperly loaded
distributions.

s/max_node_weight/weightiness/

This knob is exposed as a sysfs interface with a default value of 32.
Weights are re-calculated once at boottime and then every time the knob
is changed by the user, or when the ACPI table is updated.

[1] https://lore.kernel.org/linux-mm/20240202170238.90004-1-gregory.price@xxxxxxxxxxxx/

Signed-off-by: Joshua Hahn <joshua.hahnjy@xxxxxxxxx>
Signed-off-by: Gregory Price <gourry@xxxxxxxxxx>
Co-Developed-by: Gregory Price <gourry@xxxxxxxxxx>

---
Changelog

v2:
- Name of the interface is changed from v1: "max_node_weight" --> "weightiness"
- Default interleave weight table no longer exists. Rather, the
   interleave weight table is initialized with the defaults, if bandwidth
   information is available.
   - In addition, all sections that handle iw_table have been changed
     to reference iw_table if it exists, otherwise defaulting to 1.
- All instances of unsigned long are converted to uint64_t to guarantee
   support for both 32-bit and 64-bit machines
- sysfs initialization cleanup
- Documentation has been rewritten to explicitly outline expected
   behavior and expand on the interpretation of "weightiness".
- kzalloc replaced with kcalloc for readability
- Thank you Gregory and Hyeonggon for your review & feedback!

  ...fs-kernel-mm-mempolicy-weighted-interleave |  36 ++++
  drivers/acpi/numa/hmat.c                      |   1 +
  drivers/base/node.c                           |   7 +
  include/linux/mempolicy.h                     |   4 +
  mm/mempolicy.c                                | 183 +++++++++++++++---
  5 files changed, 209 insertions(+), 22 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
index 0b7972de04e9..edb2c1f4753f 100644
--- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
@@ -23,3 +23,39 @@ Description:	Weight configuration interface for nodeN
  		Writing an empty string or `0` will reset the weight to the
  		system default. The system default may be set by the kernel
  		or drivers at boot or during hotplug events.
+
+What:		/sys/kernel/mm/mempolicy/weighted_interleave/weightiness
+Date:		December 2024
+Contact:	Linux memory management mailing list <linux-mm@xxxxxxxxx>
+Description:	Weight limiting / scaling interface
+
+		"Weightiness": a measure of interleave aggression between
+		memory nodes. Higher values lead to less interleaving (255:1),
+		while lower values lead to more interleaving (1:1).

It might be better to explain what low and high values of
weightiness imply, the way you described it in the changelog.

+		When this value is updated, all node weights are re-calculated
+		to reflect the new weightiness. These re-calculated values
+		overwrite all existing node weights, including those manually
+		set by writing to the nodeN files.
+
+		Node weight re-calculation is performed by scaling down
+		bandwidth values reported in the ACPI HMAT to the range
+		[1, weightiness]. Note that re-calculation uses only the
+		weightiness parameter and bandwidth values, and ignores all
+		current node weights.
+
+		Minimum weight: 1
+		Default value: 32
+		Maximum weight: 255
+
+		Writing an empty string will set the value to be the default
+		(32). Writing a value outside the valid range  will return
+		EINVAL and will not re-trigger a weight scaling.
+
+		If there is no bandwidth data in the ACPI HMAT, then this file
+		will return ENODEV on an attempted write and perform no updates.
+		Furthermore, if there is no bandwidth information available,
+		all nodes' weights will default to 1.
+
+		Setting max_node_weight to 1 is equivalent to unweighted
+		interleave.

s/max_node_weight/weightiness/
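
To check one reading of the scaling described in the documentation above, here is a minimal userspace sketch. It assumes simple linear scaling so that the fastest node receives `weightiness`, rounding to nearest, a floor of 1, and a final GCD reduction. The helper names `gcd_u64()` and `scale_weights()` are hypothetical, and the patch's reduce_interleave_weights() may well compute things differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Greatest common divisor, used to reduce the final ratio. */
static uint64_t gcd_u64(uint64_t a, uint64_t b)
{
	while (b) {
		uint64_t t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/*
 * Hypothetical helper, NOT the patch's reduce_interleave_weights():
 * scale bw[0..n) linearly so the fastest node gets `weightiness`,
 * round to nearest, clamp to at least 1, then divide by the common GCD.
 * Assumes bw[i] * weightiness does not overflow 64 bits.
 */
static void scale_weights(const uint64_t *bw, uint8_t *iw, size_t n,
			  uint8_t weightiness)
{
	uint64_t max_bw = 0, g = 0;
	size_t i;

	assert(weightiness >= 1);

	for (i = 0; i < n; i++)
		if (bw[i] > max_bw)
			max_bw = bw[i];

	for (i = 0; i < n; i++) {
		uint64_t w = 1;	/* no bandwidth data: fall back to 1 */

		if (max_bw) {
			w = (bw[i] * weightiness + max_bw / 2) / max_bw;
			if (w < 1)
				w = 1;
		}
		iw[i] = (uint8_t)w;
		g = gcd_u64(g, w);
	}

	/* Reduce the ratio, e.g. 32:16 becomes 2:1. */
	for (i = 0; i < n; i++)
		iw[i] = (uint8_t)(iw[i] / g);
}
```

Under these assumptions, the bandwidths 7639, 1345 and 7 from the changelog example give 32:6:1 at a weightiness of 32, and 1:1:1 at a weightiness of 1, which matches the statement that a weightiness of 1 is equivalent to unweighted interleave.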

@@ -3397,6 +3471,54 @@ static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
  static struct iw_node_attr **node_attrs;
+static ssize_t weightiness_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%d\n", weightiness);
+}
+
+static ssize_t weightiness_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	uint64_t *bw;
+	u8 *old_iw, *new_iw;
+	u8 new_weightiness;
+
+	if (count == 0 || sysfs_streq(buf, ""))
+		new_weightiness = 32;
+	else if (kstrtou8(buf, 0, &new_weightiness) || new_weightiness == 0)
+		return -EINVAL;
+
+	new_iw = kzalloc(nr_node_ids, GFP_KERNEL);
+	if (!new_iw)
+		return -ENOMEM;

Could you please use kcalloc here, as in mempolicy_set_node_perf()?
The v2 changelog says kzalloc was replaced with kcalloc, but this call
site still uses kzalloc. Otherwise the patch looks fine to me. (I will
add a review and test on the next revision.)
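
As a userspace illustration of the array-style allocation being requested, the sketch below uses calloc(), which like kcalloc() takes the element count and element size as separate arguments and returns zeroed memory. The function name `alloc_weight_table()` is hypothetical; in the patch itself the equivalent call would presumably be kcalloc(nr_node_ids, sizeof(u8), GFP_KERNEL).

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical userspace stand-in for the allocation in
 * weightiness_store(). Passing the count and the element size
 * separately documents that this is an array of per-node weights,
 * and lets the allocator reject a count * size overflow (current
 * C libraries such as glibc return NULL in that case).
 */
static uint8_t *alloc_weight_table(size_t nr_nodes)
{
	return calloc(nr_nodes, sizeof(uint8_t));
}
```

The caller owns the returned table and releases it with free(); a NULL return corresponds to the -ENOMEM path in the patch.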

By the way, this might be out of scope, but let me ask for my own
learning.

We have a server with 2 sockets, each with local DRAM and CXL memory
attached (and thus 4 NUMA nodes). When accessing a remote socket's memory
(either CXL or not), the bandwidth is limited by the interconnect's
bandwidth.

On this server, ideally weighted interleaving should be configured
within a socket (e.g. local NUMA node + local CXL node) because
weighted interleaving does not consider the bandwidth when accessed
from a remote socket.

So, the question is: On systems with multiple sockets (and CXL memory
attached to each socket), do you always assume the admin must bind tasks
to a specific socket for optimal performance, or is there any plan to
mitigate this problem without binding tasks to a socket?

+
+	mutex_lock(&iw_table_lock);
+	bw = node_bw_table;
+
+	if (!bw) {
+		mutex_unlock(&iw_table_lock);
+		kfree(new_iw);
+		return -ENODEV;
+	}
+
+	weightiness = new_weightiness;
+	old_iw = rcu_dereference_protected(iw_table,
+					   lockdep_is_held(&iw_table_lock));
+
+	reduce_interleave_weights(bw, new_iw);
+	rcu_assign_pointer(iw_table, new_iw);
+	mutex_unlock(&iw_table_lock);
+
+	synchronize_rcu();
+	kfree(old_iw);
+
+	return count;
+}
+
+static struct kobj_attribute wi_attr =
+	__ATTR(weightiness, 0664, weightiness_show, weightiness_store);
+
  static void sysfs_wi_node_release(struct iw_node_attr *node_attr,
  				  struct kobject *parent)
  {
@@ -3413,6 +3535,7 @@ static void sysfs_wi_release(struct kobject *wi_kobj)
  	for (i = 0; i < nr_node_ids; i++)
  		sysfs_wi_node_release(node_attrs[i], wi_kobj);
+
  	kobject_put(wi_kobj);
  }
@@ -3454,6 +3577,15 @@ static int add_weight_node(int nid, struct kobject *wi_kobj)
  	return 0;
  }
+static struct attribute *wi_default_attrs[] = {
+	&wi_attr.attr,
+	NULL
+};
+
+static const struct attribute_group wi_attr_group = {
+	.attrs = wi_default_attrs,
+};
+
  static int add_weighted_interleave_group(struct kobject *root_kobj)
  {
  	struct kobject *wi_kobj;
@@ -3470,6 +3602,13 @@ static int add_weighted_interleave_group(struct kobject *root_kobj)
  		return err;
  	}
+	err = sysfs_create_group(wi_kobj, &wi_attr_group);
+	if (err) {
+		pr_err("failed to add sysfs [weightiness]\n");
+		kobject_put(wi_kobj);
+		return err;
+	}
+
  	for_each_node_state(nid, N_POSSIBLE) {
  		err = add_weight_node(nid, wi_kobj);
  		if (err) {




