Hi, Joshua, Thanks for your patch and sorry for late reply. Joshua Hahn <joshua.hahnjy@xxxxxxxxx> writes: > On machines with multiple memory nodes, interleaving page allocations > across nodes allows for better utilization of each node's bandwidth. > Previous work by Gregory Price [1] introduced weighted interleave, which > allowed for pages to be allocated across nodes according to user-set ratios. > > Ideally, these weights should be proportional to their bandwidth, so > that under bandwidth pressure, each node uses its maximal efficient > bandwidth and prevents latency from increasing exponentially. > > At the same time, we want these weights to be as small as possible. > Having ratios that involve large co-prime numbers like 7639:1345:7 leads > to awkward and inefficient allocations, since the node with weight 7 > will remain mostly unused (and despite being proportional to bandwidth, > will not aid in relieving the bandwidth pressure in the other two nodes). > > This patch introduces an auto-configuration mode for the interleave > weights that aims to balance the two goals of setting node weights to be > proportional to their bandwidths and keeping the weight values low. > In order to perform the weight re-scaling, we use an internal > "weightiness" value (fixed to 32) that defines interleave aggression. As asked by Andrew in another thread, you may need to make it more explicit about why we need this patch. > In this auto configuration mode, node weights are dynamically updated > every time there is a hotplug event that introduces new bandwidth. > > Users can also enter manual mode by writing "N" or "0" to the new "auto" > sysfs interface. When a user enters manual mode, the system stops > dynamically updating any of the node weights, even during hotplug events > that can shift the optimal weight distribution. The system also enters > manual mode any time a user sets a node's weight directly by using the > nodeN interface introduced in [1]. On the other hand, auto mode is > only entered by explicitly writing "Y" or "1" to the auto interface. > > There is one functional change that this patch makes to the existing > weighted_interleave ABI: previously, writing 0 directly to a nodeN > interface was said to reset the weight to the system default. Before > this patch, the default for all weights were 1, which meant that writing > 0 and 1 were functionally equivalent. > > This patch introduces "real" defaults, but moves away from letting users > use 0 as a "set to default" interface. Rather, users who want to use > system defaults should use auto mode. This patch seems to be the > appropriate place to make this change, since we would like to remove > this usage before users begin to rely on the feature in userspace. > Moreover, users will not be losing any functionality; they can still > write 1 into a node if they want a weight of 1. Thus, we deprecate the > "write zero to reset" feature in favor of returning an error, the same > way we would return an error when the user writes any other invalid > weight to the interface. > > [1] https://lore.kernel.org/linux-mm/20240202170238.90004-1-gregory.price@xxxxxxxxxxxx/ > > Signed-off-by: Joshua Hahn <joshua.hahnjy@xxxxxxxxx> > Co-developed-by: Gregory Price <gourry@xxxxxxxxxx> > Signed-off-by: Gregory Price <gourry@xxxxxxxxxx> > Reviewed-by: Hyeonggon Yoo <42.hyeyoo@xxxxxxxxx> > --- > Changelog > v5: > - I accidentally forgot to add the mm/mempolicy: subject tag since v1 of > this patch. Added to the subject now! > - Wordsmithing, correcting typos, and re-naming variables for clarity. > - No functional changes. > v4: > - Renamed the mode interface to the "auto" interface, which now only > emits either 'Y' or 'N'. Users can now interact with it by > writing 'Y', '1', 'N', or '0' to it. > - Added additional documentation to the nodeN sysfs interface. > - Makes sure iw_table locks are properly held. > - Removed unlikely() call in reduce_interleave_weights. > - Wordsmithing > > v3: > - Weightiness (max_node_weight) is now fixed to 32. > - Instead, the sysfs interface now exposes a "mode" parameter, which > can either be "auto" or "manual". > - Thank you Hyeonggon and Honggyu for the feedback. > - Documentation updated to reflect new sysfs interface, explicitly > specifies that 0 is invalid. > - Thank you Gregory and Ying for the discussion on how best to > handle the 0 case. > - Re-worked nodeN sysfs store to handle auto --> manual shifts > - mempolicy_set_node_perf internally handles the auto / manual > case differently now. bw is always updated, iw updates depend on > what mode the user is in. > - Wordsmithing comments for clarity. > - Removed RFC tag. > > v2: > - Name of the interface is changed: "max_node_weight" --> "weightiness" > - Default interleave weight table no longer exists. Rather, the > interleave weight table is initialized with the defaults, if bandwidth > information is available. > - In addition, all sections that handle iw_table have been changed > to reference iw_table if it exists, otherwise defaulting to 1. > - All instances of unsigned long are converted to uint64_t to guarantee > support for both 32-bit and 64-bit machines > - sysfs initialization cleanup > - Documentation has been rewritten to explicitly outline expected > behavior and expand on the interpretation of "weightiness". > - kzalloc replaced with kcalloc for readability > - Thank you Gregory and Hyeonggon for your review & feedback! > ...fs-kernel-mm-mempolicy-weighted-interleave | 38 +++- > drivers/acpi/numa/hmat.c | 1 + > drivers/base/node.c | 7 + > include/linux/mempolicy.h | 4 + > mm/mempolicy.c | 200 ++++++++++++++++-- > 5 files changed, 233 insertions(+), 17 deletions(-) > > diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave > index 0b7972de04e9..ef43228d135d 100644 > --- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave > +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave > @@ -20,6 +20,38 @@ Description: Weight configuration interface for nodeN > Minimum weight: 1 > Maximum weight: 255 > > - Writing an empty string or `0` will reset the weight to the > - system default. The system default may be set by the kernel > - or drivers at boot or during hotplug events. > + Writing invalid values (i.e. any values not in [1,255], > + empty string, ...) will return -EINVAL. > + > + Changing the weight to a valid value will automatically > + update the system to manual mode as well. > + > +What: /sys/kernel/mm/mempolicy/weighted_interleave/auto > +Date: February 2025 > +Contact: Linux memory management mailing list <linux-mm@xxxxxxxxx> > +Description: Auto-weighting configuration interface > + > + Configuration mode for weighted interleave. A 'Y' indicates > + that the system is in auto mode, and a 'N' indicates that > + the system is in manual mode. All other values are invalid. > + > + In auto mode, all node weights are re-calculated and overwritten > + (visible via the nodeN interfaces) whenever new bandwidth data > + is made available during either boot or hotplug events. > + > + In manual mode, node weights can only be updated by the user. > + Note that nodes that are onlined with previously set / defaulted Sorry, I don't understand this well. What is "defaulted" weights? Why not just "previously set"? > + weights will inherit those weights. If they were not previously > + set or are onlined with missing bandwidth data, they will be > + defaulted to a weight of 1. > + > + Writing Y or 1 to the interface will enable auto mode, while > + writing N or 0 will enable manual mode. All other strings will > + be ignored, and -EINVAL will be returned. > + > + If Y or 1 is written to the interface but the recalculation or > + updates fail at any point (-ENOMEM or -ENODEV), then the system > + will remain in manual mode. IMHO, it's unnecessary to make this a part of the interface definition. However, I have no strong opinion on this. > + Writing a new weight to a node directly via the nodeN interface > + will also automatically update the system to manual mode. > diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c > index 80a3481c0470..cc94cba112dd 100644 > --- a/drivers/acpi/numa/hmat.c > +++ b/drivers/acpi/numa/hmat.c > @@ -20,6 +20,7 @@ > #include <linux/list_sort.h> > #include <linux/memregion.h> > #include <linux/memory.h> > +#include <linux/mempolicy.h> > #include <linux/mutex.h> > #include <linux/node.h> > #include <linux/sysfs.h> This change should be removed? > diff --git a/drivers/base/node.c b/drivers/base/node.c > index 0ea653fa3433..16e7a5a8ebe7 100644 > --- a/drivers/base/node.c > +++ b/drivers/base/node.c > @@ -7,6 +7,7 @@ > #include <linux/init.h> > #include <linux/mm.h> > #include <linux/memory.h> > +#include <linux/mempolicy.h> > #include <linux/vmstat.h> > #include <linux/notifier.h> > #include <linux/node.h> > @@ -214,6 +215,12 @@ void node_set_perf_attrs(unsigned int nid, struct access_coordinate *coord, > break; > } > } > + > + /* When setting CPU access coordinates, update mempolicy */ > + if (access == ACCESS_COORDINATE_CPU) { > + if (mempolicy_set_node_perf(nid, coord)) > + pr_info("failed to set node%d mempolicy attrs\n", nid); > + } > } > EXPORT_SYMBOL_GPL(node_set_perf_attrs); > > diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h > index ce9885e0178a..0fe96f3ab3ef 100644 > --- a/include/linux/mempolicy.h > +++ b/include/linux/mempolicy.h > @@ -11,6 +11,7 @@ > #include <linux/slab.h> > #include <linux/rbtree.h> > #include <linux/spinlock.h> > +#include <linux/node.h> > #include <linux/nodemask.h> > #include <linux/pagemap.h> > #include <uapi/linux/mempolicy.h> > @@ -178,6 +179,9 @@ static inline bool mpol_is_preferred_many(struct mempolicy *pol) > > extern bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone); > > +extern int mempolicy_set_node_perf(unsigned int node, > + struct access_coordinate *coords); > + > #else > > struct mempolicy {}; > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index 04f35659717a..51edd3663667 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -109,6 +109,7 @@ > #include <linux/mmu_notifier.h> > #include <linux/printk.h> > #include <linux/swapops.h> > +#include <linux/gcd.h> > > #include <asm/tlbflush.h> > #include <asm/tlb.h> > @@ -138,16 +139,18 @@ static struct mempolicy default_policy = { > > static struct mempolicy preferred_node_policy[MAX_NUMNODES]; > > +static uint64_t *node_bw_table; Usually, we use "u64" instead of "uint64_t" in kernel. It appears that we don't need u64. "unsigned int" should be enough because "struct access_coordinate" uses that too. It's better to move this after iw_table definition below. And make it explicit that iw_table_lock protects all interleave weight global variables. > + > /* > - * iw_table is the sysfs-set interleave weight table, a value of 0 denotes > - * system-default value should be used. A NULL iw_table also denotes that > - * system-default values should be used. Until the system-default table > - * is implemented, the system-default is always 1. > - * > + * iw_table is the interleave weight table. > + * If bandwidth data is available and the user is in auto mode, the table ~~~~ system? > + * is populated with default values in [1,255]. If system is in manual mode, the weight value is in [1, 255] too? NULL iw_table is still a valid configuration, so we still need to describe its behavior. Right? > * iw_table is RCU protected > */ > static u8 __rcu *iw_table; > static DEFINE_MUTEX(iw_table_lock); > +static const int weightiness = 32; > +static bool weighted_interleave_auto = true; > > static u8 get_il_weight(int node) > { > @@ -156,14 +159,114 @@ static u8 get_il_weight(int node) > > rcu_read_lock(); > table = rcu_dereference(iw_table); > - /* if no iw_table, use system default */ It appears that this applies at some degree. > weight = table ? table[node] : 1; > - /* if value in iw_table is 0, use system default */ > - weight = weight ? weight : 1; > rcu_read_unlock(); > return weight; > } > > +/* > + * Convert bandwidth values into weighted interleave weights. > + * Call with iw_table_lock. > + */ > +static void reduce_interleave_weights(uint64_t *bw, u8 *new_iw) > +{ > + uint64_t sum_bw = 0, sum_iw = 0; > + uint64_t scaling_factor = 1, iw_gcd = 1; > + unsigned int i = 0; > + > + /* Recalculate the bandwidth distribution given the new info */ > + for (i = 0; i < nr_node_ids; i++) > + sum_bw += bw[i]; > + > + /* If node is not set or has < 1% of total bw, use minimum value of 1 */ > + for (i = 0; i < nr_node_ids; i++) { > + if (bw[i]) { > + scaling_factor = 100 * bw[i]; > + new_iw[i] = max(scaling_factor / sum_bw, 1); > + } else { > + new_iw[i] = 1; > + } > + sum_iw += new_iw[i]; > + } > + > + /* > + * Scale each node's share of the total bandwidth from percentages > + * to whole numbers in the range [1, weightiness] > + */ > + for (i = 0; i < nr_node_ids; i++) { > + scaling_factor = weightiness * new_iw[i]; > + new_iw[i] = max(scaling_factor / sum_iw, 1); > + if (i == 0) > + iw_gcd = new_iw[0]; > + iw_gcd = gcd(iw_gcd, new_iw[i]); > + } > + > + /* 1:2 is strictly better than 16:32. Reduce by the weights' GCD. */ > + for (i = 0; i < nr_node_ids; i++) > + new_iw[i] /= iw_gcd; > +} > + > +int mempolicy_set_node_perf(unsigned int node, struct access_coordinate *coords) > +{ > + uint64_t *old_bw, *new_bw; > + uint64_t bw_val; > + u8 *old_iw, *new_iw; > + > + /* > + * Bandwidths above this limit cause rounding errors when reducing > + * weights. This value is ~16 exabytes, which is unreasonable anyways. > + */ > + bw_val = min(coords->read_bandwidth, coords->write_bandwidth); > + if (bw_val > (U64_MAX / 10)) > + return -EINVAL; > + > + new_bw = kcalloc(nr_node_ids, sizeof(uint64_t), GFP_KERNEL); > + if (!new_bw) > + return -ENOMEM; > + > + new_iw = kcalloc(nr_node_ids, sizeof(u8), GFP_KERNEL); > + if (!new_iw) { > + kfree(new_bw); > + return -ENOMEM; > + } > + > + /* > + * Update bandwidth info, even in manual mode. That way, when switching > + * to auto mode in the future, iw_table can be overwritten using > + * accurate bw data. > + */ > + mutex_lock(&iw_table_lock); > + old_bw = node_bw_table; > + old_iw = rcu_dereference_protected(iw_table, > + lockdep_is_held(&iw_table_lock)); > + > + if (old_bw) > + memcpy(new_bw, old_bw, nr_node_ids * sizeof(uint64_t)); > + new_bw[node] = bw_val; > + node_bw_table = new_bw; > + > + if (weighted_interleave_auto) { > + reduce_interleave_weights(new_bw, new_iw); > + } else if (old_iw) { > + /* > + * The first time mempolicy_set_node_perf is called, old_iw > + * (iw_table) is null. If that is the case, assign a zeroed > + * table to it. Otherwise, free the newly allocated iw_table. We shouldn't assign a zeroed iw_table? Because 0 isn't valid now. Please check changes in weighted_interleave_nid() below. > + */ > + mutex_unlock(&iw_table_lock); > + kfree(new_iw); > + kfree(old_bw); > + return 0; > + } > + > + rcu_assign_pointer(iw_table, new_iw); > + mutex_unlock(&iw_table_lock); > + synchronize_rcu(); > + kfree(old_iw); If you want to save one synchronize_rcu() if unnecessary, you can use if (old_iw) { synchronize_rcu(); kfree(old_iw); } > + kfree(old_bw); > + return 0; > +} > + > /** > * numa_nearest_node - Find nearest node by state > * @node: Node id to start the search > @@ -1998,10 +2101,7 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx) > table = rcu_dereference(iw_table); > /* calculate the total weight */ > for_each_node_mask(nid, nodemask) { > - /* detect system default usage */ > - weight = table ? table[nid] : 1; > - weight = weight ? weight : 1; > - weight_total += weight; > + weight_total += table ? table[nid] : 1; > } > > /* Calculate the node offset based on totals */ > @@ -2010,7 +2110,6 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx) > while (target) { > /* detect system default usage */ > weight = table ? table[nid] : 1; > - weight = weight ? weight : 1; > if (target < weight) > break; > target -= weight; > @@ -3394,7 +3493,7 @@ static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr, > node_attr = container_of(attr, struct iw_node_attr, kobj_attr); > if (count == 0 || sysfs_streq(buf, "")) > weight = 0; > - else if (kstrtou8(buf, 0, &weight)) > + else if (kstrtou8(buf, 0, &weight) || weight == 0) > return -EINVAL; > > new = kzalloc(nr_node_ids, GFP_KERNEL); > @@ -3411,11 +3510,68 @@ static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr, > mutex_unlock(&iw_table_lock); > synchronize_rcu(); > kfree(old); > + weighted_interleave_auto = false; > return count; > } > > static struct iw_node_attr **node_attrs; > > +static ssize_t weighted_interleave_mode_show(struct kobject *kobj, IMHO, it's better to name the function as weighted_interleave_auto_show()? > + struct kobj_attribute *attr, char *buf) > +{ > + if (weighted_interleave_auto) > + return sysfs_emit(buf, "Y\n"); > + else > + return sysfs_emit(buf, "N\n"); str_true_false() can be used here. > +} > + > +static ssize_t weighted_interleave_mode_store(struct kobject *kobj, > + struct kobj_attribute *attr, const char *buf, size_t count) > +{ > + uint64_t *bw; > + u8 *old_iw, *new_iw; > + > + if (count == 0) > + return -EINVAL; > + > + if (sysfs_streq(buf, "N") || sysfs_streq(buf, "0")) { kstrtobool() can be used here. It can deal with 'count == 0' case too. > + weighted_interleave_auto = false; > + return count; > + } else if (!sysfs_streq(buf, "Y") && !sysfs_streq(buf, "1")) { > + return -EINVAL; > + } > + > + new_iw = kcalloc(nr_node_ids, sizeof(u8), GFP_KERNEL); > + if (!new_iw) > + return -ENOMEM; > + > + mutex_lock(&iw_table_lock); > + bw = node_bw_table; > + > + if (!bw) { > + mutex_unlock(&iw_table_lock); > + kfree(new_iw); > + return -ENODEV; > + } > + > + old_iw = rcu_dereference_protected(iw_table, > + lockdep_is_held(&iw_table_lock)); > + > + reduce_interleave_weights(bw, new_iw); > + rcu_assign_pointer(iw_table, new_iw); > + mutex_unlock(&iw_table_lock); > + > + synchronize_rcu(); > + kfree(old_iw); > + > + weighted_interleave_auto = true; Why assign weighted_interleave_auto after synchronize_rcu()? To reduce the race window, it's better to change weighted_interleave_auto and iw_table together? Is it better to put them into a data structure and change them together always? struct weighted_interleave_state { bool weighted_interleave_auto; u8 iw_table[0] }; > + return count; > +} > + > +static struct kobj_attribute wi_attr = > + __ATTR(auto, 0664, weighted_interleave_mode_show, > + weighted_interleave_mode_store); > + > static void sysfs_wi_node_release(struct iw_node_attr *node_attr, > struct kobject *parent) > { > @@ -3473,6 +3629,15 @@ static int add_weight_node(int nid, struct kobject *wi_kobj) > return 0; > } > > +static struct attribute *wi_default_attrs[] = { > + &wi_attr.attr, > + NULL > +}; > + > +static const struct attribute_group wi_attr_group = { > + .attrs = wi_default_attrs, > +}; > + > static int add_weighted_interleave_group(struct kobject *root_kobj) > { > struct kobject *wi_kobj; > @@ -3489,6 +3654,13 @@ static int add_weighted_interleave_group(struct kobject *root_kobj) > return err; > } > > + err = sysfs_create_group(wi_kobj, &wi_attr_group); > + if (err) { > + pr_err("failed to add sysfs [auto]\n"); > + kobject_put(wi_kobj); > + return err; > + } > + > for_each_node_state(nid, N_POSSIBLE) { > err = add_weight_node(nid, wi_kobj); > if (err) { --- Best Regards, Huang, Ying