Re: [PATCH v5] mm/mempolicy: Weighted Interleave Auto-tuning

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, 12 Feb 2025 10:49:32 +0800 "Huang, Ying" <ying.huang@xxxxxxxxxxxxxxxxx> wrote:

> Hi, Joshua,
> 
> Thanks for your patch and sorry for late reply.

Hi Ying, no worries! Thank you for taking time to review this patch.

> Joshua Hahn <joshua.hahnjy@xxxxxxxxx> writes:

[...snip...]

> > This patch introduces an auto-configuration mode for the interleave
> > weights that aims to balance the two goals of setting node weights to be
> > proportional to their bandwidths and keeping the weight values low.
> > In order to perform the weight re-scaling, we use an internal
> > "weightiness" value (fixed to 32) that defines interleave aggression.
> 
> As asked by Andrew in another thread, you may need to make it more
> explicit about why we need this patch.

Yes, I think that makes a lot of sense. Probably some details about
why we use 32 as the default weightiness value will also help with clarity.

[...snip...]

> > diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> > index 0b7972de04e9..ef43228d135d 100644
> > --- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> > +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> > @@ -20,6 +20,38 @@ Description:	Weight configuration interface for nodeN
> >  		Minimum weight: 1
> >  		Maximum weight: 255
> >  
> > -		Writing an empty string or `0` will reset the weight to the
> > -		system default. The system default may be set by the kernel
> > -		or drivers at boot or during hotplug events.
> > +		Writing invalid values (i.e. any values not in [1,255],
> > +		empty string, ...) will return -EINVAL.
> > +
> > +		Changing the weight to a valid value will automatically
> > +		update the system to manual mode as well.
> > +
> > +What:		/sys/kernel/mm/mempolicy/weighted_interleave/auto
> > +Date:		February 2025
> > +Contact:	Linux memory management mailing list <linux-mm@xxxxxxxxx>
> > +Description:	Auto-weighting configuration interface
> > +
> > +		Configuration mode for weighted interleave. A 'Y' indicates
> > +		that the system is in auto mode, and a 'N' indicates that
> > +		the system is in manual mode. All other values are invalid.
> > +
> > +		In auto mode, all node weights are re-calculated and overwritten
> > +		(visible via the nodeN interfaces) whenever new bandwidth data
> > +		is made available during either boot or hotplug events.
> > +
> > +		In manual mode, node weights can only be updated by the user.
> > +		Note that nodes that are onlined with previously set / defaulted
> 
> Sorry, I don't understand this well.  What is "defaulted" weights?  Why
> not just "previously set"?

This is just poor word choice on my end. I meant to say that if they had
been using auto mode previously and the node contained system-calculated
weights, then offlining and onlining while in manual mode would bring
back the weights from the system. This is the scenario:

Auto: weights are set to [1,2,3]
Mode changes from auto --> manual
Manual: weights are still [1,2,3]
Manual: update node0 to be [7,2,3]
Offline node2
Manual: weights are still [7,2,3], even though node2 is offline
Online node2
Manual: weights are still [7,2,3]
                               ^ I just wanted to highlight that it will
			         online with a weight of 3, which is a
				 value set by the system.

> > +		weights will inherit those weights. If they were not previously
> > +		set or are onlined with missing bandwidth data, they will be
> > +		defaulted to a weight of 1.
> > +
> > +		Writing Y or 1 to the interface will enable auto mode, while
> > +		writing N or 0 will enable manual mode. All other strings will
> > +		be ignored, and -EINVAL will be returned.
> > +
> > +		If Y or 1 is written to the interface but the recalculation or
> > +		updates fail at any point (-ENOMEM or -ENODEV), then the system
> > +		will remain in manual mode.
> 
> IMHO, it's unnecessary to make this a part of the interface definition.
> However, I have no strong opinion on this.

Thank you for your input. I was just trying to be as explicit and
thorough as possible, but I can see how this might feel a bit cluttered.

> > +		Writing a new weight to a node directly via the nodeN interface
> > +		will also automatically update the system to manual mode.
> > diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
> > index 80a3481c0470..cc94cba112dd 100644
> > --- a/drivers/acpi/numa/hmat.c
> > +++ b/drivers/acpi/numa/hmat.c
> > @@ -20,6 +20,7 @@
> >  #include <linux/list_sort.h>
> >  #include <linux/memregion.h>
> >  #include <linux/memory.h>
> > +#include <linux/mempolicy.h>
> >  #include <linux/mutex.h>
> >  #include <linux/node.h>
> >  #include <linux/sysfs.h>
> 
> This change should be removed?

Ah I see. Harry (Hyeonggon) had pointed this out in v4 of the patch,
but I misread the response because I thought his comment was about
the include in node.c, not in hmat.c. Sorry, Harry! This is correct;
this include statement should not be here. 

[...snip...]

> >  struct mempolicy {};
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 04f35659717a..51edd3663667 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -109,6 +109,7 @@
> >  #include <linux/mmu_notifier.h>
> >  #include <linux/printk.h>
> >  #include <linux/swapops.h>
> > +#include <linux/gcd.h>
> >  
> >  #include <asm/tlbflush.h>
> >  #include <asm/tlb.h>
> > @@ -138,16 +139,18 @@ static struct mempolicy default_policy = {
> >  
> >  static struct mempolicy preferred_node_policy[MAX_NUMNODES];
> >  
> > +static uint64_t *node_bw_table;
> 
> Usually, we use "u64" instead of "uint64_t" in kernel.  It appears that
> we don't need u64. "unsigned int" should be enough because "struct
> access_coordinate" uses that too.
> 
> It's better to move this after iw_table definition below.  And make it
> explicit that iw_table_lock protects all interleave weight global
> variables.

I see, that makes sense. I will change it to unsigned int and move
it below, I agree that the scope of the lock should be clearer as well.

> > +
> >  /*
> > - * iw_table is the sysfs-set interleave weight table, a value of 0 denotes
> > - * system-default value should be used. A NULL iw_table also denotes that
> > - * system-default values should be used. Until the system-default table
> > - * is implemented, the system-default is always 1.
> > - *
> > + * iw_table is the interleave weight table.
> > + * If bandwidth data is available and the user is in auto mode, the table
>                                              ~~~~
>                                              system?

Thank you, I'll fix this.

> > + * is populated with default values in [1,255].
> 
> If system is in manual mode, the weight value is in [1, 255] too?
> 
> NULL iw_table is still a valid configuration, so we still need to
> describe its behavior.  Right?

Yes, that is true. I just wanted to explicilty note that the default
values selected by the system in auto mode will be in that range.

I'll also include what happens when iw_table is null. It only happens
when there is absolutely no bandwidth information available, so I
will note that in the description as well.

> >   * iw_table is RCU protected
> >   */
> >  static u8 __rcu *iw_table;
> >  static DEFINE_MUTEX(iw_table_lock);
> > +static const int weightiness = 32;
> > +static bool weighted_interleave_auto = true;
> >  
> >  static u8 get_il_weight(int node)
> >  {
> > @@ -156,14 +159,114 @@ static u8 get_il_weight(int node)
> >  
> >  	rcu_read_lock();
> >  	table = rcu_dereference(iw_table);
> > -	/* if no iw_table, use system default */
> 
> It appears that this applies at some degree.

Yes -- but this is only possible if there is no bandwidth data available
(and therefore, mempolicy_set_node_perf is never called). If this is
the case, then all nodes will have the same weight. Setting the default
weight to 1 means weighted interleave is reduced to unweighted interleave,
but setting the default weight to 0 will mean no memory can be allocated.

[...snip...]

> > +	if (old_bw)
> > +		memcpy(new_bw, old_bw, nr_node_ids * sizeof(uint64_t));
> > +	new_bw[node] = bw_val;
> > +	node_bw_table = new_bw;
> > +
> > +	if (weighted_interleave_auto) {
> > +		reduce_interleave_weights(new_bw, new_iw);
> > +	} else if (old_iw) {
> > +		/*
> > +		 * The first time mempolicy_set_node_perf is called, old_iw
> > +		 * (iw_table) is null. If that is the case, assign a zeroed
> > +		 * table to it. Otherwise, free the newly allocated iw_table.
> 
> We shouldn't assign a zeroed iw_table?  Because 0 isn't valid now.
> Please check changes in weighted_interleave_nid() below.

Thank you for the catch! I will fix this and write that it will contain
default values of 1 instead!.

> > +		 */
> > +		mutex_unlock(&iw_table_lock);
> > +		kfree(new_iw);
> > +		kfree(old_bw);
> > +		return 0;
> > +	}
> > +
> > +	rcu_assign_pointer(iw_table, new_iw);
> > +	mutex_unlock(&iw_table_lock);
> > +	synchronize_rcu();
> > +	kfree(old_iw);
> 
> If you want to save one synchronize_rcu() if unnecessary, you can use
> 
>         if (old_iw) {
>                 synchronize_rcu();
>                 kfree(old_iw);
>         }

I see, that makes sense. Thank you!

[...snip...]

> > +static ssize_t weighted_interleave_mode_show(struct kobject *kobj,
> 
> IMHO, it's better to name the function as
> weighted_interleave_auto_show()?

That makes sense, this was an artifact of when we named the interface
to be "mode", but it seems like I overlooked renaming this function.

> > +		struct kobj_attribute *attr, char *buf)
> > +{
> > +	if (weighted_interleave_auto)
> > +		return sysfs_emit(buf, "Y\n");
> > +	else
> > +		return sysfs_emit(buf, "N\n");
> 
> str_true_false() can be used here.

Thank you!

> > +}
> > +
> > +static ssize_t weighted_interleave_mode_store(struct kobject *kobj,
> > +		struct kobj_attribute *attr, const char *buf, size_t count)
> > +{
> > +	uint64_t *bw;
> > +	u8 *old_iw, *new_iw;
> > +
> > +	if (count == 0)
> > +		return -EINVAL;
> > +
> > +	if (sysfs_streq(buf, "N") || sysfs_streq(buf, "0")) {
> 
> kstrtobool() can be used here.  It can deal with 'count == 0' case too.

These kernel string tools are very helpful, thank you for bringing
them to my attention : -)

> > +		weighted_interleave_auto = false;
> > +		return count;
> > +	} else if (!sysfs_streq(buf, "Y") && !sysfs_streq(buf, "1")) {
> > +		return -EINVAL;
> > +	}
> > +
> > +	new_iw = kcalloc(nr_node_ids, sizeof(u8), GFP_KERNEL);
> > +	if (!new_iw)
> > +		return -ENOMEM;
> > +
> > +	mutex_lock(&iw_table_lock);
> > +	bw = node_bw_table;
> > +
> > +	if (!bw) {
> > +		mutex_unlock(&iw_table_lock);
> > +		kfree(new_iw);
> > +		return -ENODEV;
> > +	}
> > +
> > +	old_iw = rcu_dereference_protected(iw_table,
> > +					   lockdep_is_held(&iw_table_lock));
> > +
> > +	reduce_interleave_weights(bw, new_iw);
> > +	rcu_assign_pointer(iw_table, new_iw);
> > +	mutex_unlock(&iw_table_lock);
> > +
> > +	synchronize_rcu();
> > +	kfree(old_iw);
> > +
> > +	weighted_interleave_auto = true;
> 
> Why assign weighted_interleave_auto after synchronize_rcu()?  To reduce
> the race window, it's better to change weighted_interleave_auto and
> iw_table together?  Is it better to put them into a data structure and
> change them together always?
> 
>         struct weighted_interleave_state {
>                 bool weighted_interleave_auto;
>                 u8 iw_table[0]
>         };

I see, I think your explanation makes sense. For the first question,
I think your point makes sense, so I will move the updating to be
inside the rcu section.

As for the combined data structure, I think that this makes sense,
but I have a few thoughts. First, there are some times when we don't
update both of them, like moving from auto --> manual, and whenever
we just update iw_table, we don't need to update the weighted_interleave
auto field. I also have a concern that this might make the code a bit
harder to read, but that is just my humble opinion.

> ---
> Best Regards,
> Huang, Ying

Thank you Ying for your detailed comments! It seems like there are a lot
of things that I missed, thank you for reviewing my patch and catching
the mistakes. Please let me know if there are any other changes that you
see fit for the patch. Have a great day!

Joshua

Sent using hkml (https://github.com/sjp38/hackermail)





[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux