Re: [PATCH 1/2 v6] mm/mempolicy: Weighted Interleave Auto-tuning

On Fri, 28 Feb 2025 15:39:55 +0900 Yunjeong Mun <yunjeong.mun@xxxxxx> wrote:

Hi Yunjeong, thank you for taking time to review my work!

> Hi, Joshua. 
> 
> First of all I accidentally sent the wrong email a few hours ago.
> Please disregard it. Sorry for the confusion.

No worries at all!

> On Wed, 26 Feb 2025 13:35:17 -0800 Joshua Hahn <joshua.hahnjy@xxxxxxxxx> wrote:
> 
> [...snip...]
> >  
> > +/*
> > + * Convert bandwidth values into weighted interleave weights.
> > + * Call with iw_table_lock.
> > + */
> > +static void reduce_interleave_weights(unsigned int *bw, u8 *new_iw)
> > +{
> > +	u64 sum_bw = 0;
> > +	unsigned int cast_sum_bw, sum_iw = 0;
> > +	unsigned int scaling_factor = 1, iw_gcd = 1;
> > +	int nid;
> > +
> > +	/* Recalculate the bandwidth distribution given the new info */
> > +	for_each_node_state(nid, N_MEMORY)
> > +		sum_bw += bw[nid];
> > +
> > +	for (nid = 0; nid < nr_node_ids; nid++) {
> > +		/* Set memoryless nodes' weights to 1 to prevent div/0 later */
> > +		if (!node_state(nid, N_MEMORY)) {
> > +			new_iw[nid] = 1;
> > +			continue;
> > +		}
> > +
> > +		scaling_factor = 100 * bw[nid];
> > +
> > +		/*
> > +		 * Try not to perform 64-bit division.
> > +		 * If sum_bw < scaling_factor, then sum_bw < U32_MAX.
> > +		 * If sum_bw > scaling_factor, then bw[nid] is less than
> > +		 * 1% of the total bandwidth. Round up to 1%.
> > +		 */
> > +		if (bw[nid] && sum_bw < scaling_factor) {
> > +			cast_sum_bw = (unsigned int)sum_bw;
> > +			new_iw[nid] = scaling_factor / cast_sum_bw;
> > +		} else {
> > +			new_iw[nid] = 1;
> > +		}
> > +		sum_iw += new_iw[nid];
> > +	}
> > +
> > +	/*
> > +	 * Scale each node's share of the total bandwidth from percentages
> > +	 * to whole numbers in the range [1, weightiness]
> > +	 */
> > +	for_each_node_state(nid, N_MEMORY) {
> > +		scaling_factor = weightiness * new_iw[nid];
> > +		new_iw[nid] = max(scaling_factor / sum_iw, 1);
> > +		if (nid == 0)
> > +			iw_gcd = new_iw[0];
> > +		iw_gcd = gcd(iw_gcd, new_iw[nid]);
> > +	}
> > +
> > +	/* 1:2 is strictly better than 16:32. Reduce by the weights' GCD. */
> > +	for_each_node_state(nid, N_MEMORY)
> > +		new_iw[nid] /= iw_gcd;
> > +}
> 
> In my understanding, new_iw[nid] values are scaled twice, first to 100 and then to a 
> weightines value of 32. I think this scaling can be done just once, directly 
> to weightness value as follows:

Yes, you are correct. I want to provide a bit of context on how this
patch has changed over time: in the first few iterations of this patch,
"weightiness" was actually exposed as a sysfs interface, so users could
raise it for better weight accuracy (at the cost of page allocation
distribution fairness) or lower it for better local fairness (at the
cost of larger rounding errors).

The reason this matters is that we use a heuristic of rounding any
weight that is less than 1% of the total bandwidth up to 1%.
So if we have bandwidth ratios of 100 : 1000 : 3000 : 4000 : 6000,
the sum is 14100. Then 100/14100 is only ~0.7%, and we would want to
round it up to 1% before moving on (since weights that are too small
don't end up helping). This problem only gets worse for machines with
more nodes, where it becomes possible for a node to have something like
0.1% of the total bandwidth.

When users could set weightiness as high as 255, this was problematic,
because scenarios where the weights become 1:255:255:255:255... are
possible: we would allocate a single page from one node, then 255 pages
from each of the remaining nr_node_ids - 1 nodes (which is, of course,
not ideal). However, with weightiness fixed at 32, maybe this heuristic
makes less sense, since the worst-case scenario looks like 1:32:32:32:32...

I think this proposed change makes a lot of sense. It does seem silly to
round twice, and without giving users the ability to set their own
weightiness value, rounding just once seems to be enough to prevent
the worst-case scenario. I will incorporate this into a v7.

I'm also going to wait a bit for more feedback to come in for this version,
so it may be a little bit before I send v7 out :-)

Thanks again for your review and the proposed change. Have a great day!
Joshua

> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 50cbb7c047fa..65a7e2baf161 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -176,47 +176,22 @@ static u8 get_il_weight(int node)
>  static void reduce_interleave_weights(unsigned int *bw, u8 *new_iw)
>  {
> 	u64 sum_bw = 0;
> -	unsigned int cast_sum_bw, sum_iw = 0;
> -	unsigned int scaling_factor = 1, iw_gcd = 1;
> +	unsigned int scaling_factor = 1, iw_gcd = 0;
> 	int nid;
> 
> 	/* Recalculate the bandwidth distribution given the new info */
> 	for_each_node_state(nid, N_MEMORY)
> 		sum_bw += bw[nid];
> 
> -       for (nid = 0; nid < nr_node_ids; nid++) {
>  			[...snip...]
> -		/*
> -		 * Try not to perform 64-bit division.
> -		 * If sum_bw < scaling_factor, then sum_bw < U32_MAX.
> -		 * If sum_bw > scaling_factor, then bw[nid] is less than
> -		 * 1% of the total bandwidth. Round up to 1%.
> -		 */
>  			[...snip...]
> -		sum_iw += new_iw[nid];
> -	}
> -
>      
> 	/*
> 	 * Scale each node's share of the total bandwidth from percentages
> 	 * to whole numbers in the range [1, weightiness]
> 	 */
> 	for_each_node_state(nid, N_MEMORY) {
> -		scaling_factor = weightiness * new_iw[nid];
> -		new_iw[nid] = max(scaling_factor / sum_iw, 1);
> -		if (nid == 0)
> -			iw_gcd = new_iw[0];
> +		scaling_factor = weightiness * bw[nid];
> +		new_iw[nid] = max(scaling_factor / sum_bw, 1);
> +		if (!iw_gcd)
> +			iw_gcd = new_iw[nid];
> 		iw_gcd = gcd(iw_gcd, new_iw[nid]);
> 	}
> 
> Please let me know how you think about this.
> 
> Best regards,
> Yunjeong

Sent using hkml (https://github.com/sjp38/hackermail)



