Re: [PATCH] percpu: reduce the number of searches calculating best upa

Dennis Zhou <dennis@xxxxxxxxxx> · Fri, 6 Nov 2020 23:33:41 +0000

Hi,

On Mon, Nov 02, 2020 at 02:26:47PM +0900, Wonhuyk Yang wrote:
> From: Wonhyuk Yang <vvghjk1234@xxxxxxxxx>
> 
> Best upa is determined by iterating 1 to max_upa. If the size of
> alloc_size is power of 2, numbers of iteration decrease to
> logarithmic level.
> 
> Prime factorization of alloc_size makes it easy to get possible
> upas. When alloc_size is power of 2, we can avoid cost of the
> prime factorization and possible upas are 1, 2, 4, ... max_upa.
> 
> Signed-off-by: Wonhyuk Yang <vvghjk1234@xxxxxxxxx>
> ---
>  mm/percpu.c | 20 ++++++++------------
>  1 file changed, 8 insertions(+), 12 deletions(-)
> 
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 66a93f096394..a24f3973744f 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -2689,18 +2689,17 @@ static struct pcpu_alloc_info * __init pcpu_build_alloc_info(
>  
>  	/*
>  	 * Determine min_unit_size, alloc_size and max_upa such that
> -	 * alloc_size is multiple of atom_size and is the smallest
> -	 * which can accommodate 4k aligned segments which are equal to
> -	 * or larger than min_unit_size.
> +	 * alloc_size is the maximu value of min_unit_size, atom_size.
> +	 * Also, alloc_size is power of 2 because both min_unit_size
> +	 * and atom_size are power of 2.
>  	 */
>  	min_unit_size = max_t(size_t, size_sum, PCPU_MIN_UNIT_SIZE);
> +	min_unit_size = roundup_pow_of_two(min_unit_size);

While this may make sense for the vast majority of users, there remain
users such as embedded devices that page in the first chunk and have
fairly limited use of percpu memory. In these cases, we wouldn't want to
round up the min_unit_size as that might be wasteful albeit not much but
still to be not really worth changing the behavior here.

>  
>  	/* determine the maximum # of units that can fit in an allocation */
> -	alloc_size = roundup(min_unit_size, atom_size);
> -	upa = alloc_size / min_unit_size;
> -	while (alloc_size % upa || (offset_in_page(alloc_size / upa)))
> -		upa--;
> -	max_upa = upa;
> +	alloc_size = max_t(size_t, min_unit_size, atom_size);
> +	max_upa = alloc_size / min_unit_size;
> +
>  
>  	/* group cpus according to their proximity */
>  	for_each_possible_cpu(cpu) {
> @@ -2727,12 +2726,9 @@ static struct pcpu_alloc_info * __init pcpu_build_alloc_info(
>  	 * Related to atom_size, which could be much larger than the unit_size.
>  	 */
>  	last_allocs = INT_MAX;
> -	for (upa = max_upa; upa; upa--) {
> +	for (upa = max_upa; upa; upa >>= 1) {
>  		int allocs = 0, wasted = 0;
>  
> -		if (alloc_size % upa || (offset_in_page(alloc_size / upa)))
> -			continue;
> -
>  		for (group = 0; group < nr_groups; group++) {
>  			int this_allocs = DIV_ROUND_UP(group_cnt[group], upa);
>  			allocs += this_allocs;
> -- 
> 2.17.1
> 

Overall, I'm not inclined to take this because it is trying to optimize
boot time code, which runs at most a few times, by introducing a new
assumption. I personally find this code a little complex to parse, so
I'd rather not make the change unless it aided in maintainability or was
side effect free.

Thanks,
Dennis