Re: [PATCH v2 2/2] mm,nodemask: Use nr_node_ids

Yury Norov <yury.norov@xxxxxxxxx> · Wed, 2 Aug 2023 17:45:44 -0700

+ Linus, Mateusz

On Wed, Aug 02, 2023 at 09:36:16PM +0200, Peter Zijlstra wrote:
> 
> Just like how cpumask uses nr_cpu_ids to limit the bitmap scanning,
> make nodemask use nr_node_ids.
> 
> Since current users expect MAX_NUMNODES as the end-of-bitmap marker,
> retain this behaviour for now.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> Reviewed-by: Mike Rapoport (IBM) <rppt@xxxxxxxxxx>
> ---
> Changes since v1:
>  - updated and reflowed the 'borrowed' comment some more (rppt)

Hi Peter,

Thanks for the patch! I wanted to do it sooner or later.

Can you mention the commit that you used to borrow the approach.
Maybe suggested-by?

The motivation for the original patch was that the following test
revealed broken small_const_nbits() optimization for cpumasks:

  On Fri, Mar 3, 2023 at 12:39 PM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
  >
  > as an example here is a one-liner to show crappers which do 0-sized ops:
  > bpftrace -e 'kprobe:memset,kprobe:memcpy /arg2 == 0/ { @[probe,
  > kstack(2)] = count(); }'

See:
https://lore.kernel.org/lkml/CAHk-=wgfNrMFQCFWFtn+UXjAdJAGAAFFJZ1JpEomTneza32A6g@xxxxxxxxxxxxxx/

Can you make sure your patch doesn't brake the test for nodemasks?

>  include/linux/nodemask.h |  121 ++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 89 insertions(+), 32 deletions(-)
> 
> --- a/include/linux/nodemask.h
> +++ b/include/linux/nodemask.h
> @@ -99,6 +99,48 @@
>  typedef struct { DECLARE_BITMAP(bits, MAX_NUMNODES); } nodemask_t;
>  extern nodemask_t _unused_nodemask_arg_;
>  
> +#if MAX_NUMNODES > 1
> +extern unsigned int nr_node_ids;
> +#else
> +#define nr_node_ids		1U
> +#endif
> +
> +/*
> + * We have several different "preferred sizes" for the nodemask operations,
> + * depending on operation.
> + *
> + * For example, the bitmap scanning and operating operations have optimized
> + * routines that work for the single-word case, but only when the size is
> + * constant. So if MAX_NUMNODES fits in one single word, we are better off
> + * using that small constant, in order to trigger the optimized bit finding.
> + * That is 'small_nodemask_size'.
> + *
> + * The clearing and copying operations will similarly perform better with a

Copying will not, because there's no nodemask_copy(). :-)

> + * constant size, but we limit that size arbitrarily to four words. We call
> + * this 'large_nodemask_size'.
> + *
> + * Finally, some operations just want the exact limit, either because they set
> + * bits or just don't have any faster fixed-sized versions. We call this just
> + * 'nr_nodemask_bits'.
> + *
> + * Note that these optional constants are always guaranteed to be at least as
> + * big as 'nr_node_ids' itself is, and all our nodemask allocations are at
> + * least that size. The optimization comes from being able to potentially use
> + * a compile-time constant instead of a run-time generated exact number of
> + * nodes.
> + */
> +#if MAX_NUMNODES <= BITS_PER_LONG
> +  #define small_nodemask_bits ((unsigned int)MAX_NUMNODES)
> +  #define large_nodemask_bits ((unsigned int)MAX_NUMNODES)
> +#elif MAX_NUMNODES <= 4*BITS_PER_LONG
> +  #define small_nodemask_bits nr_node_ids
> +  #define large_nodemask_bits ((unsigned int)MAX_NUMNODES)
> +#else
> +  #define small_nodemask_bits nr_node_ids
> +  #define large_nodemask_bits nr_node_ids
> +#endif
> +#define nr_nodemask_bits nr_node_ids

We don't need nr_nodemask_bits. In CPU subsystem nr_cpumask_bits
exists (existed) to support dynamic allocation for cpumask_var_t
if CPUMASK_OFFSTACK is enabled. And it apparently caused troubles.

In nodemasks we don't have an offstack feature, and don't need the
nr_nodemask_bits. Just use nr_node_ids everywhere.

[...]

> -#define nodes_setall(dst) __nodes_setall(&(dst), MAX_NUMNODES)
> +#define nodes_setall(dst) __nodes_setall(&(dst), large_nodemask_bits)
>  static inline void __nodes_setall(nodemask_t *dstp, unsigned int nbits)
>  {
>  	bitmap_fill(dstp->bits, nbits);
>  }

When MAX_NUMNODES <= 4*BITS_PER_LONG, this breaks the rule that all
bits beyond nr_node_ids must be clear. And that in turn may brake
nodemask_weight() and others. Refer to this patch for details and
correct implementation:

63355b9884b ("cpumask: be more careful with 'cpumask_setall()'")

[...]

> @@ -452,7 +511,6 @@ static inline unsigned int next_memory_n
>  	return next_node(nid, node_states[N_MEMORY]);
>  }
>  
> -extern unsigned int nr_node_ids;
>  extern unsigned int nr_online_nodes;
>  
>  static inline void node_set_online(int nid)
> @@ -494,7 +552,6 @@ static inline int num_node_state(enum no
>  #define first_memory_node	0
>  #define next_online_node(nid)	(MAX_NUMNODES)
>  #define next_memory_node(nid)	(MAX_NUMNODES)
> -#define nr_node_ids		1U
>  #define nr_online_nodes		1U

I like how you separated the nr_node_ids from the other ifdefery, and
changed it to __ro_after_init. But I think it's better to fold this all
into the 1st patch.

Why don't we make nr_cpu_ids to be a __ro_after_init just as well?

Thanks,
Yury