Re: [PATCH 1/5] sched/topology: Introduce for_each_numa_node() iterator

Andrea Righi <arighi@xxxxxxxxxx> · Fri, 7 Feb 2025 16:44:07 +0100

Hi Yury,

On Thu, Feb 06, 2025 at 10:57:19PM -0500, Yury Norov wrote:
> On Thu, Feb 06, 2025 at 09:15:31PM +0100, Andrea Righi wrote:
...
> > @@ -261,6 +267,29 @@ sched_numa_hop_mask(unsigned int node, unsigned int hops)
> >  }
> >  #endif	/* CONFIG_NUMA */
> >  
> > +/**
> > + * for_each_numa_node - iterate over NUMA nodes at increasing hop distances
> > + *                      from a given starting node.
> > + * @node: the iteration variable, representing the current NUMA node.
> > + * @start: the NUMA node to start the iteration from.
> > + * @visited: a nodemask_t to track the visited nodes.
> 
> nit: s/nodemask_t/nodemask

The type is actually nodemask_t, do you think it's better to mention
nodemask instead?

> 
> > + * @state: state of NUMA nodes to iterate.
> > + *
> > + * This macro iterates over NUMA nodes in increasing distance from
> > + * @start_node and yields MAX_NUMNODES when all the nodes have been
> > + * visited.
> > + *
> > + * The difference between for_each_node() and for_each_numa_node() is that
> > + * the former allows to iterate over nodes in no particular order, whereas
> > + * the latter iterates over nodes in increasing order of distance.
> 
> for_each_node iterates them in numerical order. 

Oh yes, much better. :)

> 
> > + *
> > + * Requires rcu_lock to be held.
> > + */
> 
> Please mention complexity here, which is O(N^2). 

Ok. Will also add a comment to describe why it's O(N^2).

> 
> > +#define for_each_numa_node(node, start, visited, state)				\
> > +	for (node = start;							\
> > +	     node != MAX_NUMNODES;						\
> > +	     node = sched_numa_node(&(visited), start, state))
> 
> Please braces around parameters.

Ok.

> 
> 'node < MAX_NUMNODES' is better. It will cost you nothing but will give
> some guarantee that if user passes broken start value, we don't call
> sched_numa_node().

Good point.

> 
> What about:
>         start = -EINVAL;
>         foe_each_numa_node(node, start, visited, N_ONLINE)
>                 do_something(node);
> 
> Whatever garbage user puts in 'start', at the very first iteration,
> do_something() will be executed against it. Offline node, -EINVAL or
> NUMA_NO_NODE are the examples.

So, my idea was actually to use start == NUMA_NO_NODE for all the cases
where the starting node doesn't matter, for example when calling
scx_bpf_pick_idle_cpu(), that doesn't have a prev_cpu context.

Should we implicitly fallback to for_each_node() when
start == NUMA_NO_NODE? And if it's complete garbage maybe just break and
never execute do_something(node)?

> 
> > +
> >  /**
> >   * for_each_numa_hop_mask - iterate over cpumasks of increasing NUMA distance
> >   *                          from a given node.
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index da33ec9e94ab2..e1d0a33415fb5 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -2183,6 +2183,48 @@ int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
> >  }
> >  EXPORT_SYMBOL_GPL(sched_numa_find_nth_cpu);
> >  
> > +/**
> > + * sched_numa_node - Find the NUMA node at the closest distance from
> > + *		     node @start.
> > + *
> > + * @visited: a pointer to a nodemask_t representing the visited nodes.
> > + * @start: the node to start the search from.
> 
> Maybe just 'node' The 'start' is only meaningful in the caller. Here
> we don't start, we just look for a node that is the most nearest to a
> given one.

Ok.

> 
> What if NOMA_NO_NODE is passed?

We could return the first non-visited node in numerical order. And still
fallback to for_each_node() from for_each_numa_node() when
start == NUMA_NO_NODE.

> 
> > + * @state: the node state to filter nodes by.
> > + *
> > + * This function iterates over all nodes in the given state and calculates
> > + * the distance to the starting node. It returns the node that is the
> > + * closest in terms of distance that has not already been considered (not
> > + * set in @visited and not the starting node). If the node is found, it is
> > + * marked as visited in the @visited node mask.
> > + *
> > + * Returns the node ID closest in terms of hop distance from the @start
> > + * node, or MAX_NUMNODES if no node is found (or all nodes have been
> > + * visited).
> > + */
> > +int sched_numa_node(nodemask_t *visited, int start, unsigned int state)
> 
> The name is somewhat confusing. Sched_numa_node what? If you're looking
> for a closest node, call your function find_closest_node().
> 
> We already have numa_nearest_node(). The difference between this one
> and what you're implementing here is that you add an additional mask.
> So, I'd suggest something like
> 
>  int numa_nearest_node_andnot(int node, unsigned int state, nodemask_t *mask)
> 
> Also, there's about scheduler her, so I'd suggest to place it next to
> numa_nearest_node() in mm/mempolicy.c.

Makes sense, and I agree that mm/mempolicy.c is a better place for this.

> 
> > +{
> > +	int dist, n, min_node, min_dist;
> > +
> > +	min_node = MAX_NUMNODES;
> > +	min_dist = INT_MAX;
> > +
> > +	/* Find the nearest unvisted node */
> 
> If you name it properly, you don't need to explain your intentions in
> the code. Also, at this level of abstraction, the 'visited' name may
> be too specific. Let's refer to it as just a mask containing nodes
> that user wants to skip for whatever reason. 

Ok.

> 
> 
> > +	for_each_node_state(n, state) {
> > +		if (n == start || node_isset(n, *visited))
> > +			continue;
> 
> Nah.
> 
> 1. Skipping start node is wrong. The very first call should return
> 'start' exactly. Then it will be masked out, so the traversing will
> move forward. 
> 2. This should be for_each_node_state_andnot(n, state, mask).
> 
> This way you'll be able to drop the above condition entirely.

Yeah, I agree, I'll revisit this, also considering the comments above.

> 
> > +		dist = node_distance(start, n);
> > +		if (dist < min_dist) {
> > +			min_dist = dist;
> > +			min_node = n;
> > +		}
> > +	}
> > +	if (min_node != MAX_NUMNODES)
> > +		node_set(min_node, *visited);
> 
> Is it possible to set the 'visited' bit in caller code? This way we'll
> make the function pure, like other bitmap search functions.
> 
> Would something like this work? 
> 
> int numa_nearest_node_andnot(int node, unsigned int state, const nodemask_t *mask);
> #define for_each_numa_node(node, visited, state)			                      \
> 	for (int start = (node), n = numa_nearest_node_andnot(start, (state), &(visited));    \
> 	     n < MAX_NUMNODES;					                      \
>              node_set(n, (visited)), n = numa_nearest_node_andnot(start, (state), &(visited)))
> 
> One other option to think about is that we can introduce numa_nearest_nodemask()
> and use it like this
> 
>   nodemask_t nodes;
> 
>   nodes_complement(nodes, node_states[N_ONLINE];
>   for_each_numa_node(node, unvisited)
>         do_something();
> 
> Where:
> 
> int numa_nearest_nodemask(int node, const nodemask_t *mask);
> #define for_each_numa_node(node, unvisited)			                      \
> 	for (int start = (node), n = numa_nearest_nodemask(start, &(unvisited));    \
> 	     n < MAX_NUMNODES;					                      \
>              node_clear(n, (visited)), n = numa_nearest_nodemask(start, &(visited)))

I like the numa_nearest_nodemask() idea, I'll do some experiemnts with it.

Thanks!
-Andrea