On Fri, Feb 07, 2025 at 12:30:37PM -1000, Tejun Heo wrote:
> Hello,
>
> On Fri, Feb 07, 2025 at 09:40:52PM +0100, Andrea Righi wrote:
> > +/*
> > + * cpumasks to track idle CPUs within each NUMA node.
> > + *
> > + * If SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled, a single global cpumask
> > + * is used to track all the idle CPUs in the system.
> > + */
> > +struct idle_cpus {
> >  	cpumask_var_t cpu;
> >  	cpumask_var_t smt;
> > -} idle_masks CL_ALIGNED_IF_ONSTACK;
> > +};
>
> Can you prefix the type name with scx_?
>
> Unrelated to this series, but I wonder whether we can replace "smt" with
> "core" in the future to become more consistent with how the terms are used
> in the kernel:
>
> 	struct scx_idle_masks {
> 		cpumask_var_t cpus;
> 		cpumask_var_t cores;
> 	};
>
> We expose the "smt" name through kfuncs, but we can rename that to "core"
> through compat macros later too.
>
> > +/*
> > + * Find the best idle CPU in the system, relative to @node.
> > + */
> > +s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags)
> > +{
> > +	nodemask_t unvisited = NODE_MASK_ALL;
> > +	s32 cpu = -EBUSY;
> > +
> > +	if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> > +		return pick_idle_cpu_from_node(cpus_allowed, NUMA_NO_NODE, flags);
> > +
> > +	/*
> > +	 * If an initial node is not specified, start with the current
> > +	 * node.
> > +	 */
> > +	if (node == NUMA_NO_NODE)
> > +		node = numa_node_id();
> > +
> > +	/*
> > +	 * Traverse all nodes in order of increasing distance, starting
> > +	 * from @node.
> > +	 *
> > +	 * This loop is O(N^2), with N being the number of NUMA nodes,
> > +	 * which might be quite expensive on large NUMA systems. However,
> > +	 * this complexity comes into play only when a scheduler enables
> > +	 * SCX_OPS_BUILTIN_IDLE_PER_NODE and requests an idle CPU without
> > +	 * specifying a target NUMA node, so it shouldn't be a bottleneck
> > +	 * in most cases.
> > +	 *
> > +	 * As a future optimization we may want to cache the list of hop
> > +	 * nodes in a per-node array, instead of actually traversing them
> > +	 * every time.
> > +	 */
> > +	for_each_numa_node(node, unvisited, N_POSSIBLE) {
> > +		cpu = pick_idle_cpu_from_node(cpus_allowed, node, flags);
>
> Maybe rename it to pick_idle_cpu_in_node() to stay in sync with
> SCX_PICK_IDLE_IN_NODE? It's not like pick_idle_cpu_from_node() walks from
> the node, right? It just picks within the node.
>
> > @@ -460,38 +582,50 @@ s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool
> >
> >  void scx_idle_reset_masks(void)
> >  {
> > +	int node;
> > +
> > +	if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) {
> > +		cpumask_copy(idle_cpumask(NUMA_NO_NODE)->cpu, cpu_online_mask);
> > +		cpumask_copy(idle_cpumask(NUMA_NO_NODE)->smt, cpu_online_mask);
> > +		return;
> > +	}
> > +
> >  	/*
> >  	 * Consider all online cpus idle. Should converge to the actual state
> >  	 * quickly.
> >  	 */
> > -	cpumask_copy(idle_masks.cpu, cpu_online_mask);
> > -	cpumask_copy(idle_masks.smt, cpu_online_mask);
> > -}
> > +	for_each_node(node) {
> > +		const struct cpumask *node_mask = cpumask_of_node(node);
> > +		struct cpumask *idle_cpus = idle_cpumask(node)->cpu;
> > +		struct cpumask *idle_smts = idle_cpumask(node)->smt;
> > -void scx_idle_init_masks(void)
> > -{
> > -	BUG_ON(!alloc_cpumask_var(&idle_masks.cpu, GFP_KERNEL));
> > -	BUG_ON(!alloc_cpumask_var(&idle_masks.smt, GFP_KERNEL));
> > +		cpumask_and(idle_cpus, cpu_online_mask, node_mask);
> > +		cpumask_copy(idle_smts, idle_cpus);
> > +	}
>
> nitpick: Maybe something like the following is more symmetric with the
> global case and easier to read?
>
> 	for_each_node(node) {
> 		const struct cpumask *node_mask = cpumask_of_node(node);
>
> 		cpumask_and(idle_cpumask(node)->cpu, cpu_online_mask, node_mask);
> 		cpumask_and(idle_cpumask(node)->smt, cpu_online_mask, node_mask);
> 	}
>
> >  }
> >
> >  static void update_builtin_idle(int cpu, bool idle)
> >  {
> > -	assign_cpu(cpu, idle_masks.cpu, idle);
> > +	int node = idle_cpu_to_node(cpu);

Ok to all of the above.

> minor: I wonder whether idle_cpu_to_node() name is a bit confusing - why
> does a CPU being idle have anything to do with its node mapping? If there
> is a better naming convention, great. If not, it is what it is.

Maybe scx_cpu_to_node()? In the end it's just a wrapper around cpu_to_node(),
but from the scx perspective: if NUMA-awareness is not enabled in scx, it'd
return NUMA_NO_NODE.

Thanks,
-Andrea
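
P.S. For concreteness, a minimal sketch of the wrapper I have in mind,
reusing the scx_builtin_idle_per_node static key from this series (the
name and exact placement are tentative, of course):

	/*
	 * Sketch only: resolve a CPU to its NUMA node from the scx point
	 * of view. When per-node idle cpumasks are disabled, all idle CPUs
	 * are tracked in a single global cpumask, hence NUMA_NO_NODE.
	 */
	static int scx_cpu_to_node(int cpu)
	{
		if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
			return NUMA_NO_NODE;

		return cpu_to_node(cpu);
	}

That way callers wouldn't need to special-case the global setup: the result
can be passed straight to idle_cpumask().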