Hello,

On Fri, Feb 07, 2025 at 09:40:52PM +0100, Andrea Righi wrote:
> +/*
> + * cpumasks to track idle CPUs within each NUMA node.
> + *
> + * If SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled, a single global cpumask
> + * is used to track all the idle CPUs in the system.
> + */
> +struct idle_cpus {
> 	cpumask_var_t cpu;
> 	cpumask_var_t smt;
> -} idle_masks CL_ALIGNED_IF_ONSTACK;
> +};

Can you prefix the type name with scx_?

Unrelated to this series, but I wonder whether we can replace "smt" with
"core" in the future to become more consistent with how the terms are
used in the kernel:

	struct scx_idle_masks {
		cpumask_var_t cpus;
		cpumask_var_t cores;
	};

We expose the "smt" name through kfuncs but we can rename that to "core"
through compat macros later too.
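As a purely illustrative sketch (scx_bpf_get_idle_coremask() and the
__COMPAT_ macro name below are made up; only scx_bpf_get_idle_smtmask()
exists today), the BPF-side shim could probe for the renamed kfunc and
fall back to the current one:

	/* hypothetical declaration; this kfunc doesn't exist yet */
	const struct cpumask *scx_bpf_get_idle_coremask(void) __ksym __weak;

	/*
	 * Use the new "core" kfunc when the running kernel provides it,
	 * otherwise fall back to the existing "smt" one.
	 */
	#define __COMPAT_scx_bpf_get_idle_coremask()			\
		(bpf_ksym_exists(scx_bpf_get_idle_coremask) ?		\
		 scx_bpf_get_idle_coremask() :				\
		 scx_bpf_get_idle_smtmask())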
> +/*
> + * Find the best idle CPU in the system, relative to @node.
> + */
> +s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags)
> +{
> +	nodemask_t unvisited = NODE_MASK_ALL;
> +	s32 cpu = -EBUSY;
> +
> +	if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> +		return pick_idle_cpu_from_node(cpus_allowed, NUMA_NO_NODE, flags);
> +
> +	/*
> +	 * If an initial node is not specified, start with the current
> +	 * node.
> +	 */
> +	if (node == NUMA_NO_NODE)
> +		node = numa_node_id();
> +
> +	/*
> +	 * Traverse all nodes in order of increasing distance, starting
> +	 * from @node.
> +	 *
> +	 * This loop is O(N^2), with N being the amount of NUMA nodes,
> +	 * which might be quite expensive in large NUMA systems. However,
> +	 * this complexity comes into play only when a scheduler enables
> +	 * SCX_OPS_BUILTIN_IDLE_PER_NODE and it's requesting an idle CPU
> +	 * without specifying a target NUMA node, so it shouldn't be a
> +	 * bottleneck in most cases.
> +	 *
> +	 * As a future optimization we may want to cache the list of hop
> +	 * nodes in a per-node array, instead of actually traversing them
> +	 * every time.
> +	 */
> +	for_each_numa_node(node, unvisited, N_POSSIBLE) {
> +		cpu = pick_idle_cpu_from_node(cpus_allowed, node, flags);

Maybe rename it to pick_idle_cpu_in_node() to stay in sync with
SCX_PICK_IDLE_IN_NODE? It's not like pick_idle_cpu_from_node() walks from
the node, right? It just picks within the node.

> @@ -460,38 +582,50 @@ s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool
> 
>  void scx_idle_reset_masks(void)
>  {
> +	int node;
> +
> +	if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) {
> +		cpumask_copy(idle_cpumask(NUMA_NO_NODE)->cpu, cpu_online_mask);
> +		cpumask_copy(idle_cpumask(NUMA_NO_NODE)->smt, cpu_online_mask);
> +		return;
> +	}
> +
>  	/*
>  	 * Consider all online cpus idle. Should converge to the actual state
>  	 * quickly.
>  	 */
> -	cpumask_copy(idle_masks.cpu, cpu_online_mask);
> -	cpumask_copy(idle_masks.smt, cpu_online_mask);
> -}
> +	for_each_node(node) {
> +		const struct cpumask *node_mask = cpumask_of_node(node);
> +		struct cpumask *idle_cpus = idle_cpumask(node)->cpu;
> +		struct cpumask *idle_smts = idle_cpumask(node)->smt;
> 
> -void scx_idle_init_masks(void)
> -{
> -	BUG_ON(!alloc_cpumask_var(&idle_masks.cpu, GFP_KERNEL));
> -	BUG_ON(!alloc_cpumask_var(&idle_masks.smt, GFP_KERNEL));
> +		cpumask_and(idle_cpus, cpu_online_mask, node_mask);
> +		cpumask_copy(idle_smts, idle_cpus);
> +	}

nitpick: Maybe something like the following is more symmetric with the
global case and easier to read?

	for_each_node(node) {
		const struct cpumask *node_mask = cpumask_of_node(node);

		cpumask_and(idle_cpumask(node)->cpu, cpu_online_mask, node_mask);
		cpumask_and(idle_cpumask(node)->smt, cpu_online_mask, node_mask);
	}

>  }
> 
>  static void update_builtin_idle(int cpu, bool idle)
>  {
> -	assign_cpu(cpu, idle_masks.cpu, idle);
> +	int node = idle_cpu_to_node(cpu);

minor: I wonder whether the idle_cpu_to_node() name is a bit confusing -
why does a CPU being idle have anything to do with its node mapping? If
there is a better naming convention, great. If not, it is what it is.

Thanks.

-- 
tejun