On Mon, Dec 23, 2024 at 04:57:36PM -0800, Yury Norov wrote: > On Fri, Dec 20, 2024 at 04:11:42PM +0100, Andrea Righi wrote: > > Add the following kfunc's to provide scx schedulers direct access to > > per-node idle cpumasks information: > > > > const struct cpumask *scx_bpf_get_idle_cpumask_node(int node) > > const struct cpumask *scx_bpf_get_idle_smtmask_node(int node) > > s32 scx_bpf_pick_idle_cpu_node(const cpumask_t *cpus_allowed, > > int node, u64 flags) > > int scx_bpf_cpu_to_node(s32 cpu) > > > > Signed-off-by: Andrea Righi <arighi@xxxxxxxxxx> > > --- > > kernel/sched/ext_idle.c | 163 ++++++++++++++++++++--- > > tools/sched_ext/include/scx/common.bpf.h | 4 + > > tools/sched_ext/include/scx/compat.bpf.h | 19 +++ > > 3 files changed, 170 insertions(+), 16 deletions(-) > > > > diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c > > index b36e93da1b75..0f8ccc1e290e 100644 > > --- a/kernel/sched/ext_idle.c > > +++ b/kernel/sched/ext_idle.c > > @@ -28,6 +28,60 @@ static bool check_builtin_idle_enabled(void) > > return false; > > } > > > > +static bool check_builtin_idle_per_node_enabled(void) > > +{ > > + if (static_branch_likely(&scx_builtin_idle_per_node)) > > + return true; > > return 0; > > > + > > + scx_ops_error("per-node idle tracking is disabled"); > > + return false; > > return -ENOTSUP; Ok. > > > +} > > + > > +/* > > + * Validate and resolve a NUMA node. > > + * > > + * Return the resolved node ID on success or a negative value otherwise. > > + */ > > +static int validate_node(int node) > > +{ > > + if (!check_builtin_idle_per_node_enabled()) > > + return -EINVAL; > > So the node may be valid, but this validator may fail. EINVAL is a > misleading error code for that. You need ENOTSUP. Ok. > > > + > > + /* If no node is specified, use the current one */ > > + if (node == NUMA_NO_NODE) > > + return numa_node_id(); > > + > > + /* Make sure node is in a valid range */ > > + if (node < 0 || node >= nr_node_ids) { > > + scx_ops_error("invalid node %d", node); > > + return -ENOENT; > > No such file or directory? Hmm... > > This should be EINVAL. I would join this one with node_possible() > check. We probably need bpf_node_possible() or something... Ok about EINVAL. About bpf_node_possible() I'm not sure, it'd be convenient to have a kfunc for the BPF code to validate a node, but then we may also need to introduce bpf_node_online(), or even bpf_node_state(), ...? This can be probably addressed in a separate patch. > > > + } > > + > > + /* Make sure the node is part of the set of possible nodes */ > > + if (!node_possible(node)) { > > + scx_ops_error("unavailable node %d", node); > > Not that it's unavailable. It just doesn't exist... I'd say: > > scx_ops_error("Non-existing node %d. The existing nodes are: %pbl", > node, nodemask_pr_args(node_states[N_POSSIBLE])); > > > + return -EINVAL; > > + } > > What if user provides offline or cpu-less nodes? Is that a normal usage? > If not, it would be nice to print warning, or even return an error... I think we're returning -EBUSY in this case, which might be a reasonable error already. Triggering an scx_ops_error() seems a bit too aggressive. > > > + > > + return node; > > +} > > + > > +/* > > + * Return the node id associated to a target idle CPU (used to determine > > + * the proper idle cpumask). > > + */ > > +static int idle_cpu_to_node(int cpu) > > +{ > > + int node; > > + > > + if (static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) > > + node = cpu_to_node(cpu); > > + else > > + node = NUMA_FLAT_NODE; > > + > > + return node; > > +} > > + > > #ifdef CONFIG_SMP > > struct idle_cpumask { > > cpumask_var_t cpu; > > @@ -83,22 +137,6 @@ static void idle_masks_init(void) > > > > static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_llc); > > > > -/* > > - * Return the node id associated to a target idle CPU (used to determine > > - * the proper idle cpumask). > > - */ > > -static int idle_cpu_to_node(int cpu) > > -{ > > - int node; > > - > > - if (static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) > > - node = cpu_to_node(cpu); > > - else > > - node = NUMA_FLAT_NODE; > > - > > - return node; > > -} > > - > > static bool test_and_clear_cpu_idle(int cpu) > > { > > int node = idle_cpu_to_node(cpu); > > @@ -613,6 +651,17 @@ static void reset_idle_masks(void) {} > > */ > > __bpf_kfunc_start_defs(); > > > > +/** > > + * scx_bpf_cpu_to_node - Return the NUMA node the given @cpu belongs to > > + */ > > +__bpf_kfunc int scx_bpf_cpu_to_node(s32 cpu) > > +{ > > + if (cpu < 0 || cpu >= nr_cpu_ids) > > + return -EINVAL; > > + > > + return idle_cpu_to_node(cpu); > > +} > > + > > /** > > * scx_bpf_select_cpu_dfl - The default implementation of ops.select_cpu() > > * @p: task_struct to select a CPU for > > @@ -645,6 +694,28 @@ __bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, > > return prev_cpu; > > } > > > > +/** > > + * scx_bpf_get_idle_cpumask_node - Get a referenced kptr to the idle-tracking > > + * per-CPU cpumask of a target NUMA node. > > + * > > + * NUMA_NO_NODE is interpreted as the current node. > > + * > > + * Returns an empty cpumask if idle tracking is not enabled, if @node is not > > + * valid, or running on a UP kernel. > > + */ > > +__bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask_node(int node) > > +{ > > + node = validate_node(node); > > + if (node < 0) > > + return cpu_none_mask; > > I think I commented this in v7. This simply hides an error. You need to > return ERR_PTR(node). And your user should check it with IS_ERR_VALUE(). > > This should be consistent with scx_bpf_pick_idle_cpu_node(), where you > return an actual error. I think I changed it... somewhere, but it looks like I missed this part. :) Will change this as well, thanks! > > > + > > +#ifdef CONFIG_SMP > > + return get_idle_cpumask(node); > > +#else > > + return cpu_none_mask; > > +#endif > > +} > > + > > /** > > * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking > > * per-CPU cpumask. > > @@ -664,6 +735,32 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void) > > return get_idle_cpumask(NUMA_FLAT_NODE); > > } > > > > +/** > > + * scx_bpf_get_idle_smtmask_node - Get a referenced kptr to the idle-tracking, > > + * per-physical-core cpumask of a target NUMA node. Can be used to determine > > + * if an entire physical core is free. > > If it goes to DOCs, it should have parameters section. Ok. > > > + * > > + * NUMA_NO_NODE is interpreted as the current node. > > + * > > + * Returns an empty cpumask if idle tracking is not enabled, if @node is not > > + * valid, or running on a UP kernel. > > + */ > > +__bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask_node(int node) > > +{ > > + node = validate_node(node); > > + if (node < 0) > > + return cpu_none_mask; > > + > > +#ifdef CONFIG_SMP > > + if (sched_smt_active()) > > + return get_idle_smtmask(node); > > + else > > + return get_idle_cpumask(node); > > +#else > > + return cpu_none_mask; > > +#endif > > +} > > + > > /** > > * scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking, > > * per-physical-core cpumask. Can be used to determine if an entire physical > > @@ -722,6 +819,36 @@ __bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu) > > return false; > > } > > > > +/** > > + * scx_bpf_pick_idle_cpu_node - Pick and claim an idle cpu from a NUMA node > > + * @cpus_allowed: Allowed cpumask > > + * @node: target NUMA node > > + * @flags: %SCX_PICK_IDLE_CPU_* flags > > + * > > + * Pick and claim an idle cpu in @cpus_allowed from the NUMA node @node. > > + * Returns the picked idle cpu number on success. -%EBUSY if no matching cpu > > + * was found. > > validate_node() returns more errors. > > > + * > > + * If @node is NUMA_NO_NODE, the search is restricted to the current NUMA > > + * node. Otherwise, the search starts from @node and proceeds to other > > + * online NUMA nodes in order of increasing distance (unless > > + * SCX_PICK_IDLE_NODE is specified, in which case the search is limited to > > + * the target @node). > > Can you reorder statements, like: > > Restricted to current node if NUMA_NO_NODE. > Restricted to @node if SCX_PICK_IDLE_NODE is specified > Otherwise ... > > What if NUMA_NO_NODE + SCX_PICK_IDLE_NODE? Seems to be OK, but looks > redundant and non-intuitive. Why not if NUMA_NO_NODE provided, start > from current node, but not restrict with it? The more I think about NUMA_NO_NODE behavior, the more I'm convinved we should just return -EBUSY (or a similar error). Implicitly assuming NUMA_NO_NODE == current node seems a bit confusing in some cases. Moreover, BPF already has the bpf_get_numa_node_id() helper, so there's no reason to introduce this NUMA_NO_NODE == current node assumption. > > > + * > > + * Unavailable if ops.update_idle() is implemented and > > + * %SCX_OPS_KEEP_BUILTIN_IDLE is not set or if %SCX_OPS_KEEP_BUILTIN_IDLE is > > + * not set. > > + */ > > +__bpf_kfunc s32 scx_bpf_pick_idle_cpu_node(const struct cpumask *cpus_allowed, > > + int node, u64 flags) > > +{ > > + node = validate_node(node); > > Hold on! This validate_node() replaces NO_NODE with current node but > doesn't touch flags. It means that scx_pick_idle_cpu() will never see > NO_NODE, and will not be able to restrict to current node. The comment > above is incorrect, right? Yes, the comment is incorrect, the logic here was to simply replace NUMA_NO_NODE with current node, the restriction is only determined by SCX_PICK_IDLE_NODE. However, as mentioned above, I think we should just get rid of this NO_NODE == current node assumption, this is yet another place where it adds unnecessary complexity and it makes the code harder to follow. Thanks, -Andrea