Many scx schedulers define their own concept of scheduling domains to
represent topology characteristics, such as heterogeneous architectures
(e.g., big.LITTLE, P-cores/E-cores), or to categorize tasks based on
specific properties (e.g., setting the soft-affinity of certain tasks to
a subset of CPUs).

Currently, there is no mechanism to share these domains with the
built-in idle CPU selection policy. As a result, schedulers often
implement their own idle CPU selection policies, which are typically
similar to one another, leading to a lot of code duplication.

To address this, extend the built-in idle CPU selection policy by
introducing the concept of allowed CPUs. With this concept, BPF
schedulers can apply the built-in idle CPU selection policy to a subset
of allowed CPUs, allowing them to implement their own scheduling domains
while still using the topology optimizations of the built-in policy,
thus avoiding code duplication across different schedulers.

To implement this, introduce a new helper kfunc
scx_bpf_select_cpu_and() that accepts a cpumask of allowed CPUs:

   s32 scx_bpf_select_cpu_and(struct task_struct *p,
                              const struct cpumask *cpus_allowed,
                              s32 prev_cpu, u64 wake_flags, u64 flags);

Example usage
=============

   s32 BPF_STRUCT_OPS(foo_select_cpu, struct task_struct *p,
                      s32 prev_cpu, u64 wake_flags)
   {
           const struct cpumask *dom = task_domain(p) ?: p->cpus_ptr;
           s32 cpu;

           /*
            * Pick an idle CPU in the task's domain.
            */
           cpu = scx_bpf_select_cpu_and(p, dom, prev_cpu, wake_flags, 0);
           if (cpu >= 0) {
                   scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
                   return cpu;
           }

           return prev_cpu;
   }

Results
=======

Load distribution on a 4 sockets / 4 cores per socket system, simulated
using virtme-ng, running a modified version of scx_bpfland that uses the
new helper scx_bpf_select_cpu_and() and 0xff00 as the allowed domain:

 $ vng --cpu 16,sockets=4,cores=4,threads=1
 ...
 $ stress-ng -c 16
 ...
 $ htop
 ...
  0[                        0.0%]    8[||||||||||||||||||||||||100.0%]
  1[                        0.0%]    9[||||||||||||||||||||||||100.0%]
  2[                        0.0%]   10[||||||||||||||||||||||||100.0%]
  3[                        0.0%]   11[||||||||||||||||||||||||100.0%]
  4[                        0.0%]   12[||||||||||||||||||||||||100.0%]
  5[                        0.0%]   13[||||||||||||||||||||||||100.0%]
  6[                        0.0%]   14[||||||||||||||||||||||||100.0%]
  7[                        0.0%]   15[||||||||||||||||||||||||100.0%]

With scx_bpf_select_cpu_dfl() tasks would be distributed evenly across
all the available CPUs.

ChangeLog v1 -> v2:
 - rename scx_bpf_select_cpu_pref() to scx_bpf_select_cpu_and() and
   always select idle CPUs strictly within the allowed domain
 - rename preferred CPUs -> allowed CPUs
 - drop %SCX_PICK_IDLE_IN_PREF (not required anymore)
 - deprecate scx_bpf_select_cpu_dfl() in favor of
   scx_bpf_select_cpu_and() and provide all the required backward
   compatibility boilerplate

Andrea Righi (6):
  sched_ext: idle: Honor idle flags in the built-in idle selection policy
  sched_ext: idle: Refactor scx_select_cpu_dfl()
  sched_ext: idle: Introduce the concept of allowed CPUs
  sched_ext: idle: Introduce scx_bpf_select_cpu_and()
  selftests/sched_ext: Add test for scx_bpf_select_cpu_and()
  sched_ext: idle: Deprecate scx_bpf_select_cpu_dfl()

 Documentation/scheduler/sched-ext.rst              |  11 +-
 kernel/sched/ext.c                                 |  13 +-
 kernel/sched/ext_idle.c                            | 243 +++++++++++++++------
 kernel/sched/ext_idle.h                            |   3 +-
 tools/sched_ext/include/scx/common.bpf.h           |   5 +-
 tools/sched_ext/include/scx/compat.bpf.h           |  37 ++++
 tools/sched_ext/scx_flatcg.bpf.c                   |  12 +-
 tools/sched_ext/scx_simple.bpf.c                   |   9 +-
 tools/testing/selftests/sched_ext/Makefile         |   1 +
 .../testing/selftests/sched_ext/allowed_cpus.bpf.c |  91 ++++++++
 tools/testing/selftests/sched_ext/allowed_cpus.c   |  57 +++++
 .../selftests/sched_ext/enq_select_cpu_fails.bpf.c |  12 +-
 .../selftests/sched_ext/enq_select_cpu_fails.c     |   2 +-
 tools/testing/selftests/sched_ext/exit.bpf.c       |   6 +-
 .../sched_ext/select_cpu_dfl_nodispatch.bpf.c      |  13 +-
 .../sched_ext/select_cpu_dfl_nodispatch.c          |   2 +-
 16 files changed, 405 insertions(+), 112 deletions(-)
 create mode 100644 tools/testing/selftests/sched_ext/allowed_cpus.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/allowed_cpus.c