On Tue, Oct 15, 2024 at 2:18 PM Tejun Heo <tj@xxxxxxxxxx> wrote:
>
> Hello,
>
> On Thu, Oct 10, 2024 at 09:12:19PM -0700, Yonghong Song wrote:
> > > Let's get priv_stack in shape first (the first ~6 patches).
> >
> > I am okay to focus on the first 6 patches. But I would like to get
> > Tejun's comments about what is the best way to support hierarchical
> > bpf based scheduler.
>
> There isn't a concrete design yet, so it's difficult to say anything
> definitive but I was thinking more along the line of providing sched_ext
> kfunc helpers that perform nesting calls rather than each BPF program
> directly calling nested BPF programs.
>
> For example, let's say the scheduler hierarchy looks like this:
>
>   R + A + AA
>     |   + AB
>     + B
>
> Let's say AB has a task waking up to it and is calling ops.select_cpu():
>
>  ops.select_cpu()
>  {
>      if (does AB already have the perfect CPU sitting around)
>          direct dispatch and return the CPU;
>      if (scx_bpf_get_cpus(describe the perfect CPU))
>          direct dispatch and return the CPU;
>      if (is there any eligible idle CPU that AB is holding)
>          direct dispatch and return the CPU;
>      if (scx_bpf_get_cpus(any eligible CPUs))
>          direct dispatch and return the CPU;
>      // no idle CPU, proceed to enqueue
>      return prev_cpu;
>  }
>
> Note that the scheduler at AB doesn't have any knowledge of what's up the
> tree. It's just describing what it wants through the kfunc which is then
> responsible for nesting calls up the hierarhcy. Up a layer, this can be
> implemented like:
>
>  ops.get_cpus(CPUs description)
>  {
>      if (has any CPUs matching the description)
>          claim and return the CPUs;
>      modify CPUs description to enforce e.g. cache sharing policy;
>      and possibly to request more CPUs for batching;
>      if (scx_bpf_get_cpus(CPUs description)) {
>          store extra CPUs;
>          claim and return some of the CPUs;
>      }
>      return no CPUs available;
>  }
>
> This way, the schedulers at different layers are isolated and each only has
> to express what it wants.
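
To make the nesting in the quoted pseudocode concrete, here is a rough user-space model in C. It is only a sketch under stated assumptions: every name (struct sched_node, claim_cpu, model_scx_get_cpus, model_ops_get_cpus, model_select_cpu) is hypothetical, a "CPUs description" is reduced to a 64-bit mask, and the description rewriting and batching steps are left out. It is not sched_ext code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: one node per scheduler in the hierarchy. */
struct sched_node {
	struct sched_node *parent;	/* NULL for the root (R) */
	uint64_t held;			/* bitmask of idle CPUs this node holds */
};

/* Claim the lowest-numbered CPU in (held & wanted); -1 if there is none. */
static int claim_cpu(struct sched_node *n, uint64_t wanted)
{
	uint64_t match = n->held & wanted;
	int cpu = 0;

	if (!match)
		return -1;
	while (!(match & (1ULL << cpu)))
		cpu++;
	assert(cpu < 64);
	n->held &= ~(1ULL << cpu);
	return cpu;
}

static int model_ops_get_cpus(struct sched_node *self, uint64_t wanted);

/*
 * Stand-in for the scx_bpf_get_cpus() kfunc: the caller only describes
 * what it wants and the helper invokes the parent's callback.
 */
static int model_scx_get_cpus(struct sched_node *caller, uint64_t wanted)
{
	if (!caller->parent)
		return -1;
	return model_ops_get_cpus(caller->parent, wanted);
}

/* Stand-in for ops.get_cpus() one layer up: use own CPUs, else ask upward. */
static int model_ops_get_cpus(struct sched_node *self, uint64_t wanted)
{
	int cpu = claim_cpu(self, wanted);

	if (cpu >= 0)
		return cpu;
	return model_scx_get_cpus(self, wanted);
}

/* Stand-in for ops.select_cpu() at a leaf, following the four steps above. */
static int model_select_cpu(struct sched_node *self, uint64_t perfect,
			    uint64_t eligible, int prev_cpu)
{
	int cpu;

	if ((cpu = claim_cpu(self, perfect)) >= 0)
		return cpu;
	if ((cpu = model_scx_get_cpus(self, perfect)) >= 0)
		return cpu;
	if ((cpu = claim_cpu(self, eligible)) >= 0)
		return cpu;
	if ((cpu = model_scx_get_cpus(self, eligible)) >= 0)
		return cpu;
	return prev_cpu;	/* no idle CPU, proceed to enqueue */
}
```

In this model, when several layers come up empty, model_ops_get_cpus() and model_scx_get_cpus() call each other once per level until some ancestor has a matching CPU or the root is reached. That is the same callback-to-kfunc-to-callback pattern the quoted design relies on.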

What we've been discussing is something like this:

  ops.get_cpus -> bpf prog A -> kfunc

where the kfunc calls one of the struct_ops callbacks, which may call
bpf prog A again, since it's the only prog attached to this get_cpus
callback. So:

  ops.get_cpus -> bpf prog A -> kfunc -> ops.get_cpus -> bpf prog A

If the kfunc calls a different struct_ops callback, it will call a
different bpf prog B, which will have its own private stack.

During struct_ops registration, one of the bpf_struct_ops callbacks
such as bpf_scx_check_member (or a new callback) will indicate to the
bpf trampoline that limited recursion is allowed for a specific
ops.get_cpus. Then the bpf trampoline's bpf_trampoline_enter() selector
will pick an entry helper that allows limited recursion.

Currently the bpf trampoline doesn't check recursion for struct_ops
progs, so it needs to be tightened to allow only limited recursion and
to let the bpf jit prologue know which part of the priv stack to use.
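
A rough user-space sketch of what "limited recursion" plus "which part of priv stack to use" could look like. Everything here is an assumption for illustration: the names (struct model_prog, model_enter_limited_recur, model_exit_limited_recur, model_run), the depth limit, and the slice size are hypothetical, and in the kernel the state would be per-CPU rather than a plain struct. The idea is that the entry helper bumps a nesting counter, refuses to run the prog past the limit, and hands back a slice of the private stack selected by the current depth.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_DEPTH		4	/* hypothetical recursion limit */
#define MODEL_SLICE_SIZE	512	/* hypothetical priv stack bytes per level */

/* Hypothetical per-prog state (would be per-CPU in the kernel). */
struct model_prog {
	int active;	/* current nesting depth of this prog */
	unsigned char priv_stack[MODEL_MAX_DEPTH * MODEL_SLICE_SIZE];
};

/*
 * Entry helper allowing limited recursion: returns the priv stack slice
 * for this invocation, or NULL when the limit is hit and the prog must
 * be skipped.
 */
static unsigned char *model_enter_limited_recur(struct model_prog *p)
{
	if (p->active >= MODEL_MAX_DEPTH)
		return NULL;
	return p->priv_stack + (size_t)p->active++ * MODEL_SLICE_SIZE;
}

/* Exit helper: release the slice taken by the matching enter call. */
static void model_exit_limited_recur(struct model_prog *p)
{
	assert(p->active > 0);
	p->active--;
}

/*
 * Models ops.get_cpus -> prog A -> kfunc -> ops.get_cpus -> prog A ...
 * "nest" is how many more times the prog would like to re-enter itself.
 * Returns how many invocations actually ran.
 */
static int model_run(struct model_prog *p, int nest)
{
	unsigned char *stack = model_enter_limited_recur(p);
	int ran;

	if (!stack)
		return 0;	/* over the limit: this invocation is skipped */

	stack[0] = (unsigned char)nest;	/* use this level's slice */
	ran = 1 + (nest > 0 ? model_run(p, nest - 1) : 0);
	/* Nested invocations used other slices, so our data is intact. */
	assert(stack[0] == (unsigned char)nest);

	model_exit_limited_recur(p);
	return ran;
}
```

With these constants, model_run(&p, 2) runs three nested invocations on three distinct slices, while model_run(&p, 10) is cut off after four. A different prog B would simply have its own struct model_prog, which corresponds to it having its own private stack.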