Re: [PATCH bpf-next v4 07/10] bpf: Support calling non-tailcall bpf prog

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Tue, 15 Oct 2024 14:35:00 -0700

On Tue, Oct 15, 2024 at 2:18 PM Tejun Heo <tj@xxxxxxxxxx> wrote:
>
> Hello,
>
> On Thu, Oct 10, 2024 at 09:12:19PM -0700, Yonghong Song wrote:
> > > Let's get priv_stack in shape first (the first ~6 patches).
> >
> > I am okay to focus on the first 6 patches. But I would like to get
> > Tejun's comments about what is the best way to support hierarchical
> > bpf based scheduler.
>
> There isn't a concrete design yet, so it's difficult to say anything
> definitive but I was thinking more along the lines of providing sched_ext
> kfunc helpers that perform nesting calls rather than each BPF program
> directly calling nested BPF programs.
>
> For example, let's say the scheduler hierarchy looks like this:
>
>   R + A + AA
>     |   + AB
>     + B
>
> Let's say AB has a task waking up to it and is calling ops.select_cpu():
>
>  ops.select_cpu()
>  {
>         if (does AB already have the perfect CPU sitting around)
>                 direct dispatch and return the CPU;
>         if (scx_bpf_get_cpus(describe the perfect CPU))
>                 direct dispatch and return the CPU;
>         if (is there any eligible idle CPU that AB is holding)
>                 direct dispatch and return the CPU;
>         if (scx_bpf_get_cpus(any eligible CPUs))
>                 direct dispatch and return the CPU;
>         // no idle CPU, proceed to enqueue
>         return prev_cpu;
>  }
>
> Note that the scheduler at AB doesn't have any knowledge of what's up the
> tree. It's just describing what it wants through the kfunc which is then
> responsible for nesting calls up the hierarchy. Up a layer, this can be
> implemented like:
>
>  ops.get_cpus(CPUs description)
>  {
>         if (has any CPUs matching the description)
>                 claim and return the CPUs;
>         modify CPUs description to enforce e.g. cache sharing policy;
>         and possibly to request more CPUs for batching;
>         if (scx_bpf_get_cpus(CPUs description)) {
>                 store extra CPUs;
>                 claim and return some of the CPUs;
>         }
>         return no CPUs available;
>  }
>
> This way, the schedulers at different layers are isolated and each only has
> to express what it wants.

What we've been discussing is something like this:

ops.get_cpus -> bpf prog A -> kfunc

where the kfunc will call one of the struct_ops callbacks,
which may call bpf prog A again, since it's the only prog attached
to this get_cpus callback.
So
ops.get_cpus -> bpf prog A -> kfunc -> ops.get_cpus -> bpf prog A.

If kfunc calls a different struct_ops callback it will call
a different bpf prog B and it will have its own private stack.
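
To make the re-entry concrete, here is a minimal user-space sketch of that
chain. It is only a toy model, not kernel code: struct sched_ops,
kfunc_get_cpus(), prog_a() and the "level" argument are all hypothetical
names made up for illustration. The point it shows is that the kfunc only
invokes the callback slot, so with a single prog attached the same prog
runs again one level up.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a struct_ops table with a single callback slot. */
struct sched_ops {
	int (*get_cpus)(struct sched_ops *ops, int want, int level);
};

/* Records the level at which each invocation of prog_a() ran. */
static int call_trace[8];
static int call_count;

/*
 * Stand-in for the kfunc. It does not know which prog is attached;
 * it only invokes the callback on behalf of the next level up.
 */
static int kfunc_get_cpus(struct sched_ops *ops, int want, int level)
{
	return ops->get_cpus(ops, want, level - 1);
}

/* Stand-in for "bpf prog A", the only prog attached to get_cpus. */
static int prog_a(struct sched_ops *ops, int want, int level)
{
	assert(level >= 0);
	if (call_count < 8)
		call_trace[call_count] = level;
	call_count++;

	if (level == 0)		/* root of the toy hierarchy: grant the request */
		return want;

	/* Nothing available locally: ask upward, which re-enters prog_a(). */
	return kfunc_get_cpus(ops, want, level);
}
```

Calling prog_a() at level 1 yields two invocations of the same function,
which is the case where a second private stack area is needed.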

During struct_ops registration, one of the bpf_verifier_ops() callbacks
like bpf_scx_check_member (or a new callback) will indicate
back to the bpf trampoline that limited recursion for a specific
ops.get_cpus is allowed.
Then the bpf trampoline's bpf_trampoline_enter() selector will
pick an entry helper that allows limited recursion.

Currently bpf trampoline doesn't check recursion for struct_ops progs,
so it needs to be tightened to allow limited recursion
and to let bpf jit prologue know which part of priv stack to use.
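
A rough user-space sketch of what such an entry helper could look like,
under stated assumptions: the names limited_recursion_enter() and
limited_recursion_exit(), the limit MAX_NEST and the slice size
PRIV_STACK_SLICE are all hypothetical and not taken from the patch set.
In the kernel the counter would be per-CPU and the slice would be handed
to the jit prologue rather than returned as a pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical limits, chosen only for illustration. */
#define MAX_NEST		2	/* assumed allowed nesting depth */
#define PRIV_STACK_SLICE	512	/* assumed bytes per nesting level */

struct prog_state {
	int active;		/* current nesting depth */
	unsigned long misses;	/* entries rejected for exceeding the limit */
	unsigned char priv_stack[MAX_NEST * PRIV_STACK_SLICE];
};

/*
 * Entry helper with limited recursion: returns the private stack slice
 * for this nesting level, or NULL when the limit is already reached.
 */
static void *limited_recursion_enter(struct prog_state *st)
{
	void *slice;

	if (st->active >= MAX_NEST) {
		st->misses++;
		return NULL;
	}
	slice = st->priv_stack + (size_t)st->active * PRIV_STACK_SLICE;
	st->active++;
	return slice;
}

/* Must pair with a successful limited_recursion_enter(). */
static void limited_recursion_exit(struct prog_state *st)
{
	assert(st->active > 0);
	st->active--;
}
```

With MAX_NEST set to 2, the first two nested entries get disjoint slices
and a third nested entry is refused and counted as a miss.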