On Fri, Jan 24, 2025 at 2:45 PM Juntong Deng <juntong.deng@xxxxxxxxxxx> wrote:
>
> On 2025/1/24 04:52, Alexei Starovoitov wrote:
> > On Thu, Jan 16, 2025 at 11:47 AM Juntong Deng
> > <juntong.deng@xxxxxxxxxxx> wrote:
> >>
> >> This patch modifies SCX to use BPF capabilities.
> >>
> >> Make all SCX kfuncs register to BPF capabilities instead of
> >> BPF_PROG_TYPE_STRUCT_OPS.
> >>
> >> Add bpf_scx_bpf_capabilities_adjust as bpf_capabilities_adjust
> >> callback function.
> >>
> >> Signed-off-by: Juntong Deng <juntong.deng@xxxxxxxxxxx>
> >> ---
> >>  kernel/sched/ext.c | 74 ++++++++++++++++++++++++++++++++++++++--------
> >>  1 file changed, 62 insertions(+), 12 deletions(-)
> >>
> >> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> >> index 7fff1d045477..53cc7c3ed80b 100644
> >> --- a/kernel/sched/ext.c
> >> +++ b/kernel/sched/ext.c
> >> @@ -5765,10 +5765,66 @@ bpf_scx_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> >>          }
> >>  }
> >
> > 'capabilities' name doesn't fit.
> > The word already has its meaning in the kernel.
> > It cannot be reused for a different purpose.
> >
> >> +static int bpf_scx_bpf_capabilities_adjust(unsigned long *bpf_capabilities,
> >> +                                           u32 context_info, bool enter)
> >> +{
> >> +        if (enter) {
> >> +                switch (context_info) {
> >> +                case offsetof(struct sched_ext_ops, select_cpu):
> >> +                        ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_SELECT_CPU);
> >> +                        ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_ENQUEUE);
> >> +                        break;
> >> +                case offsetof(struct sched_ext_ops, enqueue):
> >> +                        ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_ENQUEUE);
> >> +                        break;
> >> +                case offsetof(struct sched_ext_ops, dispatch):
> >> +                        ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_DISPATCH);
> >> +                        break;
> >> +                case offsetof(struct sched_ext_ops, running):
> >> +                case offsetof(struct sched_ext_ops, stopping):
> >> +                case offsetof(struct sched_ext_ops, enable):
> >> +                        ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_REST);
> >> +                        break;
> >> +                case offsetof(struct sched_ext_ops, init):
> >> +                case offsetof(struct sched_ext_ops, exit):
> >> +                        ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_UNLOCKED);
> >> +                        break;
> >> +                default:
> >> +                        return -EINVAL;
> >> +                }
> >> +        } else {
> >> +                switch (context_info) {
> >> +                case offsetof(struct sched_ext_ops, select_cpu):
> >> +                        DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_SELECT_CPU);
> >> +                        DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_ENQUEUE);
> >> +                        break;
> >> +                case offsetof(struct sched_ext_ops, enqueue):
> >> +                        DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_ENQUEUE);
> >> +                        break;
> >> +                case offsetof(struct sched_ext_ops, dispatch):
> >> +                        DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_DISPATCH);
> >> +                        break;
> >> +                case offsetof(struct sched_ext_ops, running):
> >> +                case offsetof(struct sched_ext_ops, stopping):
> >> +                case offsetof(struct sched_ext_ops, enable):
> >> +                        DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_REST);
> >> +                        break;
> >> +                case offsetof(struct sched_ext_ops, init):
> >> +                case offsetof(struct sched_ext_ops, exit):
> >> +                        DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_UNLOCKED);
> >> +                        break;
> >> +                default:
> >> +                        return -EINVAL;
> >> +                }
> >> +        }
> >> +        return 0;
> >> +}
> >
> > and this callback defeats the whole point of u32 bitmask.
>
> Yes, you are right, I agree that procedural callbacks defeat the purpose
> of BPF capabilities.
>
> > In earlier patch
> >
> > env->context_info = __btf_member_bit_offset(t, member) / 8; // moff
> >
> > is also wrong.
> > The context_info name is too generic and misleading,
> > and 'env' isn't the right place to save moff.
> >
> > Let's try to implement what was discussed earlier:
> >
> > 1.
> > After successful check_struct_ops_btf_id() save moff in
> > prog->aux->attach_st_ops_member_off.
> >
> > 2.
> > Add a .filter callback to the sched-ext kfunc registration path and
> > let it allow/deny kfuncs based on the st_ops attach point.
> >
> > 3.
> > Remove scx_kf_allow() and current->scx.kf_mask.
> >
> > That will be a nice perf win and will prove that
> > this approach works end-to-end.
>
> I am trying, but I found a problem (bug?) when I added test cases
> to bpf_testmod.c.
>
> Filters currently do not work with kernel modules.
>
> Filters rely heavily on a check like (bpf_fs_kfunc_set_ids as an example)
>
> if (!btf_id_set8_contains(&bpf_fs_kfunc_set_ids, kfunc_id))
>
> to exclude kfuncs that are not part of their own set
> (__btf_kfunc_id_set_contains() runs all registered filters for each
> kfunc); otherwise they would cause false rejects.
>
> But this method cannot be used in kernel modules because the BTF ids
> of all kfuncs are relocated.
>
> The BTF ids of all kfuncs in a kernel module are relocated by
> btf_relocate_id() in btf_populate_kfunc_set().
>
> This results in the kfunc_id passed into the filter being different
> from the BTF id in set_ids.
>
> One possible solution is to export btf_relocate_id() and
> btf_get_module_btf() and let the kernel module do the relocation
> itself, but I am not sure exporting them is a good idea.
>
> Do you have any suggestions?

That's not a problem to fix today.
Currently only sched-ext needs this new ->filter() logic
and it's builtin.
Let's prototype steps 1, 2, 3 end-to-end and see
whether anything big is missing.
The lack of bpf_testmod coverage can be addressed later
if the whole approach works for sched-ext.
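For step 2 the filter can follow the existing bpf_fs_kfuncs_filter()
pattern. A rough, completely untested sketch for the select_cpu set
(the aux field name is the placeholder proposed in step 1):

static int scx_kfunc_ids_select_cpu_filter(const struct bpf_prog *prog,
                                           u32 kfunc_id)
{
        /* Not in this set: stay neutral, other sets have their own filters. */
        if (!btf_id_set8_contains(&scx_kfunc_ids_select_cpu, kfunc_id))
                return 0;

        /*
         * These kfuncs are callable from ops.select_cpu() only.
         * The attach point is known at load time from the moff
         * saved in step 1, so no runtime mask check is needed.
         */
        if (prog->aux->attach_st_ops_member_off ==
            offsetof(struct sched_ext_ops, select_cpu))
                return 0;

        return -EACCES;
}

A real version would also need to check that the prog is attached to
sched_ext_ops in the first place, and the other sets (enqueue, dispatch,
rest, unlocked) would get analogous filters whose accepted offsets mirror
the current scx_kf_allow() mask rules, e.g. the enqueue set also
accepting the select_cpu offset, per the table in the patch above.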
> In addition, BTF_KFUNC_FILTER_MAX_CNT is currently 16, which is not
> a large enough size.
>
> If we use filters to enforce restrictions on struct_ops for different
> contexts, then each different context needs its own filter.
>
> All filters for scenarios using struct_ops (SCX, HID, TCP congestion
> control, etc.) are placed in the same struct btf_kfunc_hook_filter
> (the filters array).
>
> It is foreseeable that the 16 slots will be exhausted soon.
>
> Should we change it to a linked list?

No. Don't fix what is not broken.
We have a concrete run-time inefficiency on the sched-ext side;
let's address that first and deal with everything else later.
Not the other way around.
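And to be concrete about step 1, since neither 'context_info' nor 'env'
is the right home for it: once check_struct_ops_btf_id() has resolved
the member, saving the attach point is roughly (sketch; aux field name
as proposed above):

        u32 moff = __btf_member_bit_offset(t, member) / 8;

        env->prog->aux->attach_st_ops_member_off = moff;

With the attach point available to every filter at load time, the
per-callback current->scx.kf_mask updates behind scx_kf_allow() can be
removed entirely, which is where the perf win comes from.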