On Fri, Jan 24, 2025 at 2:45 PM Juntong Deng <juntong.deng@xxxxxxxxxxx> wrote:
>
> On 2025/1/24 04:52, Alexei Starovoitov wrote:
> > On Thu, Jan 16, 2025 at 11:47 AM Juntong Deng
> > <juntong.deng@xxxxxxxxxxx> wrote:
> >>
> >> This patch modifies SCX to use BPF capabilities.
> >>
> >> Make all SCX kfuncs register to BPF capabilities instead of
> >> BPF_PROG_TYPE_STRUCT_OPS.
> >>
> >> Add bpf_scx_bpf_capabilities_adjust as bpf_capabilities_adjust
> >> callback function.
> >>
> >> Signed-off-by: Juntong Deng <juntong.deng@xxxxxxxxxxx>
> >> ---
> >>  kernel/sched/ext.c | 74 ++++++++++++++++++++++++++++++++++++++--------
> >>  1 file changed, 62 insertions(+), 12 deletions(-)
> >>
> >> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> >> index 7fff1d045477..53cc7c3ed80b 100644
> >> --- a/kernel/sched/ext.c
> >> +++ b/kernel/sched/ext.c
> >> @@ -5765,10 +5765,66 @@ bpf_scx_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> >>          }
> >>  }
> >
> > 'capabilities' name doesn't fit.
> > The word already has its meaning in the kernel.
> > It cannot be reused for a different purpose.
> >
> >> +static int bpf_scx_bpf_capabilities_adjust(unsigned long *bpf_capabilities,
> >> +                                           u32 context_info, bool enter)
> >> +{
> >> +        if (enter) {
> >> +                switch (context_info) {
> >> +                case offsetof(struct sched_ext_ops, select_cpu):
> >> +                        ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_SELECT_CPU);
> >> +                        ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_ENQUEUE);
> >> +                        break;
> >> +                case offsetof(struct sched_ext_ops, enqueue):
> >> +                        ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_ENQUEUE);
> >> +                        break;
> >> +                case offsetof(struct sched_ext_ops, dispatch):
> >> +                        ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_DISPATCH);
> >> +                        break;
> >> +                case offsetof(struct sched_ext_ops, running):
> >> +                case offsetof(struct sched_ext_ops, stopping):
> >> +                case offsetof(struct sched_ext_ops, enable):
> >> +                        ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_REST);
> >> +                        break;
> >> +                case offsetof(struct sched_ext_ops, init):
> >> +                case offsetof(struct sched_ext_ops, exit):
> >> +                        ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_UNLOCKED);
> >> +                        break;
> >> +                default:
> >> +                        return -EINVAL;
> >> +                }
> >> +        } else {
> >> +                switch (context_info) {
> >> +                case offsetof(struct sched_ext_ops, select_cpu):
> >> +                        DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_SELECT_CPU);
> >> +                        DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_ENQUEUE);
> >> +                        break;
> >> +                case offsetof(struct sched_ext_ops, enqueue):
> >> +                        DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_ENQUEUE);
> >> +                        break;
> >> +                case offsetof(struct sched_ext_ops, dispatch):
> >> +                        DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_DISPATCH);
> >> +                        break;
> >> +                case offsetof(struct sched_ext_ops, running):
> >> +                case offsetof(struct sched_ext_ops, stopping):
> >> +                case offsetof(struct sched_ext_ops, enable):
> >> +                        DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_REST);
> >> +                        break;
> >> +                case offsetof(struct sched_ext_ops, init):
> >> +                case offsetof(struct sched_ext_ops, exit):
> >> +                        DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_UNLOCKED);
> >> +                        break;
> >> +                default:
> >> +                        return -EINVAL;
> >> +                }
> >> +        }
> >> +        return 0;
> >> +}
> >
> > and this callback defeats the whole point of u32 bitmask.
>
> Yes, you are right, I agree that procedural callbacks defeat the purpose
> of BPF capabilities.
>
> > In earlier patch
> >
> > env->context_info = __btf_member_bit_offset(t, member) / 8; // moff
> >
> > is also wrong.
> > The context_info name is too generic and misleading,
> > and 'env' isn't the right place to save moff.
> >
> > Let's try to implement what was discussed earlier:
> >
> > 1.
> > After successful check_struct_ops_btf_id() save moff in
> > prog->aux->attach_st_ops_member_off.
> >
> > 2.
> > Add a .filter callback to the sched-ext kfunc registration path and
> > let it allow/deny kfuncs based on the st_ops attach point.
> >
> > 3.
> > Remove scx_kf_allow() and current->scx.kf_mask.
> >
> > That will be a nice perf win and will prove that
> > this approach works end-to-end.
>
> I am trying, but I found a problem (bug?) when I added test cases
> to bpf_testmod.c.
>
> Filters currently do not work with kernel modules.
>
> Filters rely heavily on a check like (bpf_fs_kfunc_set_ids as an example)
>
> if (!btf_id_set8_contains(&bpf_fs_kfunc_set_ids, kfunc_id))
>
> to exclude kfuncs that are not part of their own set
> (__btf_kfunc_id_set_contains() runs all registered filters for each
> kfunc); otherwise they would cause false rejects.
>
> But this method cannot be used in kernel modules because the BTF ids
> of all kfuncs are relocated.
>
> The BTF ids of all kfuncs in a kernel module are relocated by
> btf_relocate_id() in btf_populate_kfunc_set().
>
> This results in the kfunc_id passed into the filter being different
> from the BTF id in set_ids.
>
> One possible solution is to export btf_relocate_id() and
> btf_get_module_btf() and let the kernel module do the relocation
> itself, but I am not sure exporting them is a good idea.
>
> Do you have any suggestions?

That's not a problem to fix today.
Currently only sched-ext needs this new ->filter() logic
and it's builtin.
Let's prototype steps 1, 2, 3 end-to-end and see
whether anything big is missing.
The lack of bpf_testmod coverage can be addressed later
if the whole approach works for sched-ext.
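For step 2 the filter can follow the existing bpf_fs_kfuncs_filter()
pattern. A rough, completely untested sketch for the select_cpu set
(the aux field name is the placeholder proposed in step 1):

static int scx_kfunc_ids_select_cpu_filter(const struct bpf_prog *prog,
                                           u32 kfunc_id)
{
        /* Not in this set: stay neutral, other sets have their own filters. */
        if (!btf_id_set8_contains(&scx_kfunc_ids_select_cpu, kfunc_id))
                return 0;

        /*
         * These kfuncs are callable from ops.select_cpu() only.
         * The attach point is known at load time from the moff
         * saved in step 1, so no runtime mask check is needed.
         */
        if (prog->aux->attach_st_ops_member_off ==
            offsetof(struct sched_ext_ops, select_cpu))
                return 0;

        return -EACCES;
}

A real version would also need to check that the prog is attached to
sched_ext_ops in the first place, and the other sets (enqueue, dispatch,
rest, unlocked) would get analogous filters whose accepted offsets mirror
the current scx_kf_allow() mask rules, e.g. the enqueue set also
accepting the select_cpu offset, per the table in the patch above.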
> In addition, BTF_KFUNC_FILTER_MAX_CNT is currently 16, which is not
> a large enough size.
>
> If we use filters to enforce restrictions on struct_ops for different
> contexts, then each different context needs its own filter.
>
> All filters for scenarios using struct_ops (SCX, HID, TCP congestion
> control, etc.) are placed in the same struct btf_kfunc_hook_filter
> (the filters array).
>
> It is foreseeable that the 16 slots will be exhausted soon.
>
> Should we change it to a linked list?

No. Don't fix what is not broken.
We have a concrete run-time inefficiency on the sched-ext side;
let's address that first and deal with everything else later.
Not the other way around.
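And to be concrete about step 1, since neither 'context_info' nor 'env'
is the right home for it: once check_struct_ops_btf_id() has resolved
the member, saving the attach point is roughly (sketch; aux field name
as proposed above):

        u32 moff = __btf_member_bit_offset(t, member) / 8;

        env->prog->aux->attach_st_ops_member_off = moff;

With the attach point available to every filter at load time, the
per-callback current->scx.kf_mask updates behind scx_kf_allow() can be
removed entirely, which is where the perf win comes from.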