On 2025/1/9 19:23, Alexei Starovoitov wrote:
On Mon, Dec 23, 2024 at 4:51 PM Juntong Deng <juntong.deng@xxxxxxxxxxx> wrote:
The main goal is to get rid of run-time mask check in SCX_CALL_OP() and
make it static by the verifier. To make that happen scx_kf_mask flags
would need to become KF_* flags while each struct-ops callback will
specify the expected mask.
Then at struct-ops prog attach time the verifier will see the expected mask
and can check that all kfuncs calls of this particular program
satisfy the mask. Then all of the runtime overhead of
current->scx.kf_mask and scx_kf_allowed() will go away.
Thanks for pointing this out.
Yes, I am interested in working on it.
I will try to solve this problem in a separate patch series.
The following are my thoughts:
Should we really use KF_* to do this? I think KF_* is currently more
like declaring that a kfunc has some kind of attribute, e.g.
KF_TRUSTED_ARGS means that the kfunc only accepts trusted arguments,
rather than being used to categorise kfuncs.
It is not sustainable to restrict the kfuncs that can be used based on
program types, which are coarse-grained. This problem will get worse
as kfuncs increase.
In my opinion, managing the kfuncs available to bpf programs should be
implemented as capabilities. Capabilities are a mature permission model.
We can treat a set of kfuncs as a capability (like the various current
kfunc_sets, but the current kfunc_sets did not carefully divide
permissions).
We should use separate BPF_CAP_XXX flags to manage these capabilities.
For example, SCX may define BPF_CAP_SCX_DISPATCH.
For program types, we should divide them into two levels, types and
subtypes. Types are used to register common capabilities and subtypes
are used to register specific capabilities. The verifier can check if
the used kfuncs are allowed based on the type and subtype of the bpf
program.
I understand that we need to maintain backward compatibility to
userspace, but capabilities are internal changes in the kernel.
Perhaps we can make the current program types as subtypes and
add 'types' that are only used internally, and more subtypes
(program types) can be added in the future.
Sorry for the delay.
imo CAP* approach doesn't fit.
caps are security bits exposed to user space.
Here there is no need to expose anything to user space.
But you're also correct that we cannot extend kfunc KF_* flags
that easily. KF_* flags are limited to 32-bit and we're already
using 12 bits.
enum scx_kf_mask needs 5 bits, so we can squeeze them into
the current 32-bit field _for now_,
but eventually we'd need to refactor kfunc definition into a wider set:
BTF_ID_FLAGS(func, .. KF_*)
so that different struct_ops consumers can define their own bits.
Right now SCX is the only st_ops consumer who needs this feature,
so let's squeeze into the existing KF facility.
First step is to remap scx_kf_mask bits into unused bits in KF_
and annotate corresponding sched-ext kfuncs with it.
For example:
SCX_KF_DISPATCH will become
KF_DISPATCH (1 << 13)
and all kfuncs that are allowed to be called from ->dispatch() callback
will be annotated like:
- BTF_KFUNCS_START(scx_kfunc_ids_dispatch)
- BTF_ID_FLAGS(func, scx_bpf_dispatch_nr_slots)
- BTF_ID_FLAGS(func, scx_bpf_dispatch_cancel)
+ BTF_KFUNCS_START(scx_kfunc_ids_dispatch)
+ BTF_ID_FLAGS(func, scx_bpf_dispatch_nr_slots, KF_DISPATCH)
+ BTF_ID_FLAGS(func, scx_bpf_dispatch_cancel, KF_DISPATCH)
For sched_ext_ops callback annotations, I think,
the simplest approach is to add special
BTF_SET8_START(st_ops_flags)
BTF_ID_FLAGS(func, sched_ext_ops__dispatch, KF_DISPATCH)
and so on for other ops stubs.
sched_ext_ops__dispatch() is an empty function that
exists in the vmlinux, and though it's not a kfunc
we can use it to annotate
(struct sched_ext_ops *)->dispatch() callback
with a particular KF_ flag
(or a set of flags for SCX_KF_RQ_LOCKED case).
Then the verifier (while analyzing the program that is targeted
to be attach to this ->dispatch() hook)
will check this extra KF flag in st_ops
and will only allow to call kfuncs with matching flags:
if (st_ops->kf_mask & kfunc->kf_mask) // ok to call kfunc from this callback
The end result current->scx.kf_mask will be removed
and instead of run-time check it will become static verifier check.
Sorry, I may not have explained my idea clearly.
The "capabilities" I mentioned have nothing to do with userspace.
The "capabilities" I mentioned are conceptual, not referring to the
capabilities in the Linux.
My idea is that a similar "capabilities" mechanism should be used
inside the BPF subsystem (separate).
I think the essence of the problem is that ONE bpf program type can
be used in MANY different contexts, but different contexts can have
different restrictions.
It is reasonable for one bpf program type to be used in different
contexts. There is no need for one bpf program type to be used
in only one context.
But currently the "permission" management of the BPF subsystem is
completely based on the bpf program type, which is a coarse-grained
model (for example, what kfuncs are allowed to be used, which can
be considered as permissions).
As BPF is used in more and more scenarios, and as one bpf program type
is used in more and more different scenarios, the coarse-grained problem
starts to emerge. It is difficult to divide permissions in different
contexts based on a coarse-grained permission model.
This is why I said that the BPF subsystem should have its own
"capabilities" (again, not part of Linux capabilities, and nothing
to do with userspace).
In my opinion, we should separate permission management from bpf program
types. We need an extra layer of abstraction so that we can achieve
fine-grained permission management.
The reason why I have the idea of capabilities is because in my opinion,
bpf programs need application-like permissions management in a sense,
because BPF is generic.
When BPF is applied to other subsystems (e.g. scheduling, security,
accessing information from other subsystems), we need something like
capabilities. Each subsystem can define its own set of bpf capabilities
to restrict the features that can be used by bpf programs in different
contexts, so that bpf programs can only use a subset of features.
Another advantage of this approach is that bpf capabilities do not need
to be tightly placed inside the bpf core, people in other subsystems can
define them externally and add bpf capabilities they need (Adding
KF_FLAGS can be considered as modifying the bpf core, right?).
Of course, maybe one day in the future, we may be able to associate bpf
capabilities with Linux capabilities, maybe system administrators can
choose to open only some of bpf features to certain users, and maybe all
of this can be configured through /sys/bpf.
So, how do we implement this in the verifier? I think registering the
bpf capabilities is not an problem, it is consistent with the current
registration of kfuncs to the bpf program type, we still use
struct btf_kfunc_id_set.
The really interesting part is how we allow people from different
subsystems to change the capabilities of the bpf program in different
contexts under the same bpf program type. My idea is to add a new
callback function in struct bpf_verifier_ops, say bpf_ctx_allowed_cap.
We can pass context information to this callback function, and the
person who implements this callback function can decide to adjust the
current capabilities (add or delete) in different contexts. In the
case of bpf_struct_ops, the context information may be "moff".
In my opinion, capabilities are a flexible and extensible approach.
All it takes is adding a layer of abstraction to decouple permissions
from program types.
Of course, there are more other technical details that need to be
figured out, and if you think the bpf capabilities is an interesting
idea worth trying, I will try to write a minimal POC and send an
RFC PATCH.
(Actually I have always wanted to write a POC but in the last two weeks
I have been busy with my university stuff and really didn't have the
time, but now I finally have some time to try it)
Yes, maybe I am a bit too forward thinking, at the moment only SCX has
hit this "wall", and at the moment we can indeed solve it directly with
KF_FLAGS or filters.
But I think the problem will persist and come up again in other
scenarios, and maybe we can try something new and maybe not just
solve the SCX problem.
Does this make sense?
Maybe we can have more discussion.
Many thanks.