On Thu, Sep 14, 2023 at 5:55 PM Paul Moore <paul@xxxxxxxxxxxxxx> wrote: > > On Thu, Sep 14, 2023 at 1:31 PM Andrii Nakryiko > <andrii.nakryiko@xxxxxxxxx> wrote: > > On Wed, Sep 13, 2023 at 2:46 PM Paul Moore <paul@xxxxxxxxxxxxxx> wrote: > > > > > > On Sep 12, 2023 Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote: > > > > > > > > Add new kind of BPF kernel object, BPF token. BPF token is meant to > > > > allow delegating privileged BPF functionality, like loading a BPF > > > > program or creating a BPF map, from privileged process to a *trusted* > > > > unprivileged process, all while have a good amount of control over which > > > > privileged operations could be performed using provided BPF token. > > > > > > > > This is achieved through mounting BPF FS instance with extra delegation > > > > mount options, which determine what operations are delegatable, and also > > > > constraining it to the owning user namespace (as mentioned in the > > > > previous patch). > > > > > > > > BPF token itself is just a derivative from BPF FS and can be created > > > > through a new bpf() syscall command, BPF_TOKEN_CREAT, which accepts > > > > a path specification (using the usual fd + string path combo) to a BPF > > > > FS mount. Currently, BPF token "inherits" delegated command, map types, > > > > prog type, and attach type bit sets from BPF FS as is. In the future, > > > > having an BPF token as a separate object with its own FD, we can allow > > > > to further restrict BPF token's allowable set of things either at the creation > > > > time or after the fact, allowing the process to guard itself further > > > > from, e.g., unintentionally trying to load undesired kind of BPF > > > > programs. But for now we keep things simple and just copy bit sets as is. > > > > > > > > When BPF token is created from BPF FS mount, we take reference to the > > > > BPF super block's owning user namespace, and then use that namespace for > > > > checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN} > > > > capabilities that are normally only checked against init userns (using > > > > capable()), but now we check them using ns_capable() instead (if BPF > > > > token is provided). See bpf_token_capable() for details. > > > > > > > > Such setup means that BPF token in itself is not sufficient to grant BPF > > > > functionality. User namespaced process has to *also* have necessary > > > > combination of capabilities inside that user namespace. So while > > > > previously CAP_BPF was useless when granted within user namespace, now > > > > it gains a meaning and allows container managers and sys admins to have > > > > a flexible control over which processes can and need to use BPF > > > > functionality within the user namespace (i.e., container in practice). > > > > And BPF FS delegation mount options and derived BPF tokens serve as > > > > a per-container "flag" to grant overall ability to use bpf() (plus further > > > > restrict on which parts of bpf() syscalls are treated as namespaced). > > > > > > > > The alternative to creating BPF token object was: > > > > a) not having any extra object and just pasing BPF FS path to each > > > > relevant bpf() command. This seems suboptimal as it's racy (mount > > > > under the same path might change in between checking it and using it > > > > for bpf() command). And also less flexible if we'd like to further > > > > restrict ourselves compared to all the delegated functionality > > > > allowed on BPF FS. > > > > b) use non-bpf() interface, e.g., ioctl(), but otherwise also create > > > > a dedicated FD that would represent a token-like functionality. This > > > > doesn't seem superior to having a proper bpf() command, so > > > > BPF_TOKEN_CREATE was chosen. > > > > > > > > Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx> > > > > --- > > > > include/linux/bpf.h | 36 +++++++ > > > > include/uapi/linux/bpf.h | 39 +++++++ > > > > kernel/bpf/Makefile | 2 +- > > > > kernel/bpf/inode.c | 4 +- > > > > kernel/bpf/syscall.c | 17 +++ > > > > kernel/bpf/token.c | 189 +++++++++++++++++++++++++++++++++ > > > > tools/include/uapi/linux/bpf.h | 39 +++++++ > > > > 7 files changed, 324 insertions(+), 2 deletions(-) > > > > create mode 100644 kernel/bpf/token.c > > > > > > ... > > > > > > > diff --git a/kernel/bpf/token.c b/kernel/bpf/token.c > > > > new file mode 100644 > > > > index 000000000000..f6ea3eddbee6 > > > > --- /dev/null > > > > +++ b/kernel/bpf/token.c > > > > @@ -0,0 +1,189 @@ > > > > +#include <linux/bpf.h> > > > > +#include <linux/vmalloc.h> > > > > +#include <linux/anon_inodes.h> > > > > +#include <linux/fdtable.h> > > > > +#include <linux/file.h> > > > > +#include <linux/fs.h> > > > > +#include <linux/kernel.h> > > > > +#include <linux/idr.h> > > > > +#include <linux/namei.h> > > > > +#include <linux/user_namespace.h> > > > > + > > > > +bool bpf_token_capable(const struct bpf_token *token, int cap) > > > > +{ > > > > + /* BPF token allows ns_capable() level of capabilities */ > > > > + if (token) { > > > > + if (ns_capable(token->userns, cap)) > > > > + return true; > > > > + if (cap != CAP_SYS_ADMIN && ns_capable(token->userns, CAP_SYS_ADMIN)) > > > > + return true; > > > > + } > > > > + /* otherwise fallback to capable() checks */ > > > > + return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN)); > > > > +} > > > > > > While the above looks to be equivalent to the bpf_capable() function it > > > replaces, for callers checking CAP_BPF and CAP_SYS_ADMIN, I'm looking > > > quickly at patch 3/12 and this is also being used to replace a > > > capable(CAP_NET_ADMIN) call which results in a change in behavior. > > > The current code which performs a capable(CAP_NET_ADMIN) check cannot > > > be satisfied by CAP_SYS_ADMIN, but this patchset using > > > bpf_token_capable(token, CAP_NET_ADMIN) can be satisfied by either > > > CAP_NET_ADMIN or CAP_SYS_ADMIN. > > > > > > It seems that while bpf_token_capable() can be used as a replacement > > > for bpf_capable(), it is not currently a suitable replacement for a > > > generic capable() call. Perhaps this is intentional, but I didn't see > > > it mentioned in the commit description, or in the comments, and I > > > wanted to make sure it wasn't an oversight. > > > > You are right. It is an intentional attempt to unify all such checks. > > If you look at bpf_prog_load(), we have this: > > > > if (is_net_admin_prog_type(type) && !capable(CAP_NET_ADMIN) && > > !capable(CAP_SYS_ADMIN)) > > return -EPERM; > > > > So seeing that, I realized that we did have an intent to always use > > CAP_SYS_ADMIN as a "fallback" cap, even for CAP_NET_ADMIN when it > > comes to using network-enabled BPF programs. So I decided that > > unifying all this makes sense. > > > > I'll add a comment mentioning this, I should have been more explicit > > from the get go. > > Thanks for the clarification. I'm not to worried about checking > CAP_SYS_ADMIN as a fallback, but I always get a little twitchy when I > see capability changes in the code without any mention. > > A mention in the commit description is good, and you could also draft > up a standalone patch that adds the CAP_SYS_ADMIN fallback to the > current in-tree code. That would be a good way to really highlight > the capability changes and deal with any issues that might arise > (review, odd corner cases?, etc.) prior to the BPF capability > delegation patcheset we are discussing here. Sure, sounds good, I'll add this as a pre-patch for next revision. > > > > > +#define BPF_TOKEN_INODE_NAME "bpf-token" > > > > + > > > > +/* Alloc anon_inode and FD for prepared token. > > > > + * Returns fd >= 0 on success; negative error, otherwise. > > > > + */ > > > > +int bpf_token_new_fd(struct bpf_token *token) > > > > +{ > > > > + return anon_inode_getfd(BPF_TOKEN_INODE_NAME, &bpf_token_fops, token, O_CLOEXEC); > > > > +} > > > > + > > > > +struct bpf_token *bpf_token_get_from_fd(u32 ufd) > > > > +{ > > > > + struct fd f = fdget(ufd); > > > > + struct bpf_token *token; > > > > + > > > > + if (!f.file) > > > > + return ERR_PTR(-EBADF); > > > > + if (f.file->f_op != &bpf_token_fops) { > > > > + fdput(f); > > > > + return ERR_PTR(-EINVAL); > > > > + } > > > > + > > > > + token = f.file->private_data; > > > > + bpf_token_inc(token); > > > > + fdput(f); > > > > + > > > > + return token; > > > > +} > > > > + > > > > +bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd) > > > > +{ > > > > + if (!token) > > > > + return false; > > > > + > > > > + return token->allowed_cmds & (1ULL << cmd); > > > > +} > > > > > > I mentioned this a while back, likely in the other threads where this > > > token-based approach was only being discussed in general terms, but I > > > think we want to have a LSM hook at the point of initial token > > > delegation for this and a hook when the token is used. My initial > > > thinking is that we should be able to address the former with a hook > > > in bpf_fill_super() and the latter either in bpf_token_get_from_fd() > > > or bpf_token_allow_XXX(); bpf_token_get_from_fd() would be simpler, > > > but it doesn't allow for much in the way of granularity. Inserting the > > > LSM hooks in bpf_token_allow_XXX() would also allow the BPF code to fall > > > gracefully fallback to the system-wide checks if the LSM denied the > > > requested access whereas an access denial in bpf_token_get_from_fd() > > > denial would cause the operation to error out. > > > > I think the bpf_fill_super() LSM hook makes sense, but I thought > > someone mentioned that we already have some generic LSM hook for > > validating mounts? If we don't, I can certainly add one for BPF FS > > specifically. > > We do have security_sb_mount(), but that is a generic mount operation > access control and not well suited for controlling the mount-based > capability delegation that you are proposing here. However, if you or > someone else has a clever way to make security_sb_mount() work for > this purpose I would be very happy to review that code. To be honest, I'm a bit out of my depth here, as I don't know the mounting parts well. Perhaps someone from VFS side can advise. But regardless, I have no problem adding a new LSM hook as well, ideally not very BPF-specific. If you have a specific form of it in mind, I'd be curious to see it and implement it. > > > As for the bpf_token_allow_xxx(). This feels a bit too specific and > > narrow-focused. What if we later add yet another dimension for BPF FS > > and token? Do we need to introduce yet another LSM for each such case? > > [I'm assuming you meant new LSM *hook*] yep, of course, sorry about using terminology sloppily > > Possibly. There are also some other issues which I've been thinking > about along these lines, specifically the fact that the > capability/command delegation happens after the existing > security_bpf() hook is called which makes things rather awkward from a > LSM perspective: the LSM would first need to allow the process access > to the desired BPF op using it's current LSM specific security > attributes (e.g. SELinux security domain, etc.) and then later > consider the op in the context of the delegated access control rights > (if the LSM decides to support those hooks). > > I suspect that if we want to make this practical we would need to > either move some of the token code up into __sys_bpf() so we could > have a better interaction with security_bpf(), or we need to consider > moving the security_bpf() call into the op specific functions. I'm > still thinking on this (lots of reviews to get through this week), but > I'm hoping there is a better way because I'm not sure I like either > option very much. Yes, security_bpf() is happening extremely early and is lacking a lot of context. I'm not sure if moving it around is a good idea as it basically changes its semantics. But adding a new set of coherent LSM hooks per each appropriate BPF operation with good context to make decisions sounds like a good improvement. E.g., for BPF_PROG_LOAD, we can have LSM hook after struct bpf_prog is allocated, bpf_token is available, attributes are sanity checked. All that together is a very useful and powerful context that can be used both by more fixed LSM policies (like SELinux), and very dynamic user-defined BPF LSM programs. But I'd like to keep all that outside of the BPF token feature itself, as it's already pretty hard to get a consensus just on those bits, so complicating this with simultaneously designing a new set of LSM hooks is something that we should avoid. Let's keep discussing this, but not block that on BPF token. > > > But also see bpf_prog_load(). There are two checks, allow_prog_type > > and allow_attach_type, which are really only meaningful in > > combination. And yet you'd have to have two separate LSM hooks for > > that. > > > > So I feel like the better approach is less mechanistically > > concentrating on BPF token operations themselves, but rather on more > > semantically meaningful operations that are token-enabled. E.g., > > protect BPF program loading, BPF map creation, BTF loading, etc. And > > we do have such LSM hooks right now, though they might not be the most > > convenient. So perhaps the right move is to add new ones that would > > provide a bit more context (e.g., we can pass in the BPF token that > > was used for the operation, attributes with which map/prog was > > created, etc). Low-level token LSMs seem hard to use cohesively in > > practice, though. > > Can you elaborate a bit more? It's hard to judge the comments above > without some more specifics about hook location, parameters, etc. So something like my above proposal for a new LSM hook in BPF_PROG_LOAD command. Just right before passing bpf_prog to BPF verifier, we can have err = security_bpf_prog_load(prog, attr, token) if (err) return -EPERM; Program, attributes, and token give a lot of inputs for security policy logic to make a decision about allowing that specific BPF program to be verified and loaded or not. I know how this could be used from BPF LSM side, but I assume that SELinux and others can take advantage of that provided additional context as well. Similarly we can have a BPF_MAP_CREATE-specific LSM hook with context relevant to creating a BPF map. And so on. > > -- > paul-moore.com