On Tue, Nov 28, 2023 at 04:05:36PM -0800, Andrii Nakryiko wrote:
> On Mon, Nov 27, 2023 at 11:06 AM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
> >
> > Add new kind of BPF kernel object, BPF token. BPF token is meant to
> > allow delegating privileged BPF functionality, like loading a BPF
> > program or creating a BPF map, from privileged process to a *trusted*
> > unprivileged process, all while having a good amount of control over which
> > privileged operations could be performed using provided BPF token.
> >
> > This is achieved through mounting BPF FS instance with extra delegation
> > mount options, which determine what operations are delegatable, and also
> > constraining it to the owning user namespace (as mentioned in the
> > previous patch).
> >
> > BPF token itself is just a derivative from BPF FS and can be created
> > through a new bpf() syscall command, BPF_TOKEN_CREATE, which accepts BPF
> > FS FD, which can be attained through open() API by opening BPF FS mount
> > point. Currently, BPF token "inherits" delegated command, map types,
> > prog type, and attach type bit sets from BPF FS as is. In the future,
> > having a BPF token as a separate object with its own FD, we can allow
> > to further restrict BPF token's allowable set of things either at the
> > creation time or after the fact, allowing the process to guard itself
> > further from unintentionally trying to load undesired kind of BPF
> > programs. But for now we keep things simple and just copy bit sets as is.
> >
> > When BPF token is created from BPF FS mount, we take reference to the
> > BPF super block's owning user namespace, and then use that namespace for
> > checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN}
> > capabilities that are normally only checked against init userns (using
> > capable()), but now we check them using ns_capable() instead (if BPF
> > token is provided). See bpf_token_capable() for details.
> >
> > Such setup means that BPF token in itself is not sufficient to grant BPF
> > functionality. User namespaced process has to *also* have necessary
> > combination of capabilities inside that user namespace. So while
> > previously CAP_BPF was useless when granted within user namespace, now
> > it gains a meaning and allows container managers and sys admins to have
> > a flexible control over which processes can and need to use BPF
> > functionality within the user namespace (i.e., container in practice).
> > And BPF FS delegation mount options and derived BPF tokens serve as
> > a per-container "flag" to grant overall ability to use bpf() (plus further
> > restrict on which parts of bpf() syscalls are treated as namespaced).
> >
> > Note also, BPF_TOKEN_CREATE command itself requires ns_capable(CAP_BPF)
> > within the BPF FS owning user namespace, rounding up the ns_capable()
> > story of BPF token.
> >
> > Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
> > ---
> >  include/linux/bpf.h            |  41 +++++++
> >  include/uapi/linux/bpf.h       |  37 ++++++
> >  kernel/bpf/Makefile            |   2 +-
> >  kernel/bpf/inode.c             |  17 ++-
> >  kernel/bpf/syscall.c           |  17 +++
> >  kernel/bpf/token.c             | 209 +++++++++++++++++++++++++++++++++
> >  tools/include/uapi/linux/bpf.h |  37 ++++++
> >  7 files changed, 350 insertions(+), 10 deletions(-)
> >  create mode 100644 kernel/bpf/token.c
> >
> > [...]
>
> > +int bpf_token_create(union bpf_attr *attr)
> > +{
> > +	struct bpf_mount_opts *mnt_opts;
> > +	struct bpf_token *token = NULL;
> > +	struct user_namespace *userns;
> > +	struct inode *inode;
> > +	struct file *file;
> > +	struct path path;
> > +	struct fd f;
> > +	umode_t mode;
> > +	int err, fd;
> > +
> > +	f = fdget(attr->token_create.bpffs_fd);
> > +	if (!f.file)
> > +		return -EBADF;
> > +
> > +	path = f.file->f_path;
> > +	path_get(&path);
> > +	fdput(f);
> > +
> > +	if (path.dentry != path.mnt->mnt_sb->s_root) {
> > +		err = -EINVAL;
> > +		goto out_path;
> > +	}
> > +	if (path.mnt->mnt_sb->s_op != &bpf_super_ops) {
> > +		err = -EINVAL;
> > +		goto out_path;
> > +	}
> > +	err = path_permission(&path, MAY_ACCESS);
> > +	if (err)
> > +		goto out_path;
> > +
> > +	userns = path.dentry->d_sb->s_user_ns;
> > +	/*
> > +	 * Enforce that creators of BPF tokens are in the same user
> > +	 * namespace as the BPF FS instance. This makes reasoning about
> > +	 * permissions a lot easier and we can always relax this later.
> > +	 */
> > +	if (current_user_ns() != userns) {
> > +		err = -EPERM;
> > +		goto out_path;
> > +	}

I should note that the reason I'm saying it makes reasoning about
permissions easier is that this here guarantees that:

file->f_cred->user_ns == file->f_path.dentry->d_sb->s_user_ns

So cases where you would need to check that you have permissions in the
opener's userns are equivalent to checking permissions in the token's
and therefore bpffs' userns.

>
> Hey Christian,
>
> I've added stricter userns check as discussed on previous revision,
> and a few lines above fixed BPF FS root check (path.dentry !=
> path.mnt->mnt_sb->s_root). Hopefully that addresses the remaining
> concerns you've had.
>
> I'd appreciate it if you could take another look to double check if
> I'm not messing anything up, and if it all looks good, can I please
> get an ack from you? Thank you!

I'll take a look.
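
For readers following along, the userns guarantee noted above can be
illustrated with a small userspace toy model. This is only a sketch and
not kernel code: struct user_ns, struct toy_file, struct toy_token and
the toy_* functions are hypothetical stand-ins invented for illustration,
and -1 stands in for -EPERM. It models just the current_user_ns() !=
userns check, assuming the token file's credentials are those of the
caller that creates the token.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct user_namespace; identity is the pointer. */
struct user_ns { int id; };

/* Hypothetical stand-in for the token's struct file. */
struct toy_file {
	const struct user_ns *opener_ns; /* models file->f_cred->user_ns */
	const struct user_ns *sb_ns;     /* models ...->d_sb->s_user_ns */
};

/* Hypothetical stand-in for struct bpf_token. */
struct toy_token {
	const struct user_ns *userns;    /* namespace used for capability checks */
};

/*
 * Models only the namespace check quoted above: refuse to create a
 * token unless the caller's userns is the one owning the BPF FS
 * instance. Returns 0 on success, -1 (standing in for -EPERM) otherwise.
 */
static int toy_token_create(const struct user_ns *current_ns,
			    const struct user_ns *sb_ns,
			    struct toy_token *tok, struct toy_file *file)
{
	if (current_ns != sb_ns)
		return -1;

	tok->userns = sb_ns;
	file->opener_ns = current_ns;
	file->sb_ns = sb_ns;
	return 0;
}

/* True when the opener's, the superblock's and the token's userns coincide. */
static bool toy_invariant_holds(const struct toy_token *tok,
				const struct toy_file *file)
{
	return file->opener_ns == file->sb_ns && tok->userns == file->sb_ns;
}

/* Small demonstration; returns 0 if the model behaves as described. */
static int toy_demo(void)
{
	struct user_ns host = { 0 }, container = { 1 };
	struct toy_token tok = { NULL };
	struct toy_file file = { NULL, NULL };

	/* Caller in a different userns than the bpffs owner: rejected. */
	if (toy_token_create(&host, &container, &tok, &file) != -1)
		return 1;

	/* Caller in the owning userns: accepted, and the invariant holds. */
	if (toy_token_create(&container, &container, &tok, &file) != 0)
		return 1;
	assert(toy_invariant_holds(&tok, &file));
	return 0;
}
```

In this model every token that gets created satisfies the invariant by
construction, so a permission check against the opener's userns and one
against the token's userns cannot disagree. Relaxing the equality check
later would mean revisiting every place that relies on it.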